To be completed by the go-between with help from the RI representative.
Cover the stages of the data life-cycle in which the RI is involved, that pertain to the <topic> with references to more detail if the RI has them. Include quantitative and timeliness information, intended uses and so on - if such information is available.
Insert a summary of the main requirements for this RI for the current topic. Point out any unusual features, and comment on the extent to which these requirements are fixed or evolving.
1. Data processing desiderata: input
i. What data are to be processed? What are their:
ii. How is the data made available to the analytics phase? By file, by web (stream/protocol), etc.
Normally the data are made available to the analytics phase through a local or mounted file system. A separate data-import step is responsible for populating the input data pool.
iii. Please provide concrete examples of data.
Temperature and precipitation under various scenarios, generated by different climate models. Statistics are computed to compare characteristics of the different climate models or climate indices, characterizing the performance of each individual climate model.
2. Data processing desiderata: analytics
i. Computing needs quantification:
Most processing would benefit from a parallel map-reduce phase, in which pre-processing is first done in a distributed fashion near the data, reducing the amount of data to be transferred. Thereafter, more complex shared-disk or shared-memory parallel analytics is done on the outputs of the map-reduce phase.
Some analysis use cases can benefit from shared-memory and distributed-memory parallelism to accelerate time to solution. Note also that some analysis phases are well suited to a parallel approach (for example, one process per model).
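The map-reduce pattern described above can be illustrated with a minimal sketch. It is not the RI's implementation: the function names and the input layout (a dictionary mapping a model name to a list of values) are assumptions made for illustration, and a thread pool stands in for whatever process pool or cluster scheduler a real deployment would use. The map step reduces each model's data to a small summary near the data; the reduce step combines the summaries.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_partial_stats(values):
    """Map step: reduce one model's values to (count, sum, sum of squares)."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss


def merge_stats(a, b):
    """Reduce step: combine two partial summaries element-wise."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def ensemble_mean_and_variance(model_outputs, max_workers=4):
    """Compute the mean and population variance across all models.

    model_outputs is a hypothetical layout: {model_name: [float, ...]}.
    One task is submitted per model, mirroring "one process per model".
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(map_partial_stats, model_outputs.values()))
    n, s, ss = reduce(merge_stats, partials)
    mean = s / n
    variance = ss / n - mean * mean  # population variance
    return mean, variance
```

Only the three-number summaries leave each worker, which is the point of doing the pre-processing near the data: the volume transferred no longer depends on the size of the model output.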
ii. Process implementation:
iii. Do you use batch or interactive processing? Both.
iv. Do you use a monitoring console? Yes.
v. Do you use a black box or a workflow for processing?
The choice of workflow engine is project- or framework-specific, e.g. proprietary workflow wrappers, dispel4py.
vi. Please provide concrete examples of processes to be supported/currently in use;
Simple: subsetting of data, statistics such as means, downscaling of data, interpolation of data, calculation of climate indices (ENSO, NAO, PDO, etc.).
Complex: vegetation modeling, geographical mosquito dispersal.
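As a rough illustration of the simple processes listed above, the following sketch subsets a regular latitude-longitude grid to a box, computes a cosine-latitude-weighted mean, and derives an anomaly index as the difference from a reference climatology. All names and the data layout (`field[i][j]` indexed by latitude then longitude) are hypothetical, and real index definitions such as those for ENSO or NAO involve specific regions and procedures not reproduced here.

```python
import math


def subset_box(lats, lons, field, lat_range, lon_range):
    """Return (lat, lon, value) triples for grid points inside the box."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    return [
        (lat, lon, field[i][j])
        for i, lat in enumerate(lats) if lat_min <= lat <= lat_max
        for j, lon in enumerate(lons) if lon_min <= lon <= lon_max
    ]


def area_weighted_mean(points):
    """Mean weighted by cos(latitude), an approximation for regular grids."""
    weights = [math.cos(math.radians(lat)) for lat, _, _ in points]
    total = sum(w * v for w, (_, _, v) in zip(weights, points))
    return total / sum(weights)


def box_index(lats, lons, field, climatology, lat_range, lon_range):
    """Anomaly index: box mean of the field minus box mean of a climatology."""
    current = area_weighted_mean(
        subset_box(lats, lons, field, lat_range, lon_range))
    reference = area_weighted_mean(
        subset_box(lats, lons, climatology, lat_range, lon_range))
    return current - reference
```

The cosine weighting matters because grid cells on a regular latitude-longitude grid shrink towards the poles; an unweighted mean would over-represent high latitudes.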
3. Data processing desiderata: output
i. What data are produced? Please provide:
ii. How are analytics outcomes made available? By different means: some outputs are available only to a researcher or research group on a file system, while others are published in catalogues and accessible via the web or Python notebooks, for example.
4. Statistical questions
i. Is the data collected with a distinct question/hypothesis in mind? Or is simply something being measured?
Data is collected according to the requirements and pre-defined characteristics defined for climate model intercomparison projects.
5. Will questions/hypotheses be generated or refined (broadened or narrowed in scope) after the data has been collected? (N.B. Such activity would not be good statistical practice)
The requirements and characteristics are refined after every round of model intercomparison projects, both to improve the next round and to react to the new possibilities that new technical infrastructures provide (e.g. improved processing power to support larger ensembles and finer resolution in models).
6. Statistical data
i. Does the question involve analysing the responses of a single set of data (univariate) to other predictor variables or are there multiple response data (bi or multivariate data)? Depending on analysis activity.
ii. Is the data continuous or discrete? Discrete.
iii. Is the data bounded in some form (i.e. what is the possible range of the data)?
Data represent several hundred physical quantities (temperature, precipitation, wind speed, etc.) and in that sense are bounded by physical laws.
iv. Typically, approximately how many data points are there?
Data are stored on grid points covering the entire Earth system influencing the climate (atmosphere, ocean, sea ice, land…), so there are several thousand data points.
7. Statistical data analysis
i. Is it desired to work within a statistics or data mining paradigm? (N.B. the two can and indeed should overlap!)
Statistics are very important in climate analysis, as we are looking for robust and significant signals.
ii. Is it desired that there is some sort of outlier/anomaly assessment? Yes, but this is difficult to achieve at petabyte scale.
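One possible way to make outlier assessment tractable on data too large to hold in memory is to stream it in chunks. The sketch below is an assumption for illustration, not a method used by the RI: it uses Welford's online algorithm to accumulate the mean and standard deviation in a first pass, then flags values beyond a z-score threshold in a second pass. The names `RunningStats`, `flag_outliers` and the default threshold are hypothetical.

```python
import math


class RunningStats:
    """Welford's online algorithm: mean and variance in constant memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Sample standard deviation (0.0 when fewer than two values)."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_outliers(chunks, threshold=3.0):
    """Return values whose z-score exceeds the threshold.

    chunks is a re-iterable sequence of lists, standing in for data read
    piece by piece from storage; it is traversed twice.
    """
    stats = RunningStats()
    for chunk in chunks:  # pass 1: accumulate statistics
        for x in chunk:
            stats.update(x)
    if stats.std == 0.0:
        return []
    return [  # pass 2: flag values far from the mean
        x for chunk in chunks for x in chunk
        if abs(x - stats.mean) / stats.std > threshold
    ]
```

A global z-score is a crude criterion for climate fields, which vary strongly with location and season; the sketch only shows that the statistics themselves can be accumulated without loading the full dataset.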
iii. Are you interested in a statistical approach which rejects null hypotheses (frequentist) or generates probable belief in a hypothesis (Bayesian approach) or do you have no real preference?
More detail is needed. A priori, yes. The range of scientific analyses done using the data of our RI is very large, but those complex analyses are usually done within the scientific teams, not by the RI itself.
| Go-between | Yin Chen |
|---|---|
| RI representative | Sylvie Joussaume <sylvie.joussaume@lsce.ipsl.fr>, Francesca Guglielmo <francesca.guglielmo@lsce.ipsl.fr> |
| Period of requirements collection | Oct-Nov 2015 |
| Status | Completed |
Add additional rows to the above table if you have covered this topic with this RI by holding discussions with several people, or if you have delegated some discussions; to show the full authorship and duration.