To be completed by the go-between with help from the RI-Rep.
Cover the stages of the data life-cycle in which the RI is involved, that pertain to the <topic> with references to more detail if the RI has them. Include quantitative and timeliness information, intended uses and so on - if such information is available.
Insert a summary of the main requirements for this RI for the current topic. Point out any unusual features, and comment on the extent to which these requirements are fixed or evolving.
1. What granularity do your RI’s data products have?
We store a time series of each variable in a simulation run at a given sampling frequency (yearly, monthly, daily, sub-daily). Spatially, we cover (1) the globe by grid points and (2) regions such as Europe, Africa…
2. How are the data products of your RI stored - as separate “static” files, in a database system, or a combination?
Metadata catalogue; data on disk by variable (ESGF), some data on tape (LTA).
3. How does your RI treat the “versioning” of data - are older datasets simply replaced by updates, or are several versions kept accessible in parallel? How do you identify different versions of the same dataset?
ESGF: several versions are kept in parallel on some reference nodes. Versions are applied at the dataset level; a dataset contains several files pertaining to a given variable or set of variables. New versions are stored in a new directory. LTA: version information is part of the metadata (MD).
4. Is it important to your data users that
Yes.
Yes, it does already.
Some Metadata only.
Yes.
5. Is your RI currently using, or planning to use, a standardized system based on persistent digital identifiers (PIDs) for:
6. Please indicate the kind of identifier system that you are using - e.g. Handle-based (EPIC or DOI), UUIDs or your own RI-specific system?
EPIC and DOI.
7. If you are using Handle-based PIDs, are these handles pointing to “landing pages”? Are these pages maintained by your RI or an external organization (like the data centre used for archiving)?
Landing pages are maintained by DKRZ.
8. Are costs associated with PID allocation and maintenance (of landing pages etc.) specified in your RI’s operational cost budget?
Yes
1. How does your “designated scientific community” (typical data users) primarily use your data products? As input for modelling, or for comparisons?
As climate model input, for analysis and for comparison.
2. Does your primary user community traditionally refer to the datasets it uses in publications:
DOIs are available for the most important data products, such as CMIP5 and CORDEX. The data are ready to be cited in the reference section, but it is not yet usual to do so.
3. Is it important to your data users to be able to refer to specific subsets of the data sets in their citation? Examples:
We recommend citing a dataset collection and specifying the subset used in the text. The above-mentioned subsets are possible in any combination, as is combining specific subsets over multiple dataset collections, i.e. citation entities.
4. Is it important to be able to refer to many separate datasets in a collective way, e.g. having a collection of “all data” from your RI represented by one single DOI?
See question 3: a collection is suitable for use in a reference list, keeping the balance between data and paper citations.
5. What strategy does your RI have for collecting information about the usage of your data products?
Downloads/access requests: by number and volume, with continental information on user origins (for DKRZ, visualised on the DKRZ website).
References in scientific literature, Scientific “impact”: establish data references as part of the scientific record.
6. Who receives credit when a dataset from your RI is cited?
The creator(s) as specified by the data originator; creators might be persons or institutions.
7. What steps in tooling, automation and presentation do you consider necessary to improve take up of identification and citation facilities and to reduce the effort required for supporting those activities?
Not mentioned above is the identification of creators by PIDs such as ORCID, or the relation/connection to a scientific publication. Earth System Sciences data is of high volume; therefore, data is hosted at established archival centers. Certificates such as the Data Seal of Approval (DSA) and World Data System (WDS) approval are of growing importance. Usually we have so-called ‘stand-alone’ data publications that are not directly connected to, or supplementing, an article. Most of the data users publishing articles are not the data creators.
We are currently working on a stable and reliable way to cite dynamic data (CMIP6) in a federated data infrastructure.
| Go-between | Yin Chen |
|---|---|
| RI representative | Name |
| Period of requirements collection | From start month year to end month year |
| Status | |
Add additional rows to the above table if you have covered this topic with this RI by holding discussions with several people, or if you have delegated some discussions; to show the full authorship and duration.