
QUESTIONS

Version 2: 18 August 2015
Cristina-Adriana Alexandru, Rosa Filgueira Vincente, Alex Vermeulen, Keith Jeffery, Thomas Loubrieu, Leonardo Candela, Paul Martin, Barbara Magagna, Yin Chen and Malcolm Atkinson

 

Starting remark:

Our RI IS-ENES runs a distributed, federated data infrastructure based on a few (3-4) main data centres and various associated smaller ones.

In the answers below we refer to the climate modelling community, to two data dissemination systems (ESGF during project run time; LTA for long-term archiving), and to two climate modelling data projects: CMIP5 (2010-2015) and CMIP6 (2016-2021).

 

ENVRIplus topics (the ENVRIplus development areas address the following topics):

1.     Identification and citation

2.     Curation

3.     Cataloguing

4.     Processing

5.     Provenance

6.     Optimization

7.     Community support

 

A. Generic questions (for all topics)

1.     What is the basic purpose of your RI, technically speaking?

a.     Could you describe one or several basic use-cases involving interaction with the RI that cover topics 1-7

 

Use case 1: Data producers submit model data results to IS-ENES data nodes. Data is quality-checked and published in the IS-ENES/ESGF data infrastructure. All data items are uniquely identified. Data is archived long term. Data aggregates (at experiment level) are assigned DOIs, which end users cite in scientific publications. DOI-assigned data aggregates are published in various metadata catalogues, e.g. in world data centres for climate.

 

Use case 2: An end user of the IS-ENES data infrastructure encounters problems (technical or scientific). The end user contacts the IS-ENES/ESGF user support (organized as first- and second-level support, with second-level support internationally distributed). General problems are documented on an FAQ site.

 

Use case 3: An end user wants to process large amounts of data. There are three ways to do this:

A) Download and process at home institute. This is supported via a bulk data download and synchronization tool for IS-ENES/ESGF sites.

 

B) Contact a large IS-ENES/ESGF site that already has the required data available (replicated from other sites) and process there (personal interaction is necessary to get an account and permission at the site). This is supported by the user support service.

 

C) Contact a web processing service or a portal providing (parts of) the requested analysis functionality. This is supported by the IS-ENES climate4impact portal (https://climate4impact.eu/) as well as by IS-ENES/ESGF web processing services (not yet fully in production).

 

 

 

Some more detailed IS-ENES use cases were submitted to the RDA (Research Data Alliance) Data Fabric interest group as well as Data Repository interest group and are available at:

https://rd-alliance.org/enes-data-federation-use-case.html and https://rd-alliance.org/climate-data-analytics-use-case.html .

 

b.     Could you describe how data is acquired, curated and made available to users?

 

Data is generated by climate modelling groups (as well as by some climate observational studies relevant for climate model intercomparison projects). Data is post-processed according to the standards and agreements of the intercomparison project (e.g. CMIP, CORDEX). Data is ingested at IS-ENES/ESGF data nodes and quality-controlled (checking intercomparison project conventions and standards). As a next step, data is published to the IS-ENES/ESGF data infrastructure. Publication makes metadata available and searchable and data accessible via IS-ENES portals (as well as via APIs). Important data products are replicated to dedicated long-term archival centres. There, additional quality checks are run as a prerequisite for DOI assignment and availability for DOI-based data citation.

 

c.     Could you describe the software and computational environments involved?

 

The post-processing of the data according to the standards and conventions of intercomparison projects is supported by a community tool (CMOR). The infrastructure is based on a large international open source community (the Earth System Grid Federation, ESGF), which develops the individual components (security, catalogues, data access services, portal parts, etc.). The computational environments are more heterogeneous and organized locally at sites according to site-specific constraints. Some computational facilities are integrated as part of the ESGF nodes and portals (simple sub-setting and visualization) or IS-ENES portals interfacing with the IS-ENES data infrastructure (e.g. the climate4impact portal). Larger computational services are exposed via Web Processing Services – this part is not yet in production and needs technical development as well as future organizational/policy agreements.
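
As an illustration of the CMOR-based post-processing step, the sketch below rewrites one monthly temperature field into a CF/project-compliant file. It is schematic only: the table and metadata file names are hypothetical, and argument details vary between CMOR versions and projects.

```python
import cmor
import numpy as np

# Schematic CMOR usage; table/metadata file names are hypothetical and
# call details vary by CMOR version. Consult the CMOR documentation.
cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE)
cmor.dataset_json("experiment_metadata.json")   # model/experiment attributes
cmor.load_table("CMIP6_Amon.json")              # project table defining 'tas'

lat = np.linspace(-88.75, 88.75, 72)
lon = np.linspace(1.25, 358.75, 144)

itime = cmor.axis(table_entry="time", units="days since 2000-01-01",
                  coord_vals=np.array([15.5]),
                  cell_bounds=np.array([0.0, 31.0]))
ilat = cmor.axis(table_entry="latitude", units="degrees_north", coord_vals=lat)
ilon = cmor.axis(table_entry="longitude", units="degrees_east", coord_vals=lon)

varid = cmor.variable(table_entry="tas", units="K",
                      axis_ids=[itime, ilat, ilon])
cmor.write(varid, 280.0 + np.zeros((1, 72, 144)))   # one monthly field
cmor.close()
```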

 

 

d.     What are the responsibilities of the users who are involved in this use case?

 

Data producers:

Deliver data (and metadata) according to the rules and regulations of the corresponding Model Intercomparison Project (defining a kind of data management plan).

Inform data publishers about new versions and versioning related information.

 

Data publishers:

Publish data according to defined “best practices” agreed upon in the data federation.

Provide contact information in case of operational problems at the site.

Inform federation about operational issues (down times etc.).

 

Data users:

Provide citation information in published work based on the data.

 

e.     Do you have any use case involving interactions with other RIs (in the same or different domains?)

 

Data from IS-ENES is replicated to EUDAT for data curation purposes (long-term archival). IS-ENES data is harvested by the EUDAT metadata catalogue (B2FIND). Integration of other EUDAT services (e.g. B2Drop) is foreseen to support cross-community data usage.

 

f.       What datasets are available for sharing with other RIs? Under what conditions are they available?

 

Mostly model data generated for Model Intercomparison Projects, e.g. CMIP5, CORDEX. Also some observational data used for intercomparison analysis activities, e.g. obs4MIPs. The diversity will grow during the next phase of intercomparison projects currently starting (CMIP6).

 

2.     Apart from datasets, does your RI also bring to ENVRIplus and/or other RIs:

a.     Software? In this case, is it open source?

 

All the components of the IS-ENES/ESGF data infrastructure are based on an international open source effort, the Earth System Grid Federation (ESGF). All the software is open source (https://github.com/esgf). The activities to provide future data-near processing functionality are also organized as open source projects (see e.g. https://github.com/bird-house, with documentation at http://birdhouse.readthedocs.org/en/latest/, as well as the climate4impact WPS activities).

 

b.     Computing resources (for running datasets through your software or software on your datasets)?

 

No, except for testing & prototyping.

 

c.     Access to instrumentation/detectors or lab equipment? If so, what are the open-access conditions? Are there any bilateral agreements?

 

No (not applicable).

 

d.     Users/expertise to provide advice on various topics?

 

On request we are happy to support and provide advice on the basis of our running environment (ESGF).

 

e.     Access to related scholarly publications?

 

No.

 

f.       Access to related grey literature (e.g. technical reports)?

 

We support a website with information on our RI: https://is.enes.org

 

3.     What objectives would you like to achieve through participation to ENVRIplus?

 

Better understanding of interdisciplinary use cases and end user requirements.

A look at practices beyond the horizon of our community, for example the sharing of data management best practices.

 

4.     What services do you expect ENVRIplus technology to provide?

 

Service and data catalogues for comparison of our model data to other data (e.g. observations).

 

5.     What plans does your RI already have for data, its management and exploitation?

a.     Are you using any particular standard(s)?

  i. Strengths and weaknesses

 

Community-specific standards for data formatting and access are used:

netCDF-CF (the Climate and Forecast conventions), the OPeNDAP data access protocol, and THREDDS catalogues. Metadata is also (partially) exposed as ISO 19139 conforming documents.
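For example, a CF/netCDF dataset published behind a THREDDS server can be opened remotely over OPeNDAP so that only the requested subset is transferred (the URL below is a placeholder for a real dodsC endpoint):

```python
from netCDF4 import Dataset

# Placeholder URL; real datasets live under a THREDDS server's /dodsC/ path.
url = "http://example-node.org/thredds/dodsC/cmip5/tas_Amon_example.nc"
ds = Dataset(url)              # opened via the OPeNDAP protocol
tas = ds.variables["tas"]      # CF standard_name: air_temperature
print(tas.units, tas.shape)
subset = tas[0, :10, :10]      # only this slice is fetched over the network
ds.close()
```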

 

b.     Are you using any particular software(s)?

  i. Strengths and weaknesses

 

Federated Solr/Lucene indices to provide consistent data search across IS-ENES portals.

THREDDS servers for data access (developed by Unidata). Globus GridFTP for large data transfers.
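The federated indices are exposed through the ESGF search REST API; below is a minimal sketch of querying it (the index-node host name is a placeholder, the parameters are standard ESGF search facets):

```python
import requests

# Placeholder host; an ESGF index node exposes /esg-search/search.
params = {
    "project": "CMIP5",
    "variable": "tas",
    "time_frequency": "mon",
    "format": "application/solr+json",   # Solr-style JSON response
    "limit": 5,
}
r = requests.get("https://esgf-index.example.org/esg-search/search",
                 params=params)
for doc in r.json()["response"]["docs"]:
    print(doc["id"])                     # matching dataset identifiers
```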

 

c.     Are you considering changing the current:

  i. standard(s)

 

No.

 

  ii. software

 

The software is in continuous evolution, especially in response to security incidents in the past. Work in progress includes, among other things, better automatic installation. The software is composed of a galaxy of components; depending on requirements (scientific or operational) we have opportunities to evolve individual components.

 

  iii. working practices

 

Because of the difficulty of providing stable operational procedures across an internationally distributed data federation (supported via different local funding streams), an operations team was formed to support CMIP6 data management in the data federation. This team will define best practices and supervise the operational data management activities at the sites.

  as part of a future plan?

    Please provide documentation/links for all the above which apply.

Operations Team terms of reference document: https://docs.google.com/document/d/1oRWqxtWWEfsucTVhk0G3bMqHC0BL4dJwADrOG8Ukj-g/edit

 

6.     What part of your RI needs to be improved in order:

a.     For the RI to achieve its operational goals?

 

The ability to share best practices as quickly as new nodes join the RI federation.

 

b.     For you to be able to do your work?

 

Data-near processing functionality has to be provided to A) reduce the download volumes from sites and B) give end users a means to work with a worldwide-distributed climate data archive in the petabyte range.

 

7.     Do topics [1-6] cross-link with your data management plan?

a.     If so please provide the documentation/links

 

See e.g. the CORDEX data management plan, the CMIP6 data management preparation documents, and the WDCC documents.

 

8.     Does your RI have non-functional constraints for data handling and exploitation?  For example:

a.     Capital costs

b.     Maintenance costs

c.     Operational costs

d.     Security

e.     Privacy

f.       Computational environment in which your software runs

g.     Access for scrutiny and public review

 

If so please provide the documentation/links

 

The total annual operating cost of the infrastructure is estimated at 1560 k€.


 

9.     Do you have an overall approach to security and access?

 

Yes – the data infrastructure supports single sign-on across multiple portals as well as authorization based on membership of various “projects”.

 

10.           Are your data, software and computational environment subject to an open-access policy?

 

CORDEX data are in general available for both commercial and research purposes. Some modelling centres decided to restrict the use of their data to “non-commercial research and educational purposes.”

https://github.com/IS-ENES-Data/cordex/blob/9fa582a72c38ad13738885c1aeadc764bc3700fa/CORDEX_register.xlsx

The access to CMIP5 data is unrestricted except for the data from Japanese modeling centres, which are subject to similar restrictions as above:

http://cmip-pcmdi.llnl.gov/cmip5/availability.html

 

 

11.           What are the big open problems for your RI pertinent to handling and exploiting your data?

 

Handling the volume and distribution of data (multi-petabyte range): replication, versioning.

Providing related information for data products (provenance, user comments, usage, detailed scientific descriptions needed for usage).

 

12.           Are you interested in any particular topic [1-6] to discuss in more detail?

a.     If so, would you like us to arrange a follow up interview with more detail questions about any particular topic to be discussed?

 

Topic 4 (Processing) – see below.

 

 

Optional: If you are not the right person to reply to some questions from the above, please suggest the right person to contact from your RI for those questions.

 

 

B. Specific questions per topic

  1. Identification and Citation
    a. Identification

i.       What granularity do your RI’s data products have:

  • Content-wise (all parameters together, or separated e.g. by measurement category)?
  • Temporally (yearly, monthly, daily, or other)?
  • Spatially (by measurement station, region, country or all together)?

 

We store a time series of each variable in a simulation run at a given sampling frequency (yearly, monthly, daily, sub-daily). Spatially we cover 1) the globe by grid points and 2) regions such as Europe, Africa, etc.

 

ii.       How are the data products of your RI stored - as separate “static” files, in a database system, or a combination?

 

A combination: a metadata catalogue, with data stored on disk by variable (ESGF) and some data on tape (LTA).

 

iii.       How does your RI treat the “versioning” of data - are older datasets simply replaced by updates, or are several versions kept accessible in parallel? How do you identify different version of the same dataset?

 

ESGF: several versions are kept in parallel on some reference nodes. Versions are applied at the dataset level and contain several files pertaining to a given variable or set of variables. New versions are stored in new directories. LTA: version information is part of the metadata.
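
For illustration, in the CMIP5-style Data Reference Syntax (DRS) a new version appears as a new vYYYYMMDD directory alongside the old one; the path below uses example values:

```python
# Example DRS-style path (institute/model/experiment values are illustrative):
path = ("/cmip5/output1/MPI-M/MPI-ESM-LR/historical/mon/atmos/Amon/r1i1p1/"
        "v20120315/tas/tas_Amon_MPI-ESM-LR_historical_r1i1p1_185001-200512.nc")

# The vYYYYMMDD component identifies which dataset version a file belongs to.
version = next(p for p in path.split("/")
               if p.startswith("v") and p[1:].isdigit())
print(version)  # -> v20120315
```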

 

iv.       Is it important to your data users that

  • Every digital data object is tagged with a unique & persistent digital identifier (PID)?

 

Yes.

  • The metadata for data files contains checksum information for the objects?

 

Yes, it does already.
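
As a sketch of how such checksum metadata can be used, a client might verify a downloaded file against the value published in the catalogue (the file name and catalogue_entry are hypothetical):

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a checksum of a potentially large file in 1 MiB chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the checksum recorded in the catalogue metadata:
# assert file_checksum("tas_Amon_example.nc") == catalogue_entry["checksum"]
```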

 

  • Metadata (including any documentation about the data object contents) is given its own persistent identifier?

 

Some Metadata only.

  • Metadata and data objects can be linked persistently by means of PIDs?

 

Yes.

 

v.       Is your RI currently using, or planning to use, a standardized system based on persistent digital identifiers (PIDs) for:

  • “Raw” sensor data?

 

n/a

  • Physical samples? 

 

n/a

  • Data undergoing processing (QA/QC etc.)?

 

Yes.

  • Finalized “publishable” data?

 

Yes.

 

vi.       Please indicate the kind of identifier system that you are using - e.g. Handle-based (EPIC or DOI), UUIDs or your own RI-specific system?

 

EPIC and DOI.

 

vii.       If you are using Handle-based PIDs, are these handles pointing to “landing pages”? Are these pages maintained by your RI or an external organization (like the data centre used for archiving)?

 

Landing pages maintained by DKRZ.
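
Both EPIC handles and DOIs resolve through the Handle System; below is a minimal sketch of looking up a landing page via the public Handle REST API (the handle value itself is hypothetical):

```python
import requests

# Hypothetical handle; hdl.handle.net exposes the public Handle REST API.
handle = "21.14100/example-dataset-id"
r = requests.get(f"https://hdl.handle.net/api/handles/{handle}")
for value in r.json().get("values", []):
    if value["type"] == "URL":
        print("landing page:", value["data"]["value"])
```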

 

viii.       Are costs associated with PID allocation and maintenance (of landing pages etc.) specified in your RI’s operational cost budget?

 

Yes.

 

  b. Citation

i.       How does your “designated scientific community” (typical data users) primarily use your data products? As input for modelling, or for comparisons?

 

As climate model input, for analysis and for comparison.

 

ii.       Does your primary user community traditionally refer to datasets they use in publications:

  • By providing information about producer, year, report number if available, title or short description in the running text (e.g. under Materials and Methods)?
  • By adding information about producer, year, report number if available, title or short description in the References section?
  • By DOIs, if available, in the References section?
  • By using other information?
  • By providing the data as supplementary information, either complete or via a link

 

DOIs are available for the most important data products like CMIP5 and CORDEX. Data is ready to be cited in the reference section, but it is not yet usual to do so.

 

iii.       Is it important to your data users to be able to refer to specific subsets of the data sets in their citation? Examples:

  • Date and time intervals
  • Geographic selection
  • Specific parameters or observables

 

We recommend citing a dataset collection and specifying the used subset in the text. The above-mentioned subsets are possible in any combination, as is combining specific subsets over multiple dataset collections, i.e. citation entities.

 

iv.       Is it important to be able to refer to many separate datasets in a collective way, e.g. having a collection of “all data” from your RI represented by one single DOI?

 

See iii: A collection is suitable to be used in a reference list to keep the balance between data and paper citations.

 

v.       What strategy does your RI have for collecting information about the usage of your data products?

  • Downloads/access requests
  • Visualization at your own data portal
  • Visualization at other data portals
  • References in scientific literature
  • References in non-scientific literature

  • Scientific “impact”

 

Downloads/access requests: by number and volume, with continent-level information on user origins (for DKRZ, visualised on the DKRZ website).
References in scientific literature, scientific “impact”: we aim to establish data references as part of the scientific record.

 

vi.       Who receives credit when a dataset from your RI is cited?

  • The RI itself
  • The RI’s institutional partners (all or in part, depending on the dataset contents)
  • Experts in the RI’s organization (named individuals)
  • “Principal investigators” in charge of measurements or data processing (named individuals)
  • Staff (scientists, research engineers etc.) performing the measurements or data processing (named individuals)

 

The creator(s) as specified by the data originator; creators might be persons or institutions.

 

What steps in tooling, automation and presentation do you consider necessary to improve take up of identification and citation facilities and to reduce the effort required for supporting those activities?

 

Not mentioned above is the identification of creators by PIDs such as ORCID, or the relation/connection to a scientific publication. Earth system science data is of high volume; therefore data is hosted at established archival centres. Certifications such as the Data Seal of Approval (DSA) and World Data System (WDS) approval are of growing importance. Usually we have so-called ‘stand-alone’ data publications, not directly connected or supplementary to an article. Most data users publishing articles are not identical with the data creators.

We are currently working on a stable and reliable way to cite dynamic data (CMIP6) in a federated data infrastructure.

 

 

 

  2. Curation

 

a.     Will the responsibility for your RI’s curation activities be shared with other organisations?

It is already a shared approach between various climate data centres and climate modelling centres.

 

b.     Does the curation cover datasets only or also:

i.       Software?

ii.       Operating environment?

iii.       Specifications/documentation?

 

All of the above (for ii, only meta-information on the environment).

 

c.     What is your curation policy on retaining/discarding

i.       Datasets?

 

Final project data: retained for more than 10 years.

 

ii.       Software?

 

Mostly no policy

 

iii.       Operating environments?

 

Mostly no policy

 

iv.       Documents?

 

When needed – actually we do not always have the time to keep everything up to date.

 

d.     How will data accessibility be maintained for the long term? E.g. What is your curation policy regarding media migration?

 

RI policies depend on the specific policies of the service centres providing LTA services: e.g. new tapes after 5 years at DKRZ.

 

e.     Do you track all curation activities with a logging system?

 

At DKRZ most activities are logged but not systematically. It depends on the site in question.

 

f.       What metadata standards do you use for providing

i.       Discovery,

 

ISO, DIF, DC, THREDDS.

 

ii.       Contextualisation (including rights, privacy, security, quality, suitability...)

 

Strongly dependent on the project.

 

iii.       Detailed access-level (i.e. connecting software to data within an operating environment)?

 

n/a

 

Please supply documentation
 

g.     If you curate software how do you do it? Preserving the software or a software specification?

 

Software is kept in SVN/GitHub, with ad hoc storage by some data centres.

 

h.    What provisions will you make for curating workflows or other processing procedures/protocols?

 

Storing provenance logs as part of workflow outputs is foreseen for some data evaluation workflow chains.

 

i.       If you curate the operating environment how do you do it? Preserving the environment or an environment specification?

 

Just specifications like OS, compiler, hardware, libraries, etc.

 

j.        What steps in tooling, automation and presentation do you consider necessary to improve take up of curation facilities and to reduce the effort required for curation?

 

Better unification of policies and interfaces, and better adherence to them.

 

  3. Cataloguing

a.     Do you use catalogues or require using catalogues for the following items?

i.       Observation system

ii.       Data processing system

iii.       Observation event and collected sample

iv.       Data processing event

v.       Data product

 

Metadata catalogue for specification of data products.

 

vi.       Paper or report product

vii.       Research objects or feature interest (e.g. site, taxa, ...)

viii.       Services (processing, discovery, access, retrieval, publishing, visualization, etc.)

 

b.     For each used or required catalogue, consider the following questions:

i.       Item descriptions:

  • Which fields do you use for describing items?

 

ESGF: use metadata; in preparation: citation metadata.
LTA: use metadata, citation metadata, contacts, rights, access & storage; in preparation: provenance.

 

  • Which standards do you apply for these fields (format or standard model)?

 

ISO, DIF, DC, etc.

  • Do you use controlled vocabularies for these fields? If so, please cite the vocabulary service providers.

 

ESGF: netCDF-CF (www.cfconventions.org) and lists in a central repository (a remake is in progress: github.com/ES-DOC/esdoc-cim-cv, ES-DOC/esdoc-cv and others).

LTA: just internal lists.

  • Do you maintain a cross-link or inter-links between:
    • Catalogue items (e.g. between observation system -   observation event - result dataset)?

 

We are working on cross-links between data and simulation metadata and model metadata.

LTA: cross-links to various publications (with DOIs).

  • Fields for item description and actual items (e.g. between dataset fields and dataset access services, or between sample fields and the label on the sample)?
  • Which repositories/software do you use to manage your metadata?

 

CMS: Plone, CKAN.

LTA: Oracle DB, JavaSP.
ESGF: Lucene indexing, PostgreSQL DB.

ii.       Inputs:

  • Human inputs:   Do you provide/need facilities for editors/reviewers to maintain the metadata in the catalogues (e.g. forms, validation workflow, etc.)? If so, please describe them briefly.

 

LTA: Oracle SQL Developer.

  • Machine inputs:   Do you use/ need automated harvesting to populate your catalogues? If so, which protocol do you use (e.g. csw, oai-pmh, other, specific)?

 

ESGF: 1) automatic metadata harvesting from netCDF file headers into the DB; 2) a Lucene/Solr cloud for metadata aggregation and presentation to users; 3) external harvesting from the DB is possible.
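
A minimal sketch of the first step, harvesting global attributes from a netCDF header for ingestion into a metadata database (attribute names are project-dependent):

```python
from netCDF4 import Dataset

def harvest_global_attributes(path):
    """Collect global netCDF attributes and variable names for a metadata DB."""
    with Dataset(path) as ds:
        record = {name: getattr(ds, name) for name in ds.ncattrs()}
        record["variables"] = sorted(ds.variables)
    return record

# For a CMIP5-style file this yields e.g. record["experiment_id"],
# record["model_id"], record["institute_id"].
```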

  • How do you manage duplicates? I.e. do you apply governance rules in a network of catalogues, do you use unique identifiers, or take other actions?

 

Checksums and unique IDs.

iii.       Human outputs:

  • What specific feature is provided/required in your web discovery function (multi-criteria search, graphical selection components (e.g. map, calendar), facets, keyword or natural language)?

 

ESGF: faceted search.

  • Do you evaluate the accessibility, quality, and usage of your catalogue by using a dashboard or value-added products? If so, do you provide/need:
    • Popularity or Usage feedback?
    • Any other synthesis, indicators or statistics?

 

No.

If so, please describe them shortly.

  • Is the catalogue freely readable or do you apply a specific authorization scheme? If you are applying a specific authorization scheme, please cite the authentication system (SSO) and the authorization rules.

 

All Metadata are free and open.

 

iv.       Machine outputs:

  • Do you provide/need machine interfaces for accessing your catalogues? If so, which protocols are implemented?

 

Metadata via OAI-PMH: ISO, DC, DIF, etc.
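
For illustration, an OAI-PMH harvest is a plain HTTP request; the sketch below lists record identifiers in Dublin Core from a hypothetical endpoint:

```python
import requests
import xml.etree.ElementTree as ET

# Hypothetical endpoint; verb and metadataPrefix are standard OAI-PMH.
base = "https://catalogue.example.org/oai"
r = requests.get(base, params={"verb": "ListRecords",
                               "metadataPrefix": "oai_dc"})
root = ET.fromstring(r.content)
OAI = "{http://www.openarchives.org/OAI/2.0/}"
for identifier in root.iter(OAI + "identifier"):
    print(identifier.text)       # one identifier per harvested record
```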

  • Do you need to fit applicable regulatory requirements (e.g. INSPIRE) or embedding frameworks (GEOSS, OBIS)? If so, please cite the regulation, applicable interface requirements, and impacts (format, performance, access policy, etc.) on your catalogues.

 

Partially, e.g. we provide metadata as part of the World Data Center federation.

 

 

  4. Processing

a.     Data processing desiderata: input

i.       What data are to be processed? What are their:

  • Typologies

 

Hierarchical collections of data, characterized by entries from controlled vocabularies.

  • Volume

 

Very high volume: individual file sizes range from megabytes to gigabytes, but processing is normally done at the collection level, involving multi-terabyte input collections.

  • Velocity

 

Low velocity – data collections grow in a controlled manner, and new versions of existing data products become available in the data federation.

  • Variety

 

Very low – data is based on highly structured data items (well-defined binary data types representing multi-dimensional data entities, e.g. netCDF). Data entities are organized in well-structured hierarchies (structured according to time, variables, project characterization, etc.).

 

ii.       How is the data made available to the analytics phase? By file, by web (stream/protocol), etc.

 

Normally the data is made available to analytics based on a local or mounted file system. A separate data-import step is responsible for filling up the input data pool.

 

iii.       Please provide concrete examples of data.

 

Temperature and precipitation fields for various scenarios, generated by different climate models. Statistics are computed to compare characteristics of the different climate models, or climate indices characterizing individual climate model performance.

 

b.     Data processing desiderata: analytics

i.       Computing needs quantification:

  • How many processes do you need to execute?

 

Highly dependent on the use case. Computing is more I/O-bound than processor-bound.

  • How much time does each process take/should take?

 

Also very use-case dependent: some multi-model analytics may run for days on a small cluster, others for minutes. As before, the time depends more on data access characteristics.

  • To what extent processing is or can be done in parallel?

 

Most processing would benefit from a parallel map-reduce phase, in which distributed data-near pre-processing is done first, reducing the amount of data to be transferred. Thereafter, more complex shared-disk/shared-memory parallel analytics is done on the parts produced by the map-reduce phase.

Some analysis use cases can benefit from shared-memory and distributed-memory parallelism to accelerate time to solution. Note also that some analysis phases are well suited to a parallel approach (such as one process per model, for example); a sketch of this pattern follows.
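
A minimal sketch of the per-model "map" step using Python multiprocessing (model names, file lists and the reduction statistic are placeholders):

```python
from multiprocessing import Pool

def preprocess(item):
    """'Map' step: a data-near, per-model reduction. The body is a
    placeholder; real code would read netCDF files and compute e.g.
    a time-mean field, shrinking the data before any transfer."""
    model, files = item
    return model, len(files)          # stand-in for a real statistic

# Hypothetical per-model input file lists.
models = {
    "MODEL-A": ["ta_196001.nc", "ta_196002.nc"],
    "MODEL-B": ["ta_196001.nc"],
}

if __name__ == "__main__":
    with Pool() as pool:
        reduced = dict(pool.map(preprocess, models.items()))
    # 'Reduce' step: cross-model comparison on the small intermediate results.
    print(reduced)
```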

 

ii.       Process implementation:

  • What do you use in terms of:
    • Programming languages?

 

Python, R, C, C++, Fortran

 

  • Platform (hardware, software)?

 

Linux clusters, mostly open source software basis

 

  • Specific software requirements?
  • What standards need to be supported (e.g. WPS) for each of the above?

 

Data-near processing for IS-ENES/ESGF sites is based on the OGC WPS standard.
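
A minimal client-side sketch against such a service using OWSLib, which implements the OGC WPS protocol (the endpoint URL is a placeholder):

```python
from owslib.wps import WebProcessingService

# Placeholder endpoint; the constructor fetches the WPS capabilities.
wps = WebProcessingService("https://wps.example.org/wps")
print(wps.identification.title)
for process in wps.processes:
    print(process.identifier, "-", process.title)

# A process would then be run via wps.execute(identifier, inputs=[...]).
```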

  • Is there a possibility to inject proprietary/user defined algorithms/processes for each of the above?

 

Yes – by contributing to open source data analytics software projects of various kinds (UV-CDAT, Birdhouse, ESMValTool, etc.).

 

  • Do you use a sandbox to test and tune the algorithm/process for each of the above?

 

Yes – concrete test procedure is project dependent.

 

iii.       Do you use batch or interactive processing?

 

Both.

 

iv.       Do you use a monitoring console?

 

Yes.

 

v.       Do you use a black box or a workflow for processing?

  • If you use a workflow for processing, could you indicate which one (e.g. Taverna, Kepler, proprietary, etc.)

 

The choice of workflow engine is analysis-project or framework specific, e.g. proprietary workflow wrappers or dispel4py.

  • Do you reuse sub-processes across processes?

 

Analysis-project dependent; mostly not.

 

vi.       Please provide concrete examples of processes to be supported/currently in use;

 

Simple: subsetting of data; means and other statistics; downscaling of data; interpolation of data; calculation of climate indices (ENSO, NAO, PDO, etc.).

Complex: vegetation modelling, geographical mosquito dispersal.
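
As a concrete example of a simple climate index calculation, here is a simplified ENSO (Niño 3.4) index computed as an area-mean SST anomaly over 5°S-5°N, 170°W-120°W; the arrays are illustrative, and real code would weight grid cells by area:

```python
import numpy as np

def nino34_index(sst, lat, lon, climatology):
    """Simplified Nino 3.4 index: mean SST anomaly over 5S-5N, 170W-120W.
    sst and climatology are (time, lat, lon) arrays with matching shapes
    (the climatology already expanded along the time axis)."""
    lat_box = (lat >= -5.0) & (lat <= 5.0)
    lon_box = (lon >= 190.0) & (lon <= 240.0)        # degrees east
    anomaly = sst - climatology
    box = anomaly[:, lat_box][:, :, lon_box]
    return box.mean(axis=(1, 2))                     # one value per time step
```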

c.     Data processing desiderata: output

i.       What data are produced? Please provide:

  • Typologies

 

Various: netCDF files, graphics, text, logs

  • Volume

 

Various; normally orders of magnitude smaller than the input data.

  • Velocity

 

High – depending on analysis activity

  • Variety

 

High

ii.       How are analytics outcomes made available?

 

By different means: some outcomes stay with the researcher or research group on a file system; some outputs are published in catalogues and are accessible via the web or, for example, a Python notebook.

 

d.     Statistical questions

i.       Is the data collected with a distinct question/hypothesis in mind? Or is something simply being measured?

 

Data is collected according to the requirements and pre-defined characteristics defined for climate model intercomparison projects.

 

e.     Will questions/hypotheses be generated or refined (broadened or narrowed in scope) after the data has been collected? (N.B. Such activity would not be good statistical practice)

 

The requirements and characteristics are refined after every round of model intercomparison projects, to improve the next round and to react to the new possibilities that new technical infrastructures provide (e.g. improved processing power to support larger ensembles and finer model resolution).


 

f.       Statistical data 

i.       Does the question involve analysing the responses of a single set of data (univariate) to other predictor variables or are there multiple response data (bi or multivariate data)?

 

Depending on analysis activity.

ii.       Is the data continuous or discrete?

 

Discrete.

iii.       Is the data bounded in some form (i.e. what is the possible range of the data)?

 

The data represent several hundred physical quantities (temperature, precipitation, wind speed, etc.) and in that sense are bounded by physical laws.

iv.       Typically how many datums approximately are there?

 

Data are stored on grid points covering all the Earth system components influencing the climate (atmosphere, ocean, sea ice, land), so there are several thousand data points.

 

g.     Statistical data analysis

i.       Is it desired to work within a statistics or data mining paradigm? (N.B. the two can and indeed should overlap!)

 

Statistics are very important in climate analysis, as we are looking for robust and significant signals.

ii.       Is it desired that there is some sort of outlier/anomaly assessment?

 

Yes – but difficult to achieve at the petabyte scale.

iii.       Are you interested in a statistical approach which rejects null hypotheses (frequentist) or generates probable belief in a hypothesis (Bayesian approach) or do you have no real preference?

 

This needs more detail; a priori, yes. The range of scientific analyses done using the data of our RI is very large, but those complex analyses are usually done within the scientific teams, not by the RI itself.

 

  5. Provenance

a.     Do you already have data provenance recording in your RI?

 

Yes, depending on the data analysis activity

 

If so:

 

b.     Where/when do you need it, e.g., in the data processing workflows, data collection/curation procedures, versioning control in the repositories etc.?

 

Mostly in data collection procedures as well as data processing workflows

 

c.     What systems are you using?

 

Community tools, e.g. to manage what has been collected from where and what the overall transfer status is, or to generate provenance log files in workflows.

 

d.     What standards are you using?

i.       Advantages/disadvantages

 

No standard so far; first experiments toward the use of PROV-O are under way in a specific analysis project.
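
For orientation, here is a minimal provenance record in the W3C PROV model, written with the Python prov package (the namespace and identifiers are illustrative):

```python
from prov.model import ProvDocument

# An analysis activity that used an input dataset and generated a product.
doc = ProvDocument()
doc.add_namespace("enes", "https://example.org/enes/")  # hypothetical namespace

inp = doc.entity("enes:tas_Amon_input_v20120315")
out = doc.entity("enes:tas_annual_mean_v1")
run = doc.activity("enes:annual-mean-run-42")

doc.used(run, inp)
doc.wasGeneratedBy(out, run)
doc.wasDerivedFrom(out, inp)

print(doc.get_provn())   # PROV-N; RDF (PROV-O) serialization is also available
```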

 

ii.       Have you ever heard about the PROV-O standard?

Yes.

 

e.     Do you need provenance tracking?

i.       If so, which information should be contained?

 

Input data characteristics (names, characterizing facets, checksums, unique IDs), tools used (git/SVN tags), output files, timing information, and platform/environment information.

 

f.       What information do you need to record regarding the following:

i.       Scientific question and working hypothesis?

 

The data has been produced following a very detailed experimental protocol. We need to collect all the information needed to assess how exactly the protocol has been followed (facets, controlled vocabularies, documentation: es-doc.org).

 

ii.       Investigation design?

 

Authors information.

 

iii.       Observation and/or measurement methods?

iv.       Observation and/or measurement devices?

v.       Observation/measurement context (who, what, when, where, etc.)?

vi.       Processing methods, queries?

vii.       Quality assurance?

 

Performed quality assurance procedures, results of QA software.

 

g.     Do you know/use controlled vocabularies, e.g. ontologies, taxonomies and other formally specified terms, for the description of the steps for data provenance?

 

Not yet.

 

h.    What support, e.g. software, tools, and operational procedures (workflows), do you think is needed for provenance tracking?

 

Agreements on what information to record and simple APIs to be able to be integrated in analysis tools and frameworks.

 

i.       How does your community use/plan to use the provenance information?

 

-For catalogues as additional metadata for data products.

-For end users to understand the derivation history of data products.

-For tools to automatically “replay” specific analysis parts.

 

j.       Do you have any tools or services in place/planned for this purpose?

 

No generic ones – specific loggers, etc.

 

  6. Optimization

 

a.     Related to your answer to generic question 6 (What part of your RI needs to be improved):

i.       What does it mean for this to be optimal in your opinion?

 

- Easy, standardized interfaces for command-line usage as well as portal integration.

- Faster, more robust and fully automated replication procedures. Fast replication across continents is key to accelerating data access at an early stage of a major project.

- Policies etc. for the assignment of compute resources to users (groups).

- Funding for community computing resources.

 

ii.       How do you measure optimality in this case?  

  • Are there any existing metrics being applied?

 

RIs have a set of KPIs that will progress if those areas are improved.

  • Are there any standard metrics applied by domain scientists in this discipline?

 

End user satisfaction. The number of publications should progress faster than before if we progress in those directions.

 

iii.       Do you already know what needs to be done to make   this   optimal?  

  • Is it simply a matter of more resources, better machines, or does it require a rethink about how the infrastructure should be designed?

 

A rethink is necessary, on the one hand, to reach the end users (“data analysts”).

 

iv.       What would you   not   want from an 'optimal' solution?   For example, maximizing one attribute of a component or process (e.g. execution time) might come at the cost of another attribute (e.g. ease-of-use), which ultimately may prove undesirable.

 

Due to the amounts of data involved, we would not want to lower network performance. Also fundamental is the “ease of use” of the RI by scientists and engineers.

 

b.     Follow-up questions to answers from other sections which suggest the need for the optimization of certain RI components.

 

Data citation is currently not an easy task because our data collections are extremely complex. We can progress along that line.

 

c.     Do you have any use case/scenarios to show potential bottlenecks in 1) the functionality of your RI, for example the storage, access and delivery of data, doing processing, handling the workflow complexity etc. 2) ensuring the non-functional requirements of your RI, for example ensuring load balance in resource usage etc. 

 

Ensuring load balance once computing services are made widely available will be a challenge. Network resources are also a potential bottleneck because of the data volumes we are dealing with.

 

d.     To understand those bottlenecks:

i.       what might be the peak volume in accessing, storing, and delivering data?

 

The previous project (CMIP5) saw up to about 10 TB daily across all (mainly 3) European nodes. We expect CMIP6 to show significantly higher values.

 

ii.       what complexity might the data processing workflow have?

 

We presently need to handle rather complex workflows.

 

iii.       Are there any specific quality requirements for accessing, delivering or storing data, in order to handle the data in nearly real time?

 

No.

 

 

  7. Community support

 

We define Community Support as being concerned with managing, controlling and tracking users' activities within an RI and with supporting all users to conduct their roles in their communities. It includes many miscellaneous aspects of RI operations, including for example (non-exhaustively) authentication, authorization and accounting, the use of virtual organizations, training and helpdesk activities.

 

a.               Training Requirements

i.       Do you use or plan to use e-Infrastructure technology?

 

We use Cloud, Grid, HPC and cluster computing; ESGF is an e-infrastructure in this sense.

ii.       What is your community training plan?

 

Workshops from time to time. We also advertise within our communities the training courses and workshops organized by HPC centres or European projects (PRACE, EGI, ...).

iii.       Does your community consider adopting e-Infrastructure solutions (e.g., Cloud, Grid, HPC, cluster computing).

 

n/a

iv.       Is your community interested in training courses that introduce state-of-the-art e-Infrastructure technology?

 

We would need to see the details; potentially yes.

 

v.       What topics (related to e-Infrastructure solutions) would your community be interested in?

 

Load balancing, compute resources management.

 

vi.       Who would be the audience?

  • Please describe their knowledge background of e-Infrastructure technology

 

Developers and PIs of our RI in the first stage.

 

vii.       What are appropriate methods to deliver this training?

 

Workshops.

b.              Requirements for the Community Support Subsystem:

i.       What are the required   functionalities of your Community Support capability?

 

We have AAI, help desks and accounting activities in place.

 

ii.       What are the non-functional requirements, e.g., privacy, licensing, performance?

 

Good performance for high data volumes. Some data have licensing constraints that will restrict access to a certain group of users.

 

iii.       What standards do you use, e.g.,   related to data, metadata, web services?

 

Metadata: ISO, DIF, SAML, REST, DC…

iv.       What community software/services/applications do you use?

For AAI we use OAuth2, OpenID, SAML, X509.

For ESGF: LAS, Synda, Birdhouse, netCDF, Thredds…