Environmental science now relies on the acquisition of great quantities of data from a range of sources. That data might be consolidated into a few very large datasets, or dispersed across many smaller datasets; the data may be ingested in batch or accumulated over a prolonged period. To use this wealth of data effectively, it is important that the data is both optimally distributed across a research infrastructure's data stores, and carefully characterised to permit easy retrieval based on a range of parameters. It is also important that experiments conducted on the data can be easily compartmentalised so that individual processing tasks can be parallelised and executed close to the data itself, so as to optimise use of resources and provide swift results for investigators.
We are concerned here with the gathering and scrutiny of requirements for optimisation. More pragmatically, we are concerned with how we might develop generically applicable methods by which to optimise the research output of environmental science research infrastructures, based on the needs and ambitions of the infrastructures surveyed in the early part of the ENVRI+ project.
Perhaps more so than the other topics, optimisation requirements are driven by the specific requirements of those other topics, particularly processing, since the intention is to address specific technical challenges in need of refined solutions, albeit implemented in a way that can be generalised to more than one infrastructure. For each part of an infrastructure in need of improvement, we must consider:
More specifically, we want to focus on certain practical and broadly universal technical concerns:
The gathering of optimisation requirements is coordinated with help from go-betweens.
Many optimisation problems, whether explicitly identified as such by RIs or implicit in the requirements of other topics, can be reduced to ones of data placement. Is the data needed by researchers available in a location from which it can be easily identified, retrieved and analysed, in whole or in part? Is it feasible to perform analysis on that data without substantial additional preparation, and if not, what is the overhead in time and effort required to prepare the data for processing? This latter question relates to the notion of data staging, whereby data is placed and prepared for processing on some computational service (whether provided on a researcher's desktop, an HPC cluster or a web server), which in turn raises the further question of whether data should be brought to where they can best be computed, or computing tasks brought to where the data currently reside. Given the large size of many RIs' primary datasets, bringing computation to data is appealing, but the complexity of various analyses often requires supercomputing-level resources, which require the data to be staged at a computing facility such as those brokered in Europe by PRACE.
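To make the data-staging trade-off concrete, the following sketch (in Python) compares the end-to-end time of staging a dataset at a remote computing facility against running the analysis where the data already reside. All figures and names are purely illustrative and do not come from any specific RI.

    # Toy cost model: stage the data at a remote HPC facility, or run the
    # analysis where the data currently reside?  All figures are illustrative.

    def staging_hours(dataset_gb: float, bandwidth_mbps: float) -> float:
        """Hours needed to transfer the dataset to the computing facility."""
        gigabits = dataset_gb * 8
        return gigabits / (bandwidth_mbps / 1000) / 3600

    def choose_placement(dataset_gb: float, bandwidth_mbps: float,
                         local_runtime_h: float, remote_runtime_h: float) -> str:
        """Pick whichever option gives the researcher a result sooner."""
        remote_total = staging_hours(dataset_gb, bandwidth_mbps) + remote_runtime_h
        if remote_total < local_runtime_h:
            return "stage the data at the computing facility"
        return "bring the computation to the data"

    # Example: a 2 TB dataset over a 1 Gbps link; 40 h near the data, 6 h on HPC.
    print(choose_placement(2000, 1000, local_runtime_h=40, remote_runtime_h=6))

In practice an RI would weigh further factors such as storage quotas, queue waiting times and transfer costs, but even a simple model of this kind helps expose where staging overheads dominate.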
Reductionism aside, however, the key performance indicator used by most RIs is researcher productivity. Can researchers use the RI to efficiently locate the data they need? Do they have access to all the support available for processing the data and conducting their experiments? Can they replicate the cited results of their peers using the facilities provided? This raises yet another question: how does the service provided to researchers translate into requirements on data placement and infrastructure availability?
Good provenance is fundamental to optimisation: in order to anticipate how data will be used by the community, and what infrastructure can be conscripted to provide access to those data, it is necessary to understand as much about the data as possible. Provenance is required to answer who, what, where, when, why and how regarding the origins of data, and the role of an optimised RI is to know the answers to those same questions regarding the future use of data. Ensuring that these questions can be asked and answered becomes more challenging the greater the heterogeneity of the data being handled by the RI.
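As an illustration of the kind of record this implies, the sketch below captures the six questions for a single dataset. The field names and example values are hypothetical; a production RI would more likely adopt a standard model such as W3C PROV.

    # Minimal provenance record answering who/what/where/when/why/how for the
    # origin of a dataset.  Field names and example values are illustrative.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class ProvenanceRecord:
        who: str         # agent or instrument that produced the data
        what: str        # dataset or observable that was produced
        where: str       # site, platform or region of acquisition
        when: datetime   # acquisition timestamp
        why: str         # campaign or scientific purpose
        how: str         # method, sensor or processing chain used

    record = ProvenanceRecord(
        who="ocean profiling float (hypothetical identifier F-0001)",
        what="temperature and salinity profile",
        where="North Atlantic, 47.2N 12.5W",
        when=datetime(2016, 3, 14, 6, 0),
        why="routine ocean monitoring campaign",
        how="CTD sensor with delayed-mode quality control",
    )
    print(record.who, record.when.isoformat())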
Quality checking of datasets is also important, in order to ensure that the data is fit for purpose.
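A minimal example of such a check, assuming a simple plausible-range test with invented threshold values, might look like the following:

    # Basic automated quality check: flag values outside a plausible physical
    # range before the dataset is marked fit for purpose.  Thresholds are
    # invented for illustration only.

    def range_check(values, lower, upper):
        """Return the indices of values falling outside [lower, upper]."""
        return [i for i, v in enumerate(values) if not (lower <= v <= upper)]

    sea_surface_temp_c = [12.4, 13.1, 98.6, 12.9]   # one obviously suspect reading
    suspect = range_check(sea_surface_temp_c, lower=-2.0, upper=35.0)
    print(f"{len(suspect)} suspect value(s) at indices {suspect}")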
Streamlining the acquisition of data from data providers is important to many RIs, both to maximise the range and timeliness of datasets then made available to researchers, and to increase data security (by ensuring that it is properly curated with minimal delay).
The efficient exploitation of RI infrastructure, in terms of data transportation, placement, and the serving of computing resources, requires knowledge about the RI and its assets. This knowledge is usually embedded in the technical experts assigned to manage the infrastructure; however, the ability of infrastructure services to acquire the knowledge needed to manage themselves (even if only to the extent of provisioning resources on cloud infrastructure to support static resources) would allow for greater flexibility and agility in RI composition.
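A first step towards such self-management could be a simple demand-driven provisioning rule. The sketch below is a toy illustration with assumed capacity figures; the calls that would actually create or release cloud nodes are provider-specific and deliberately omitted.

    # Toy demand-driven provisioning rule: how many nodes are needed to serve
    # the current load with some headroom?  Capacity figures are assumptions;
    # provider-specific calls to create or release nodes are not shown.

    import math

    def desired_nodes(requests_per_sec: float, capacity_per_node: float,
                      min_nodes: int = 1, headroom: float = 0.2) -> int:
        """Node count required to serve current demand plus a safety margin."""
        needed = math.ceil(requests_per_sec * (1 + headroom) / capacity_per_node)
        return max(min_nodes, needed)

    # Example: 450 requests/s against nodes that each handle ~100 requests/s.
    print(desired_nodes(450, capacity_per_node=100))   # 6 nodes with 20% headroom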
The following RIs contributed to developing optimisation requirements:
Euro-Argo: This RI is interested in: providing full contextual information for all of its datasets (who, what, where, when, why, how); local replication of datasets (to make processing more efficient); cloud replication of the Copernicus marine service in-situ data in order to make it more accessible to the marine research community.
IS-ENES2: This RI has an interest in: standardised interfaces for interacting with services; automated replication procedures for ensuring the availability of data across all continents; policies for the assignment of compute resources to user groups; funding for community computing resources.
SeaDataNet: This RI is interested in: minimising the footprint of marine data observation; helping automate the submission of datasets by data providers to the RI; balancing the trade-offs between centralisation and distribution for performance, control and visibility; tackling organisational bottlenecks.