
Introduction

Environmental science now relies on the acquisition of great quantities of data from a range of sources. That data might be consolidated into a few very large datasets, or dispersed across many smaller datasets; the data may be ingested in batch or accumulated over a prolonged period. To use this wealth of data effectively, it is important that the data is both optimally distributed across a research infrastructure's data stores, and carefully characterised to permit easy retrieval based on a range of parameters. It is also important that experiments conducted on the data can be easily compartmentalised so that individual processing tasks can be parallelised and executed close to the data itself, so as to optimise use of resources and provide swift results for investigators.

We are concerned here with the gathering and scrutiny of requirements for optimisation. More pragmatically, we are concerned with how we might develop generically applicable methods by which to optimise the research output of environmental science research infrastructures, based on the needs and ambitions of the infrastructures surveyed in the early part of the ENVRI+ project.

Perhaps more so than for the other topics, optimisation requirements are driven by the specific requirements of those other topics, particularly processing, since the intention is to address specific technical challenges in need of refined solutions, albeit implemented in a way that can be generalised to more than one infrastructure. For each part of an infrastructure in need of improvement, we must consider:

  • What does it mean for this part to be optimal?
  • How is optimality measured—do relevant metrics already exist as standard?
  • How is optimality achieved—is it simply a matter of more resources, better machines, or is there need for a fundamental rethink of approach?
  • What can and cannot be sacrificed for the sake of 'optimality'? For example, it may be undesirable to sacrifice ease-of-use for a modest increase in the speed at which experiments can be executed.

More specifically, we want to focus on certain practical and broadly universal technical concerns:

  • What bottlenecks exist in the functionality of (for example) storage, access and delivery of data, data processing, and workflow management? (A simple measurement sketch follows this list.)
  • What are the current peak volumes for data access, storage and delivery for parts of the infrastructure?
  • What is the (computational) complexity of different data processing workflows?
  • What are the specific quality (of service, of experience) requirements for data handling, especially for real time data handling?
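
As an illustration of how the first two questions above might be approached in practice, the following is a minimal sketch (not an ENVRI+ deliverable) of timing the retrieval of a dataset over HTTP to obtain rough latency and throughput figures; the URL, file name and chunk size are placeholders.

    import time
    import urllib.request

    def measure_retrieval(url, chunk_size=1 << 20):
        """Download url and return (seconds to first byte, total seconds, bytes read)."""
        start = time.monotonic()
        time_to_first_byte = None
        total_bytes = 0
        with urllib.request.urlopen(url) as response:
            while True:
                chunk = response.read(chunk_size)
                if time_to_first_byte is None:
                    time_to_first_byte = time.monotonic() - start
                if not chunk:
                    break
                total_bytes += len(chunk)
        elapsed = time.monotonic() - start
        return time_to_first_byte, elapsed, total_bytes

    # Hypothetical usage:
    # ttfb, elapsed, size = measure_retrieval("https://example.org/sample-dataset.nc")
    # print(f"latency {ttfb:.2f} s, throughput {size / elapsed / 1e6:.1f} MB/s")

Repeating such measurements against different stores and at different times of day is one simple way of locating bottlenecks and establishing peak access volumes.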

Gathering of optimisation requirements is coordinated with help from go-betweens.

Overview and summary of optimisation requirements

Many optimisation problems, whether explicitly identified as such by RIs or implicit in the requirements of other topics, can be reduced to problems of data placement. Is the data needed by researchers available in a location from which it can be easily identified, retrieved and analysed, in whole or in part? Is it feasible to perform analysis on that data without substantial additional preparation, and if not, what is the overhead in time and effort required to prepare the data for processing? This latter question relates to the notion of data staging, whereby data is placed and prepared for processing on some computational service (whether provided on a researcher's desktop, an HPC cluster or a web server), which in turn raises the further question of whether data should be brought to where it can best be computed upon, or computing tasks brought to where the data currently reside. Given the large size of many RIs' primary datasets, bringing computation to the data is appealing, but the complexity of various analyses often requires supercomputing-level resources, which in turn require the data to be staged at a computing facility such as those brokered in Europe by PRACE.
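
As a rough illustration of that trade-off, the sketch below uses entirely placeholder figures for link speed, queue wait and speed-up to compare the estimated wall-clock cost of staging data at a remote facility against computing where the data already resides; it is a toy model, not a recommended decision procedure.

    def choose_placement(dataset_gb, local_runtime_s,
                         link_gbps=10.0, queue_wait_s=600.0, remote_speedup=4.0):
        """Return a placement suggestion based on a naive wall-clock estimate."""
        transfer_s = dataset_gb * 8.0 / link_gbps   # time to move the data to the facility
        remote_total_s = transfer_s + queue_wait_s + local_runtime_s / remote_speedup
        if remote_total_s < local_runtime_s:
            return "stage the data at the remote facility"
        return "bring the computation to the data"

    # Hypothetical example: a 2 TB dataset with a 10-hour runtime on local resources
    # print(choose_placement(dataset_gb=2000, local_runtime_s=10 * 3600))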

Reductionism aside, however, the key performance indicator used by most RIs is researcher productivity. Can researchers use the RI to locate the data they need efficiently? Do they have access to all the support available for processing the data and conducting their experiments? Can they replicate the cited results of their peers using the facilities provided? This raises yet another question: how does the service provided to researchers translate into requirements on data placement and infrastructure availability?

Data context

Good provenance is fundamental to optimisation: in order to anticipate how data will be used by the community, and what infrastructure can be conscripted to provide access to those data, it is necessary to understand as much about the data as possible. Provenance is required to answer who, what, where, when, why and how regarding the origins of data, and the role of an optimised RI is to know the answers to the same questions regarding the future use of data. Ensuring that these questions can be asked and answered becomes more challenging the greater the heterogeneity of the data being handled by the RI.
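
As a purely illustrative example, the following sketch shows a minimal record covering the who/what/where/when/why/how questions above; the field names and example values are hypothetical, and a production RI would more likely adopt an established model such as W3C PROV.

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class ProvenanceRecord:
        who: str        # agent (person, instrument or service) that produced the data
        what: str       # description of the dataset or activity
        where: str      # site, platform or service location
        when: datetime  # time of acquisition or generation
        why: str        # purpose, campaign or project
        how: str        # instrument, method or workflow used

    # Hypothetical example record:
    record = ProvenanceRecord(
        who="Argo float 6901234",
        what="temperature and salinity profile",
        where="North Atlantic, 47.2N 20.1W",
        when=datetime(2016, 3, 1, 12, 0, tzinfo=timezone.utc),
        why="Euro-Argo ocean monitoring",
        how="CTD sensor with delayed-mode quality control",
    )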

Quality checking of datasets is also important, in order to ensure that the data is fit for purpose.
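
A simple illustration of one such check follows, assuming numeric observations and a plausible physical range; the flag values loosely follow common oceanographic QC conventions but are not an agreed standard.

    def range_check(values, lower, upper):
        """Return one QC flag per value: 1 = good, 4 = outside the plausible range."""
        return [1 if lower <= v <= upper else 4 for v in values]

    # Hypothetical example: sea-surface temperatures in degrees Celsius
    # print(range_check([12.4, 13.1, 99.9], lower=-2.0, upper=40.0))  # -> [1, 1, 4]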

Streamlining the acquisition of data from data providers is important to many RIs, both to maximise the range and timeliness of the datasets subsequently made available to researchers, and to increase data security (by ensuring that data is properly curated with minimal delay).

Knowledge infrastructure

The efficient exploitation of RI infrastructure, in terms of data transportation, placement and the serving of computing resources, requires knowledge about the RI and its assets. This knowledge is usually embedded in the technical experts assigned to manage the infrastructure; however, the ability of infrastructure services to acquire the knowledge needed to manage themselves (even if only to the extent of provisioning resources on cloud infrastructure to support static resources) would allow for greater flexibility and agility in RI composition.
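
A minimal sketch of the kind of self-management rule alluded to above, with entirely hypothetical thresholds and no connection to any particular cloud API:

    def plan_capacity(current_nodes, cpu_utilisation,
                      scale_up_at=0.8, scale_down_at=0.3, min_nodes=1):
        """Return the desired node count given mean CPU utilisation in the range 0..1."""
        if cpu_utilisation > scale_up_at:
            return current_nodes + 1        # burst onto additional cloud capacity
        if cpu_utilisation < scale_down_at and current_nodes > min_nodes:
            return current_nodes - 1        # release surplus cloud capacity
        return current_nodes

    # Hypothetical example: 92% utilisation on 4 nodes suggests provisioning a 5th
    # print(plan_capacity(4, 0.92))

Even a rule this simple presupposes that the service can observe its own load, which is precisely the kind of knowledge that is currently held only by the technical experts managing the infrastructure.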

Research Infrastructures

The following RIs contributed to developing optimisation requirements:

Euro-Argo: This RI is interested in: providing full contextual information for all of its datasets (who, what, where, when, why, how); local replication of datasets (to make processing more efficient); cloud replication of the Copernicus marine service in-situ data in order to make it more accessible to the marine research community.

IS-ENES2: This RI has an interest in: standardised interfaces for interacting with services; automated replication procedures for ensuring the availability of data across all continents; policies for the assignment of compute resources to user groups; funding for community computing resources.

SeaDataNet: This RI is interested in: minimising the footprint of marine data observation; helping automate the submission of datasets by data providers to the RI; balancing the trade-offs between centralisation and distribution for performance, control and visibility; and tackling organisational bottlenecks.
