Introduction defining context and scope

(Scientific Data) Processing or Analytics is a vast domain that includes any activity or process performing a series of actions on dataset(s) to distil information [1]. It is particularly important in scientific domains, especially with the advent of the 4th Paradigm and the availability of "big data" [2]. Almost every Research Infrastructure has to deal with some form of scientific data processing task. Data analytics methods draw on multiple disciplines, including statistics, quantitative analysis, data mining, and machine learning. These methods often require computing-intensive infrastructures to produce their results in acceptable time, because of the data to be processed (e.g. large in volume or heterogeneous) and/or because of the complexity of the algorithm/model to be elaborated. Moreover, since these methods are devised to analyse dataset(s) and produce other "data"/information (which can itself be considered a dataset), they are strongly characterised by the "typologies" of such input and output.

This technology review focuses on the following aspects:

  • ...
  • Data processing enactment platforms;
  • Scientific Workflow Management Systems;
  • Protocols and Standards for data processing;
  • ...

Change history and amendment procedure

The review of this topic will be organised by  in consultation with the following volunteers: . They will partition the exploration and gathering of information and collaborate on the analysis and formulation of the initial report. Record details of the major steps in the change history table below. For further details of the complete procedure see item 4 on the Getting Started page.

Note: Do not record editorial / typographical changes. Only record significant changes of content.

Date | Name | Institution | Nature of the information added / changed

Sources of information used

The scientific data processing domain is extensive and multifaceted. Whenever possible we relied on literature that has been published or is about to be published. The second major source of information is the Web, including technology websites and project/RI websites.

Two-to-five year analysis

State of the art

*** A snapshot by Aleksi Kallio (CSC) ***

The hype around big data technologies started with Google MapReduce, which was soon implemented in open source as Apache Hadoop. Hadoop consists of two major components: the Hadoop Filesystem (HDFS) for storing data in a replicated and distributed manner, and the MapReduce execution engine for batch processing of data. Hadoop remains the most widely used system for production workloads, but many alternative technologies have been introduced. Most notably, Apache Spark has quickly gained a wide user base. It provides efficient ad hoc processing, in-memory computing and convenient programming interfaces, and is typically used in conjunction with the Hadoop Filesystem. Database-like solutions include Hive, the original and robust system for heavy queries; Apache Cassandra for scalable and highly available workloads; and Cloudera Impala for high-performance interactive queries. Apache Spark also provides Spark SQL for queries.
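As an illustration of the kind of ad hoc, in-memory processing mentioned above, the following minimal PySpark sketch reads a dataset from the Hadoop Filesystem, caches it in memory, and runs the same aggregation through both the DataFrame API and Spark SQL. The file path and column names are hypothetical placeholders, not taken from any specific RI.

    # Minimal PySpark sketch (illustrative only; path and column names are hypothetical)
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("processing-example").getOrCreate()

    # Read a dataset stored on the Hadoop Filesystem
    df = spark.read.csv("hdfs:///data/observations.csv", header=True, inferSchema=True)
    df.cache()  # keep the data in memory for repeated ad hoc queries

    # DataFrame API: average temperature per station
    df.groupBy("station").avg("temperature").show()

    # The same aggregation expressed through Spark SQL
    df.createOrReplaceTempView("observations")
    spark.sql("SELECT station, AVG(temperature) AS avg_temp "
              "FROM observations GROUP BY station").show()

    spark.stop()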

The most widely used programming languages for data science tasks are Python and R. Python is widely used for data manipulation and pre-processing, and through popular libraries such as Pandas it also supports a rich variety of data analysis methods. The R language is the de facto tool for statistical data analysis, boasting the most comprehensive collection of freely available statistical methods. Both languages have bindings to big data systems such as Apache Spark.
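To make the Python side concrete, the short Pandas sketch below loads a tabular dataset, removes incomplete records, standardises a column and computes per-group summary statistics. The file name and column names are again hypothetical.

    # Minimal Pandas sketch (illustrative only; file and column names are hypothetical)
    import pandas as pd

    df = pd.read_csv("measurements.csv")               # load raw tabular data
    df = df.dropna(subset=["value"])                   # drop records with missing values
    df["value_std"] = (df["value"] - df["value"].mean()) / df["value"].std()  # standardise

    print(df.groupby("site")["value"].describe())      # per-site summary statistics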

Subsequent headings for each trend (if appropriate in this HL3 style)

Problems to be overcome

Sub-headings as appropriate in HL3 style (one per problem)

Details underpinning above analysis

Sketch of a longer-term horizon

Relationships with requirements and use cases

Summary of analysis highlighting implications and issues

 

 

Bibliography and references to sources

 

  1. R. Bordawekar, B. Blainey, C. Apte (2014). Analyzing analytics. SIGMOD Rec. 42, 4 (February 2014), 17-28. DOI: http://dx.doi.org/10.1145/2590989.2590993
  2. T. Hey, S. Tansley, K. Tolle (2009). The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research. http://research.microsoft.com/en-us/collaboration/fourthparadigm/
  3. C. S. Liew, M. Atkinson, M. Galea, P. Martin, T. F. Ang, J. I. van Hemert (TO APPEAR). Scientific Workflow Management Systems: Moving Across Paradigms. ACM Comput. Surv.
  4. J. Yu and R. Buyya (2005). A taxonomy of scientific workflow systems for grid computing. SIGMOD Rec. 34, 3 (September 2005), 44-49. DOI: http://dx.doi.org/10.1145/1084805.1084814