Original Proposal




This is the original text from the DataGrid Technical Annex, defining WP2.

Workpackage 2 - GRID Data Management

In an increasing number of scientific and commercial disciplines, large databases are emerging as important community resources. The goal of this work package is to specify, develop, integrate and test tools and middle-ware infrastructure to coherently manage and share Petabyte-scale information volumes in high-throughput production-quality grid environments. The work package will develop a general-purpose information sharing solution with unprecedented automation, ease of use, scalability, uniformity, transparency and heterogeneity.

 

It will enable secure access to massive amounts of data in a universal global name space, high-speed movement and replication of data from one geographical site to another, and management of the synchronisation of remote copies. Novel software for automated wide-area data caching and distribution will act according to dynamic usage patterns. Generic interfacing to heterogeneous mass storage management systems will enable seamless and efficient integration of distributed resources.

The overall interaction of the components foreseen for this work package is depicted in the diagram. Arrows indicate “use” relationships; component A uses component B to accomplish its responsibilities. The Replica Manager manages file and meta data copies in a distributed and hierarchical cache. It uses and is driven by pluggable and customisable replication policies. It further uses the Data Mover to accomplish its tasks. The Data Mover transfers files from one storage system to another. To implement its functionality, it uses the Data Accessor and the Data Locator, which maps location independent identifiers to location dependent identifiers. The Data Accessor is an interface encapsulating the details of the local file system and mass storage systems such as Castor, HPSS and others. Several implementations of this generic interface may exist, the so-called Storage Managers. They typically delegate requests to a particular kind of storage system. Storage Managers are outside the scope of this work package. The Data Locator makes use of the generic Meta Data Manager, which is responsible for efficient publishing and management of a distributed and hierarchical set of associations, i.e. {identifier → information object} pairs. Query Optimisation and Access Pattern Management ensures that for a given query an optimal migration and replication execution plan is produced. Such plans are generated on the basis of published meta data including dynamic logging information. All components provide appropriate Security mechanisms that transparently span worldwide independent organisational institutions. Access is granted at the level of individual files as well as at the level of data sets. A data set is seen as a set of logically related files.
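The “use” relationships described above can be made concrete with a small sketch. It is only an illustration: the proposal names the components but does not define their interfaces, so every class, method and site name below is an assumption, and the Data Accessor is left out for brevity (the mover merely records the transfers it would perform).

```python
from typing import Callable, Dict, List, Tuple

# All names are hypothetical; the proposal does not specify any interfaces.


class DataLocator:
    """Maps a location-independent identifier to location-dependent ones."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, logical_id: str, physical_id: str) -> None:
        self._locations.setdefault(logical_id, []).append(physical_id)

    def locate(self, logical_id: str) -> List[str]:
        return list(self._locations.get(logical_id, []))


class DataMover:
    """Uses the Data Locator; here it only records the transfers it would do."""

    def __init__(self, locator: DataLocator) -> None:
        self.locator = locator
        self.transfers: List[Tuple[str, str]] = []

    def move(self, logical_id: str, target_site: str) -> bool:
        """Copy to target_site unless a copy is already there."""
        copies = self.locator.locate(logical_id)
        if not copies:
            raise KeyError(f"no known copy of {logical_id}")
        target = f"{target_site}:/{logical_id}"
        if target in copies:
            return False
        self.transfers.append((copies[0], target))
        self.locator.register(logical_id, target)
        return True


# A pluggable policy: logical identifier -> sites that should hold a copy.
ReplicationPolicy = Callable[[str], List[str]]


class ReplicaManager:
    """Driven by a replication policy; uses the Data Mover to do the work."""

    def __init__(self, mover: DataMover, policy: ReplicationPolicy) -> None:
        self.mover = mover
        self.policy = policy

    def replicate(self, logical_id: str) -> List[str]:
        """Return the sites that received a new copy."""
        return [site for site in self.policy(logical_id)
                if self.mover.move(logical_id, site)]


def mirror_on_two_sites(logical_id: str) -> List[str]:
    """Example policy: every file should exist at both illustrative sites."""
    return ["siteA", "siteB"]
```

Swapping `mirror_on_two_sites` for another callable changes the replication behaviour without touching the Replica Manager, which is the point of a pluggable policy.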

An important innovative aspect of WP2 is bringing Grid data management technology to a level of practical reliability and functionality to enable it to be deployed in a production quality environment – this is a real challenge. The work by the Globus team and that of current US projects (GriPhyN, PPDG) is attempting to solve similar Data Management problems. We will be trying as far as possible to avoid unnecessary duplication of major middleware features and approaches by keeping aware of their work and collaborating as fully as possible.

Work Package Tasks 2.3 (Replication) and 2.6 (Query Optimisation) will be the main areas where novel techniques are explored, such as the use of cooperating agents with a certain amount of autonomy. This technology is intended to permit dynamic optimisation of data distribution across the DataGrid as the data is accessed by the varying load of processing tasks in the system.

Task 2.1 Requirements definition (month 1-3)

In this phase a strong interaction with the Architecture Task Force and the end users will be necessary. The results of this task will be collated by the project architect and issued as an internal project deliverable.

Task 2.2 Data access and migration (month 4-18)

This task handles uniform and fast transfer of files from one storage system to another. It may, for example, migrate a file from a local file system of node X over the grid into a Castor disk pool. An interface encapsulating the details of Mass Storage Systems and Local File System provides access to data held in a storage system. The Data Accessor sits on top of any arbitrary storage system so that the storage system is grid accessible.
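As a rough sketch of the idea, a generic accessor interface lets a migration routine copy a file without knowing which storage system sits underneath. The interface, the class names and the `demo` helper are assumptions made for illustration; only a local file system implementation is shown, and no real Castor or HPSS API is used.

```python
import shutil
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import BinaryIO


class DataAccessor(ABC):
    """Hypothetical generic interface hiding the details of one storage system."""

    @abstractmethod
    def open_read(self, name: str) -> BinaryIO:
        """Open the named file for reading as a byte stream."""

    @abstractmethod
    def open_write(self, name: str) -> BinaryIO:
        """Open the named file for writing as a byte stream."""


class LocalFileAccessor(DataAccessor):
    """Stand-in implementation for a local file system rooted at `root`."""

    def __init__(self, root) -> None:
        self.root = Path(root)

    def open_read(self, name: str) -> BinaryIO:
        return open(self.root / name, "rb")

    def open_write(self, name: str) -> BinaryIO:
        path = self.root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        return open(path, "wb")


def migrate(name: str, source: DataAccessor, target: DataAccessor) -> None:
    """Copy one file between storage systems using only the generic interface."""
    with source.open_read(name) as src, target.open_write(name) as dst:
        shutil.copyfileobj(src, dst)


def demo() -> bytes:
    """Migrate a file between two temporary directories and return its content."""
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        Path(a, "events.dat").write_bytes(b"payload")
        migrate("events.dat", LocalFileAccessor(a), LocalFileAccessor(b))
        return Path(b, "events.dat").read_bytes()
```

A mass storage system would be supported by adding another `DataAccessor` subclass; `migrate` itself would stay unchanged.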

Task 2.3 Replication (month 4-24)

Copies of files and meta data need to be managed in a distributed and hierarchical cache so that a set of files (e.g. Objectivity databases) can be replicated to a set of remote sites and made available there. To this end, location independent identifiers are mapped to location dependent identifiers. All replicas of a given file can be looked up. Plug-in mechanisms to incorporate custom tailored registration and integration of data sets into Database Management Systems will be provided.
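The mapping from location independent to location dependent identifiers can be pictured as a small catalogue. This is a minimal sketch under assumed names (`ReplicaCatalogue`, `missing`, and the site labels are all hypothetical); it keeps everything in memory and ignores distribution, hierarchy and persistence.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class ReplicaCatalogue:
    """Hypothetical catalogue: logical file name -> set of sites holding a copy."""

    def __init__(self) -> None:
        self._replicas: Dict[str, Set[str]] = defaultdict(set)

    def add(self, logical_name: str, site: str) -> None:
        self._replicas[logical_name].add(site)

    def remove(self, logical_name: str, site: str) -> None:
        self._replicas[logical_name].discard(site)

    def lookup(self, logical_name: str) -> List[str]:
        """Return every site holding a replica, sorted for stable output."""
        return sorted(self._replicas.get(logical_name, set()))

    def missing(self, files: Iterable[str],
                sites: Iterable[str]) -> List[Tuple[str, str]]:
        """List the (file, site) copies still needed to place all files at all sites."""
        wanted_sites = list(sites)
        return [(f, s) for f in files for s in wanted_sites
                if s not in self._replicas.get(f, set())]
```

For example, after registering `db1` at two sites and `db2` at one of them, `missing(["db1", "db2"], [...])` reports only the single copy of `db2` that still has to be made.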

Task 2.4 Meta data management (month 4-24)

The glue for components takes the shape of a Meta Data Management Service, or simply Grid Information Service. It efficiently and consistently publishes and manages a distributed and hierarchical set of associations, i.e. {identifier → information object} pairs. The key challenge of this service is to integrate diversity, decentralisation and heterogeneity. Meta data from distributed autonomous sites can turn into information only if straightforward mechanisms for using it are in place. Thus, the service defines and builds upon a versatile and uniform protocol, such as LDAP. Multiple implementations of the protocol will be used as required, each focussing on different trade-offs in the space spanned by write/read/update/search-performance and consistency.

Research is required in the following areas:

  •  Maintenance of global consistency without sacrificing performance. A practical approach could be to ensure local consistency within a domain and allow for unreliable and incomplete global state.
  •  Definition of suitable formats for generic and domain-dependent meta data.
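The first research item, local consistency with an unreliable and incomplete global state, can be illustrated with a toy model. It is a sketch under stated assumptions: plain dictionaries stand in for a directory protocol such as LDAP, and the `Domain` class, its methods and the path format are invented for this example.

```python
from typing import Any, Dict, Optional


class Domain:
    """One autonomous site: authoritative for its own entries, cached view of others."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.local: Dict[str, Any] = {}         # always consistent within this domain
        self.remote_cache: Dict[str, Any] = {}  # may be stale or incomplete

    def publish(self, identifier: str, info: Any) -> None:
        """Record an {identifier -> information object} association locally."""
        self.local[f"{self.name}/{identifier}"] = info

    def pull_from(self, other: "Domain") -> None:
        """Refresh the cached copy of another domain's entries (best effort)."""
        self.remote_cache.update(other.local)

    def lookup(self, path: str) -> Optional[Any]:
        """Prefer authoritative local data; fall back to the possibly stale cache."""
        if path in self.local:
            return self.local[path]
        return self.remote_cache.get(path)
```

A domain always sees its own latest value, while another domain sees nothing before its first `pull_from` and an older value between pulls. That is the trade-off the bullet describes: reads stay cheap and local at the price of a global view that can lag behind.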

Task 2.5 Security and transparent access (month 4-24)

This task provides global authentication (“who are you”) and local authorisation (“what can you do”) of users and applications acting on behalf of users. Local sites retain full control over the use of their resources. Users are presented with a logical view of the system, hiding physical implementations and details such as locations of data.
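The split between the two steps can be sketched as follows. This is a toy model, not a security design: the token table stands in for whatever cryptographic credentials a real deployment would use, and all names (`GLOBAL_IDENTITIES`, `Site`, `request`) are assumptions.

```python
from typing import Dict, Optional, Set

# Hypothetical grid-wide registry: credential -> global identity.
GLOBAL_IDENTITIES: Dict[str, str] = {"token-alice": "alice", "token-bob": "bob"}


def authenticate(credential: str) -> Optional[str]:
    """Global step ("who are you"): resolve a credential to an identity."""
    return GLOBAL_IDENTITIES.get(credential)


class Site:
    """Each site keeps its own access rules and so retains control of its resources."""

    def __init__(self, name: str, permissions: Dict[str, Set[str]]) -> None:
        self.name = name
        self._permissions = permissions  # identity -> allowed actions

    def authorise(self, identity: str, action: str) -> bool:
        """Local step ("what can you do"): decided by this site alone."""
        return action in self._permissions.get(identity, set())


def request(site: Site, credential: str, action: str) -> bool:
    """Authenticate once globally, then let the target site decide."""
    identity = authenticate(credential)
    return identity is not None and site.authorise(identity, action)
```

The same authenticated user can therefore be allowed to write at one site and only read at another, because each `Site` consults only its own permission table.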

Task 2.6 Query optimisation support and access pattern management (month 4-24)

Given a query, this task produces a migration and replication execution plan that maximises throughput. Research is required in order to determine, for example, how long it would take to run the following execution plan: purge files {a,b,c}, replicate {d,e,f} from location A to location B, read files {d,e,f} from B, and read {h} from location C, in any order.

The Meta Data Management service will be used to keep track of what data sets are requested by users, so that the information can be made available for this service.
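To show what estimating the run time of such a plan might involve, here is a deliberately naive cost model applied to the example plan from this task. Every number in it (file sizes, throughputs, purge cost) is invented for illustration, and the model assumes the steps run strictly one after another; a real estimator would draw on the published meta data and account for concurrency and contention.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed inputs, not taken from the proposal: sizes in MB, throughput in MB/s.
FILE_SIZE_MB: Dict[str, float] = {"a": 100, "b": 100, "c": 100,
                                  "d": 500, "e": 500, "f": 500, "h": 200}
LINK_MB_PER_S: Dict[Tuple[str, str], float] = {("A", "B"): 50.0}
READ_MB_PER_S: Dict[str, float] = {"B": 100.0, "C": 40.0}
PURGE_SECONDS_PER_FILE = 1.0


@dataclass(frozen=True)
class Step:
    kind: str                 # "purge", "replicate" or "read"
    files: Tuple[str, ...]
    source: str = ""
    target: str = ""


def step_seconds(step: Step) -> float:
    """Estimate the duration of a single step from the assumed figures."""
    size = sum(FILE_SIZE_MB[f] for f in step.files)
    if step.kind == "purge":
        return PURGE_SECONDS_PER_FILE * len(step.files)
    if step.kind == "replicate":
        return size / LINK_MB_PER_S[(step.source, step.target)]
    if step.kind == "read":
        return size / READ_MB_PER_S[step.source]
    raise ValueError(f"unknown step kind: {step.kind}")


def plan_seconds(plan: List[Step]) -> float:
    """Serial estimate: the total is the sum of the step durations."""
    return sum(step_seconds(s) for s in plan)


PLAN = [
    Step("purge", ("a", "b", "c")),
    Step("replicate", ("d", "e", "f"), source="A", target="B"),
    Step("read", ("d", "e", "f"), source="B"),
    Step("read", ("h",), source="C"),
]
```

Under these assumed figures the plan costs 3 s of purging, 30 s of replication and 15 s plus 5 s of reading, so `plan_seconds(PLAN)` gives 53 seconds. Comparing such estimates across alternative plans is one simple way an optimiser could choose between them.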

Task 2.7 Testing, refinement and co-ordination (month 1-36)

The testing and refinement of each of the software components produced by Tasks T2.2, T2.3, T2.4, T2.5, T2.6 will be accomplished by this task, which continues to the end of the project. This task will take as its input the feedback received from the Integration Testbed work package and ensure the lessons learned, software quality improvements and additional requirements are designed, implemented and further tested.

In addition, the activities needed for co-ordination of all WP2 tasks will be carried out as part of this Task.

Resources

The resources required to implement the workpackage are as follows:

Task        Total PM   CERN   ITC    UH    VR   INFN   PPARC
2.1         20 (6)        4     4     4     4      2       2
2.2         40 (4)        8     0     0     8     24       0
2.3         42 (13)      26     8     8     0      0       0
2.4         62 (20)      10     0    36     0      0      16
2.5         62 (18)      10     0     0    36      0      16
2.6         50 (23)      14    36     0     0      0       0
2.7         172 (60)     52    24    24    24     28      20
Total PM    448         124    72    72    72     54      54
Funded PM   180          36    72    36    36      0       0


Description sheet:

Workpackage description – Grid Data Management

Workpackage number: 2

Start date or starting event: Project Start

Participant                     CERN   ITC    UH    VR   Total
Person-months per participant     36    72    36    36     180

Objectives

The goal of this work package is to specify, develop, integrate and test tools and middle-ware infrastructure to coherently manage and share Petabyte-scale information volumes in high-throughput production-quality grid environments. The work package will develop a general-purpose information sharing solution with unprecedented automation, ease of use, scalability, uniformity, transparency and heterogeneity.

Description of work

Task 2.1: The results of this task will be collated and issued as a deliverable.

Task 2.2: Produces software for uniform and fast transfer of files from one storage system to another.

Task 2.3: Manages copies of files and meta data in a distributed and hierarchical cache.

Task 2.4: Publishes and manages a distributed and hierarchical set of associations.

Task 2.5: Provides global authentication and local authorisation.

Task 2.6: Produces a migration and replication execution plan that maximises throughput.

Task 2.7: Takes as input the feedback received from the Integration Testbed work package and ensures the lessons learned, software quality improvements and additional requirements are designed, implemented and further tested. Also assures the co-ordination of all sub-Tasks of this work package.

Deliverables

D2.1 (Report) Month 4: Report of current technology

D2.2 (Report) Month 6: Detailed report on requirements, architectural design, and evaluation criteria – input to the project architecture deliverable (see WP12)

D2.3 (Prototype) Month 9: Components and documentation for the first Project Release (see WP6)

D2.4 (Prototype) Month 21: Components and documentation for the second Project Release

D2.5 (Prototype) Month 33: Components and documentation for the final Project Release

D2.6 (Report) Month 36: Final evaluation report

Milestones and expected result

M2.1 Month 9: Components and documentation for the first Project Release completed.

M2.2 Month 21: Components and documentation for the second Project Release completed.

M2.3 Month 33: Components and documentation for the final Project Release completed.

 

The European Organization for Nuclear Research
Feedback and questions concerning this site should be directed to EDG-WP2@cern.ch
Last updated June 20, 2003