Guido Aben, AARNet’s eResearch Director, says that with the massive volume of data in play today, there is a pressing need to develop a strongly linked research data lifecycle system.
A number of the larger eResearch infrastructure capabilities, namely the Australian Access Federation (AAF), Australian National Data Service (ANDS), Research Data Services (RDS), National eResearch Collaboration Tools and Resources (NeCTAR) and AARNet, have joined forces to map out a possible solution.
Developing the Data LifeCycle Framework
Led by RDS Director Ian Duncan, the group has developed the Data LifeCycle Framework to enable a better understanding of research data nationally, including its location, uses, owners, custodians, and provenance.
The idea is to provide the connecting infrastructure between existing local data management processes and policies and the wide array of national, local, state-based, and commercial infrastructures available to researchers.
The group is also looking at what parts of this solution can be built from components already in use by the research sector.
Building blocks for a data lifecycle system
At the Digital Infrastructures for Research Conference in Poland and at eResearch Australasia 2016, Aben talked about the potential for the AARNet CloudStor solution to function as the building block for the data movement core of this system: the “data pump”, capable of ingesting and egressing data through regular file transfers as well as sync&share.
Here is a summary of the “data pump” concept from his presentation abstract:
This sync&share capability is seen as absolutely vital; indeed, through earlier, smaller pilots using ownCloud as a frontend for generic storage, we have found out just how big the difference in end user uptake is between cloud storage presented to researchers through “legacy tooling” and the same storage presented through a sync&share frontend — orders of magnitude, without exaggeration.
Automating data movement
The ability to ingest and egress data through machine-to-machine (M2M) protocols is needed to take data directly from instruments, and also to push data through intermediary compute and workflow platforms. We intend to allow users to define data routing triggers (e.g., close of a file transfer, time of day, metadata values).
Ideally, a user will be able to rely on the planned system to automatically sync new data into the data pump, wait for a set condition to trigger, and then move the new dataset to a workflow engine. Once the engine signals it is done, the system would retrieve the data, sync it back to the user, and issue a share invite to this user’s collaborators, whether domestic or overseas. Also possible would be the automatic move of a finalised dataset, including attendant metadata, into the correct institutional repository, ready for citation as open data with correct provenance records and signalling to the funding body.
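To make the idea of user-defined routing triggers more concrete, here is a minimal sketch in Python. It is not the CloudStor implementation: the `Event`, `Route` and trigger helper names, the metadata keys and the destination labels are all hypothetical, chosen only to illustrate how the three trigger types mentioned above (close of a file transfer, time of day, metadata values) might be combined into routing rules.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Callable, Dict, List


@dataclass
class Event:
    """Hypothetical snapshot of a dataset sitting in the data pump."""
    path: str
    transfer_closed: bool          # has the upload finished?
    now: time                      # current time of day
    metadata: Dict[str, str] = field(default_factory=dict)


# A trigger is any predicate over an event.
Trigger = Callable[[Event], bool]


def on_transfer_closed() -> Trigger:
    return lambda e: e.transfer_closed


def after_time(t: time) -> Trigger:
    return lambda e: e.now >= t


def metadata_equals(key: str, value: str) -> Trigger:
    return lambda e: e.metadata.get(key) == value


@dataclass
class Route:
    """Send a dataset to `destination` once every trigger holds."""
    triggers: List[Trigger]
    destination: str               # hypothetical label, not a real endpoint

    def matches(self, event: Event) -> bool:
        return all(trigger(event) for trigger in self.triggers)


def route(event: Event, routes: List[Route]) -> List[str]:
    """Return the destinations whose triggers are all satisfied."""
    return [r.destination for r in routes if r.matches(event)]
```

In this sketch, a rule such as `Route([on_transfer_closed(), metadata_equals("status", "raw")], "workflow-engine")` would forward a freshly uploaded raw dataset to a workflow engine, while a second rule keyed on `status == "final"` could hand the finished dataset to an institutional repository.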
CloudStor as the central data pump
As far as implementation goes, the component we intend to use as the “central data pump” is already in operation at AARNet as a service offering called “CloudStor”, a sync&share platform based on ownCloud and used by ~26,000 users. Fortuitously, a good proportion of the existing Australian capabilities in both storage and workflow compute are based on OpenStack, so for interlinking pilots we have so far focused on Swift as the M2M data movement protocol; lessons learned here can easily be generalised to “S3” as the M2M storage interface.
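For readers unfamiliar with Swift as an M2M interface, the sketch below shows roughly what pushing a single object looks like at the HTTP level: a `PUT` to a URL of the form storage URL, container, object name, authenticated with an `X-Auth-Token` header. It uses only the Python standard library and builds the request without sending it. The function name, host name, account, container and token are placeholders assumed for illustration, not details of the CloudStor pilots.

```python
from urllib.parse import quote
from urllib.request import Request


def build_swift_put(storage_url: str, container: str, object_name: str,
                    token: str, payload: bytes) -> Request:
    """Build (but do not send) an OpenStack Swift object upload request.

    storage_url is the account URL handed out by the auth service,
    for example https://swift.example.org/v1/AUTH_demo (placeholder).
    """
    url = "/".join([
        storage_url.rstrip("/"),
        quote(container, safe=""),
        quote(object_name),        # keeps "/" so pseudo-folders still work
    ])
    return Request(
        url,
        data=payload,
        method="PUT",
        headers={
            "X-Auth-Token": token,
            "Content-Type": "application/octet-stream",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the actual upload against a real Swift endpoint. Terabyte-scale transfers would in practice need streaming and segmented uploads rather than a single in-memory payload.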
Moving terabytes and more
Given we are targeting substantial sizes both in data (terabytes and above) and geographies (continental), we must realise the solution needs to scale and retain performance over high latencies (~100ms).
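A quick calculation shows why latency matters so much at this scale. The sketch below computes the bandwidth-delay product (the amount of data that must be in flight to keep a path full) and the throughput ceiling a single windowed stream hits for a given round-trip time. The 10 Gbit/s link speed and the 64 KiB window are assumptions picked for illustration; only the ~100 ms latency comes from the text above.

```python
def bdp_bytes(bandwidth_bps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to fill the path."""
    return bandwidth_bps * (rtt_ms / 1000) / 8


def max_throughput_bps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on one stream's throughput: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000)


if __name__ == "__main__":
    # Assumed 10 Gbit/s path at the ~100 ms latency mentioned above.
    print(f"BDP: {bdp_bytes(10e9, 100) / 1e6:.0f} MB in flight")
    # Assumed 64 KiB window, i.e. no window scaling.
    print(f"Ceiling: {max_throughput_bps(64 * 1024, 100) / 1e6:.2f} Mbit/s")
```

Under these assumptions, about 125 MB has to be in flight to fill the path, while a single stream limited to a 64 KiB window tops out at roughly 5.2 Mbit/s. This is why a continental-scale service has to rely on large windows, parallel streams, or both.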
AARNet’s CloudStor file sender and storage platform has been in operation for more than 4 years, maturing from a tentative proof of concept to a robust, supported service with over 26,000 users and 70TB of data stored.
CloudStor, as it is currently deployed, was designed to deliver a continental-scale data movement and sync&share solution for the Australian research community. The interface functionality and client platform compatibility were intentionally designed to resemble the mainstream functionality present in popular enterprise file sender and storage platforms, but with a backend capable of moving science-size data volumes (gigabytes to terabytes, and directories with multiple thousands of files).
To date, CloudStor has operated in essentially a standalone mode; files are uploaded, downloaded, synchronised, shared and previewed all within the confines of the CloudStor system itself. This isn’t isolation by design; it is merely that, up to this point, attention has been focused predominantly on “ruggedizing” the design and smoothing the user experience.
However, the system as built contains a number of potentially very useful APIs and connectors that could be used to tie together the currently disconnected parts of the Australian eResearch landscape. Some of these connectors have already been tested in small-scale proofs of concept and others are ready to be tested.
With CloudStor used as a “uniform data pump”, many of the other entities in the landscape might work better together without themselves having to worry about large scale data movement issues.
Guido Aben is AARNet’s Director of eResearch
Aben holds an MSc in Physics from Utrecht University. He’s definitely a generalist more than a specialist. His current responsibility is to build services that meet researchers’ demands, with CloudStor and Science DMZ perhaps currently the most widely known of these.