The project, called Science Mesh, aims to build a global interoperable mesh of synchronization and sharing cloud services as part of the European Open Science Cloud by federating cloud storage sites and software providers around the world.
Science Mesh will, for example, make it easier for international scientific collaborations to access, share and analyse the more than 300 petabytes of physics data generated by CERN’s Large Hadron Collider (LHC) experiments. The volume of LHC data stored at CERN is expected to grow to half an exabyte in the coming years.
Dr Jakub Moscicki, head of disk storage operations at CERN and Science Mesh project coordinator, spoke about the project at the eResearch Australasia 2019 conference held in Brisbane recently.
Moscicki said commercial cloud services have changed the way scientists collaborate and work with data, but moving the enormous volumes of data generated at CERN to external storage so that these services can be utilised is impractical. CERN instead developed an in-house collaboration platform, which the Science Mesh project will now build upon.
“We wanted to provide CERN scientists with the opportunity to use these services on premises, so we developed an in-house cloud collaboration platform with integrated applications and the same ease of use as commercial cloud services,” he said. “The Science Mesh project takes the idea to the next level by extending the capability to the 600 institutions and organisations that make up CERN’s international community.”
CERN’s collaboration platform, CERNBox, is a web-based file synchronisation and sharing service that provides access to user data and the CERN data store, along with a range of applications for working with data, including concurrent file editing and computational tools such as SWAN, a system for web-based analysis using Jupyter notebooks. All the elements are combined into one coherent system that integrates easily into researchers’ workflows.
The rise of sync&share storage services
Over the past four years or so, research file storage services like CERNBox, based on desktop synchronisation and sharing, also known as enterprise file sync&share (EFSS), have taken off in earnest around the world. They are typically operated and funded by major research institutions (CERN is a leading example), major e-infrastructure providers, and National Research and Education Networks (NRENs).
In Australia, AARNet’s CloudStor platform is an example of this trend; in the Netherlands, it is SURFnet’s SURFdrive. Country by country, these services have succeeded at a national level, gradually becoming an indispensable element of daily workflows for hundreds of thousands of users, including researchers, students, archivists, scientists and engineers.
These well-used data services are now dotted across the science landscape, but they still run in isolation from each other. The obvious next step, for which both the systems and their user bases are ready, is to interlink them, arriving at a consolidated view of the data they hold and of the user communities seeking to collaborate through them.
Such a federated system would enable much greater global collaboration and would more easily serve as a deployment platform for open science activities that benefit from a single (virtual) deployment or a single worldview (e.g., citation, archival, metrics). It would save a user of system #1 from searching in vain for a collaborator or dataset present only on system #2. In short, it would improve the findability, accessibility, interoperability and reusability (FAIRness) of data in active research projects, for all sites involved.
Fortunately, the European Commission agreed with the concept and, just recently, funded the Science Mesh project consortium proposal to build this interlinking system as part of the European Open Science Cloud, with AARNet as an international partner. The project received funding of six million euros for three years and commences in January 2020.
Interconnecting global research data sharing services
There are 12 partners working on creating the initial infrastructure, which will involve interconnecting existing sustainable data sharing services, such as CERNBox, AARNet’s CloudStor, SURFdrive and others. These services collectively have around 200,000 users.
“By interconnecting these services we are adding the capability for users – researchers, educators, data curators and data analysts – to collaborate across all the services with the same ease of use they experience collaborating within one service,” said Moscicki.
Important substrate services and protocols also already exist and can be reused: the Open Cloud Mesh (OCM) protocol (co-developed by research and education networks under the GÉANT banner) can already signal data shares between enterprise file sync&share systems, and eduGAIN (deployed and supported in Australia by the Australian Access Federation) can signal users and authorisation levels between different countries’ access federations.
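At its core, OCM works by having one sync&share site notify another of a new share over HTTP. The sketch below is illustrative only: it builds a simplified share-notification payload loosely modelled on the OCM approach. The field names, endpoint path and example addresses are assumptions for illustration, not a verbatim rendering of the specification or of any deployed service.

```python
import json

def build_share_payload(resource_name, share_with, sender, provider_id, webdav_url):
    """Build an illustrative OCM-style JSON body that a sending site
    would POST to a receiving site (e.g. to a path like /ocm/shares)
    to announce a new share. All field names here are assumptions
    loosely modelled on OCM, not a copy of the official schema."""
    return {
        "shareWith": share_with,    # recipient at the receiving site
        "name": resource_name,      # human-readable name of the shared resource
        "providerId": provider_id,  # sender-side identifier for the resource
        "sender": sender,           # user initiating the share
        "resourceType": "file",
        "shareType": "user",
        # How the receiver can actually reach the data once notified:
        "protocol": {
            "name": "webdav",
            "options": {"sharedSecret": "example-secret", "url": webdav_url},
        },
    }

# Hypothetical example: a CERNBox user sharing a file with a CloudStor user.
payload = build_share_payload(
    resource_name="lhc-run-summary.csv",
    share_with="researcher@cloudstor.example.edu.au",
    sender="physicist@cernbox.example.cern.ch",
    provider_id="42",
    webdav_url="https://cernbox.example.cern.ch/remote.php/webdav/",
)
body = json.dumps(payload)  # serialised request body for the HTTP POST
```

The key design point is that the notification carries both the social metadata (who shared what with whom) and an access protocol block telling the receiving site how to fetch the data, so the two EFSS systems never need to copy the underlying files to a third party.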
Providing a global sync&share collaboration service for the research and education community is one goal of the Science Mesh project; another key goal is to provide an interoperable platform for easily sharing and deploying applications and software components that extend the functionality of the service.
“There are plenty of people in research institutes and computing centres who come up with new ideas for how to integrate functionality, but currently it is very hard because there is no way to integrate software so that it can be used in other labs. We aim to solve this problem by creating an integrated ecosystem of open-source applications, plugins and components that will allow us to extend the federation much more easily in the future,” said Moscicki.
AARNet’s role in the Science Mesh project is that of a non-funded partner contributing to developing standards for metadata handling for data transfer; facilitating the integration of AARNet-connected institutions, instruments and facilities such as supercomputing centres; and contributing use cases from various research domains to inform development.
For the Australian research and education community, AARNet’s involvement in the project brings seamless integration with the global research and education community, and the ability to leverage Science Mesh API development domestically to integrate more services with CloudStor.