From endangered languages to the Science Mesh

01 March, 2021

Australian researchers are helping to design and build next-gen petascale FAIR data repositories.

The scientific community is a global community with collaboration at its heart. A great example of this is the Science Mesh, a European Union-funded joint effort to build a rich ecosystem that enables frictionless data collaboration for research. The platform connects data, applications, compute and research services and provides users with federated access across scientific domains. It is built to encourage FAIR data practices and drive research outcomes by allowing computational systems to find, access, interoperate with and reuse, with little human intervention, the ever-increasing volumes of complex data being generated today.

Researchers and software developers from around the world, including Australia, are contributing knowledge and tools to the development of the Science Mesh platform.

Under the direction of Guido Aben, AARNet Director, International e-Infrastructure Partnerships, the developers of the AARNet CloudStor platform, together with PARADISEC and UTS data experts, are adding FAIR data capabilities to Science Mesh nodes. PARADISEC’s Marco La Rosa has been rebuilding the core storage component of the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) as a foundation for a next generation FAIR data archive. Together with Peter Sefton, an editor of the Research Object Crate specification and an expert in digital repositories, he is sharing expertise in this field with the Science Mesh project.

Towards a next-gen FAIR data archive

Here, Marco La Rosa details the knowledge and tools crossing over from PARADISEC and UTS to the Science Mesh:

PARADISEC has been operating for 18 years and currently holds material in 1,270 languages across Australia and the Pacific. The archive contains over 115TB of content including more than 14,000 hours of audio recordings, 1,600 hours of video and 8,000 transcriptions. It is a facility that acts as an archive of research recordings as well as forming an integral part of the research workflow in which primary data is made citable, is preserved, and is publicised (with licence agreements) for access.

In its current form, the archive is driven by a monolithic Ruby on Rails application. Although the application is showing its age, it doesn’t suffer from issues of scale because of early design decisions about how the data is stored. Specifically, data is stored by collection identifier and item identifier, which distributes files adequately across the file system and keeps folders from accumulating too many entries. Further, the metadata is exported from the application database to XML every time an item in the catalog is saved and is stored with the data, so each item and collection is portable and a new system can be recovered from the on-disk store.
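The layout described above can be sketched in a few lines of Python. This is an illustration only, not the PARADISEC code (which is a Ruby on Rails application): the function names, the `metadata.xml` file name, the XML element names and the identifiers are all hypothetical. It shows the two ideas in play: items live under collection and item identifiers, and metadata written beside the data is enough to rebuild a catalog.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def item_dir(root: Path, collection_id: str, item_id: str) -> Path:
    """Folder for one item: <root>/<collection_id>/<item_id> (hypothetical layout)."""
    return root / collection_id / item_id


def save_item(root: Path, collection_id: str, item_id: str, metadata: dict) -> Path:
    """Write the item's metadata as XML next to its data files."""
    folder = item_dir(root, collection_id, item_id)
    folder.mkdir(parents=True, exist_ok=True)
    elem = ET.Element("item", {"collection": collection_id, "id": item_id})
    for field, value in metadata.items():
        # In this sketch, field names must be valid XML element names.
        ET.SubElement(elem, field).text = value
    target = folder / "metadata.xml"  # assumed file name
    ET.ElementTree(elem).write(str(target), encoding="utf-8", xml_declaration=True)
    return target


def rebuild_catalog(root: Path) -> dict:
    """Recover every item's metadata by walking the on-disk store, no database needed."""
    catalog = {}
    for path in sorted(root.glob("*/*/metadata.xml")):
        elem = ET.parse(str(path)).getroot()
        key = (elem.get("collection"), elem.get("id"))
        catalog[key] = {child.tag: child.text for child in elem}
    return catalog
```

Calling `save_item(root, "COLL1", "001", {"title": "Example recording"})` and then `rebuild_catalog(root)` returns the same metadata keyed by `("COLL1", "001")`, which is the portability property the paragraph describes.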

This design has a lot in common with the application-independent approach to storage described by the Oxford Common File Layout (OCFL) and the idea of data packaged with metadata that is formalised in the Research Object Crate spec (RO-Crate).

Accordingly, in 2019, with the support of a very small grant from the Australian Research Data Commons (ARDC), Nick Thieberger (director of PARADISEC) and I started working with Peter Sefton and the eResearch team at UTS to develop a proof of concept demonstrator of a next generation language archive using OCFL to store the content and RO-Crate to describe it. At this point Peter and his team had significant experience working with OCFL filesystems and RO-Crates in their own applications and their expertise aligned well with our goals.

When thinking about what a next generation catalog might look like, we were drawn to the guarantees of completeness offered by OCFL (i.e. a repo can be rebuilt from its filestores), in addition to features like data integrity, versioning and diversity of underlying storage. Much of the content housed by PARADISEC can never be collected again, so it is crucial that we can verify the integrity of the content whilst also supporting easy movement and replication of objects. These capabilities are either offered natively by OCFL or easily supported within it.
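To make the integrity point concrete, here is a minimal sketch of a fixity check against an OCFL object. It relies on two parts of the OCFL inventory: `digestAlgorithm`, and `manifest`, which maps each digest to the content paths (relative to the object root) that should have it. The function names are hypothetical, and this is not a full OCFL validator: it ignores the inventory sidecar digest, version state and the optional fixity block.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, algorithm: str) -> str:
    """Hex digest of a file, read in chunks so large recordings fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_object(object_root: Path) -> list[str]:
    """Compare every file listed in the inventory manifest with its recorded digest."""
    inventory = json.loads((object_root / "inventory.json").read_text(encoding="utf-8"))
    algorithm = inventory["digestAlgorithm"]  # e.g. "sha512"
    problems = []
    for expected, content_paths in inventory["manifest"].items():
        for rel in content_paths:
            path = object_root / rel
            if not path.is_file():
                problems.append(f"missing: {rel}")
            # hexdigest() is lower case, so normalise the recorded digest too.
            elif file_digest(path, algorithm) != expected.lower():
                problems.append(f"digest mismatch: {rel}")
    return problems
```

An empty list means every file in the manifest is present and unchanged. Because the check needs only the object's own files, it can be run equally well on a replica after the object has been copied to different storage.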

Describing the data using RO-Crate makes the content even FAIRer than it already is by using an open standard and working in accessible formats (JSON linked data, JSON-LD). The demonstrator that we developed to test these ideas currently has more than 70TB of content described as RO-Crates and stored as OCFL objects. It comprises a single page webapp (SPA) served via an nginx server along with an Elasticsearch service. Deployed as Docker containers, the service is easy to manage and trivial to scale. That said, in the time we’ve been developing it we have not seen the need to scale it in any way. Further, the service remains performant regardless of the amount of data in the underlying OCFL filesystem. Traditional API-based repository technologies can suffer performance degradation as they scale to ever larger sizes, but we have not seen this in our demonstrator with these technologies.
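For readers who have not seen an RO-Crate, the sketch below builds a minimal metadata document of the kind that describes an item. It assumes RO-Crate 1.1 (the context and profile URLs and the `ro-crate-metadata.json` descriptor come from that specification); the `build_crate` helper and every value passed to it are hypothetical, and the demonstrator's own crates will carry far richer description.

```python
import json

RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"
RO_CRATE_PROFILE = "https://w3id.org/ro/crate/1.1"


def build_crate(name, description, date_published, license_url, files):
    """Return a minimal RO-Crate metadata document as a Python dict.

    `files` is a list of (relative_path, media_type) tuples.
    """
    file_entities = [
        {"@id": path, "@type": "File", "encodingFormat": media_type}
        for path, media_type in files
    ]
    # The metadata descriptor identifies the file as an RO-Crate
    # and points at the root dataset.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": RO_CRATE_PROFILE},
        "about": {"@id": "./"},
    }
    # The root dataset describes the item and lists its files.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": entity["@id"]} for entity in file_entities],
    }
    return {"@context": RO_CRATE_CONTEXT, "@graph": [descriptor, root, *file_entities]}


def to_json(crate: dict) -> str:
    """Serialise the crate as it would be saved to ro-crate-metadata.json."""
    return json.dumps(crate, indent=2, ensure_ascii=False)
```

Because the result is plain JSON-LD, it can be stored alongside the content inside an OCFL object and indexed by a search service without depending on any particular repository application.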

So how does this relate to the Science Mesh?

As it turns out, the CS3MESH4EOSC project, which is behind the Science Mesh, has been thinking about similar issues. What tools are required to make research data FAIR and how do you apply them at scale to handle the often massive datasets coming from scientific collaborations? The research data management lifecycle consists of a number of phases, typically focussed on collection and analysis, but data description to support long term preservation is not always well defined; in many cases, it’s not even considered. Not so at PARADISEC, which has had processes and systems in place to ensure appropriate description from the very beginning of the collection / analysis cycle.

By collaborating with the Science Mesh, we (PARADISEC, Peter Sefton) are helping to define and build the tools that form part of a scalable, performant and well described data ecosystem. Indeed, one of the contributions from our partners at UTS is Describo, a tool for creating and updating RO-Crates. Describo allows researchers to describe their data using an open community standard so that it is ready for preservation, dissemination and reuse. Describo is the tool to make data FAIR.

The Science Mesh, as a partner in the development of Describo, will be adapting it to work with the mesh infrastructure to form a key description tool available to all users of the mesh. Further, conversations about what to do with the soon-to-be well described data are already occurring. Where does the data go when it needs to be published / archived / shared and how does that process look? What types of new services are enabled by using these open standards and having the data in this form?

Indeed, these questions are being considered by an even wider community of global RO-Crate and OCFL users. So to that end, it’s humbling to think that a demonstrator born from a small Australian research grant has opened up access to an international community of researchers, developers and systems all facing similar challenges and collaborating to develop the next generation of scalable and performant FAIR data services.

More information

For more information on PARADISEC and the team’s leadership in this area, please contact Marco La Rosa or the AARNet eResearch team.
