Almost every person has a story of cancer affecting friends or family. Cancer is devastating and ubiquitous, and treating it remains one of the greatest challenges of modern medicine.
Part of the challenge of treating cancer is that it is not a single disease. Cancer is a group of around 100 different diseases, each with different causes and effects. The common theme that links these different diseases is the uncontrolled spread of damaged or mutated cells. Even within each of these cancer types, no two cancers are exactly the same.
One possible avenue for finding better cancer treatments is to study cellular proteins to learn how they interact with different cancer treatments. ProCan, one of many research projects run by the Children’s Medical Research Institute (CMRI), aims to study these cellular proteins.
Studying the proteome
ProCan’s goal is to study the proteomes of tens of thousands of cancer samples. Each cell has millions of protein molecules that function as both the machinery and the architectural components of the cell. The complete list of the different types of proteins at any one point in time is called the proteome.
As cancers consist of mutated cells, they too have millions of protein molecules, although their type and abundance differ from healthy cells. By studying the proteomes of different cancers, ProCan intends to develop better cancer diagnostic tests and treatments.
“It’s a seven-year project with more than 30 scientists. We’ve analysed over ten thousand cancer samples already, sourced from different age groups all over the world,” said Operations Manager and Acting Team Leader for ProCan’s Computational Science group, Edith Hurt.
“The knowledge base we’re developing will eventually be used by cancer clinicians to find the best treatments for specific cancers, without losing time through the need to trial different medications.”
The transfer chokepoint
ProCan relies on the creation and processing of large data sets. The team runs a host of analysis equipment, including six mass spectrometers, which generate roughly 100 gigabytes of data per day.
The team also creates thousands of digital images, each up to 10 gigabytes in size, and stores metadata at different stages of their work.
While ProCan’s locally hosted High Performance Computing (HPC) environment is capable of handling day-to-day loads, processing large cancer cohorts or running cross-cohort analyses requires additional resources that only cloud computing can provide practically and economically.
To solve this two-speed computing problem, ProCan developed a hybrid cloud implementation that enables computing tasks to be seamlessly executed on either HPC or cloud resources based on operational needs. The existing network infrastructure was a major chokepoint for the hybrid cloud implementation as well as ProCan’s backup strategy.
On the advice of ProCan’s Backend Engineer Michael Hecker, CMRI partnered with AARNet to overcome this limitation.
“Without external resources, compute jobs took too long for our researchers to perform any kind of rapid prototyping or testing of their workflow,” CMRI’s Head of ICT Mike Baker said.
“We considered getting more local computing resources, but it just wasn’t economically viable.”
The partnership with AARNet provides CMRI with a 10 gigabit-per-second internet connection and direct connections to leading cloud service providers so that researchers can quickly and easily access the compute and storage resources they need.
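To give a sense of scale, the figures quoted in this article can be turned into a rough best-case transfer time. The sketch below is a back-of-the-envelope calculation only: it assumes decimal units (1 gigabyte = 8 gigabits), a link that is fully available, and no protocol overhead, so real transfers would take longer. The function name is hypothetical and chosen purely for illustration.

```python
def ideal_transfer_seconds(size_gigabytes: float, link_gbit_per_s: float) -> float:
    """Best-case transfer time in seconds.

    Assumes decimal units (1 gigabyte = 8 gigabits), an uncontended link,
    and no protocol overhead. Real-world transfers will be slower.
    """
    size_gigabits = size_gigabytes * 8
    return size_gigabits / link_gbit_per_s


# One day of mass spectrometer output (roughly 100 gigabytes) over a 10 Gbit/s link.
daily_output_seconds = ideal_transfer_seconds(100, 10)

# A single large digital image (up to 10 gigabytes) over the same link.
large_image_seconds = ideal_transfer_seconds(10, 10)

print(f"Daily spectrometer output: about {daily_output_seconds:.0f} seconds")
print(f"One 10 gigabyte image: about {large_image_seconds:.0f} seconds")
```

Under these idealised assumptions, a full day of spectrometer data would move in a little over a minute, which illustrates why a link of this capacity stops being the limiting factor in the workflow.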
Coping with data movements
For ProCan, fast turn-around is particularly important because of the project’s size. Mass spectrometry readings from the lab produce raw data files that are sent to the Cancer Data Science team, who run them through a variety of computational methods on the hybrid cloud implementation.
Once the sample data have been processed, the proteins present in each sample can be identified and quantified. These proteins are then checked against patient data to see which treatments worked on this type of cancer. The patient data include information on the patient’s molecular profiles, treatment type and scheduling, clinical outcomes, and cancer imaging.
All of this information is combined using more computational methods and algorithms, returning results which can help build a proteomic picture of the disease and treatment options. At each stage, vast amounts of data are moved.
“Being able to transfer data without any congestion allowed us to run these workflows. The turn-around times otherwise would be impossible. It has meant we’ve been able to set up a hybrid cloud computing environment that’s efficient enough to be invisible in the day-to-day operations of our researchers,” said Baker.
The last year has seen ProCan develop new techniques to improve their workflow, as well as computational methods for acquiring consistent data from six mass spectrometers operating continuously over long time periods. In one four-month period, a block of eight different samples was subjected to 1560 mass spectrometry runs, interspersed with 5000 runs of other samples, to determine the reliability of the technologies developed by the team. The results have been published in the prestigious journal Nature Communications.
The project is continuing its high-throughput proteomics, incorporating more samples and treatment sets. This makes ProCan’s AARNet connection even more vital for a seamless research workflow.
AARNet is developing specialised infrastructure that meets the Health and Medical sector’s unique and evolving needs, not only for fast, reliable and scalable connectivity, but also for the storage, analysis, management and archiving of sensitive data. CMRI is partnering with AARNet on this initiative.