"Cloud computing allowed us to speed up the quality control process. We collaborated with Verily and the Broad Institute and used the Terra platform to both implement a standard processing workflow for whole genome sequence data and perform the quality control measurements needed for us to be confident in the data to be shared with the research community. We expect researchers will find these and other notebooks valuable resources for their own analytical pipelines. We look forward to future collaborations and seeing what tools the community creates for the analysis of this data."
- Lead WGS WG Scientist
The AMP PD public-private partners, through the AMP PD Whole Genome Sequencing Working Group (WGS WG), identified quality control (QC) measurements to apply to AMP PD WGS data, ensuring that the thousands of whole genome sequences generated met the inclusion criteria for release on the AMP PD Knowledge Platform.
After a due diligence process, the WGS WG elected to carry out this QC work in Jupyter Notebooks run on the Terra platform, enabling efficient collaboration on the analysis of 4,047 genomes.
As part of the QC, the WGS WG team performed concordance checks with NeuroX data and with the gender identified in associated clinical data. As a result of this effort, these Jupyter notebooks are now available in the AMP PD Terra workspace for others in the research community to use. This approach ensures transparency and reproducibility of AMP PD data.
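As a rough illustration of what a concordance check against clinical data can look like, the sketch below compares a genetically inferred label per sample with the value recorded in the clinical data and lists the samples that disagree. It is a minimal sketch, not the WGS WG notebook code: the function name, the sample IDs, and the label values are all hypothetical, and it assumes both inputs are simple mappings from sample ID to label.

```python
from typing import Dict, List


def find_discordant_samples(
    inferred: Dict[str, str],
    reported: Dict[str, str],
) -> List[str]:
    """Return IDs of samples whose inferred label differs from the clinical record.

    Both arguments are hypothetical inputs: sample ID -> label (e.g. "male"/"female").
    """
    discordant = []
    for sample_id, inferred_label in inferred.items():
        reported_label = reported.get(sample_id)
        # Samples with no clinical record are skipped, not counted as mismatches.
        if reported_label is None:
            continue
        # Compare case-insensitively and ignore surrounding whitespace.
        if inferred_label.strip().lower() != reported_label.strip().lower():
            discordant.append(sample_id)
    return sorted(discordant)


# Hypothetical example data, for illustration only.
inferred_labels = {"S1": "female", "S2": "male", "S3": "male"}
clinical_labels = {"S1": "Female", "S2": "female", "S4": "male"}
print(find_discordant_samples(inferred_labels, clinical_labels))  # ['S2']
```

In this sketch, samples flagged as discordant would be candidates for manual review; how such samples are actually handled is a decision for the QC team and is not specified here.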
Accelerating Discovery Through Cloud Computing
"Cloud computing allowed us to speed up discovery. We collaborated with Verily and the Broad Institute to test varying implementations of the standard processing pipeline for exome sequence data on both the cohort and population scale."
- AMP PD Collaborator
To make real scientific discoveries possible from so many sources of data, the data had to be reanalyzed for consistency. To reduce the possibility of technical artifacts, scientists had to perform realignment, recalibration, and re-genotyping of exomes. But there was a problem: none of the consortium members had enough local computational resources.
The team decided to use a fully managed service on the Google Cloud Platform. Scientists ran the Broad Institute’s GATK Best Practices pipeline using Google Genomics, processing the exomes (starting with raw, unaligned sequence data and leading to a set of variant calls) in just three and a half weeks. The dataset was subsequently used to identify six new risk loci for Parkinson’s disease, helping scientists better understand genetic risks for the disease.
Even if hardware could have been procured, the effort would have taken months of compute time on local infrastructure. The Google Cloud Platform gives scientists access to virtually unlimited compute resources, so massive datasets can now be analyzed for large-scale projects.