We are delighted to share that our collaboration with the Google Brain team on applying machine learning to genomics is now available as open source. DeepVariant uses deep learning to reconstruct genome sequences from high-throughput sequencing data with significantly greater accuracy than prior methods. The development team has worked closely with Google Cloud Platform (GCP) to make it easy to run DeepVariant workflows on GCP, providing flexibility and scale of execution.
Below is the original announcement posted on Google Research and Open Source blogs.
By Mark DePristo and Ryan Poplin, Google Brain Team
Across many scientific disciplines, but in particular in the field of genomics, major breakthroughs have often resulted from new technologies. From Sanger sequencing, which made it possible to sequence the human genome, to the microarray technologies that enabled the first large-scale genome-wide experiments, new instruments and tools have allowed us to look ever more deeply into the genome and apply the results broadly to health, agriculture and ecology.
One of the most transformative new technologies in genomics was high-throughput sequencing (HTS), which first became commercially available in the early 2000s. HTS allowed scientists and clinicians to produce sequencing data quickly, cheaply, and at scale. However, the output of HTS instruments is not the genome sequence for the individual being analyzed — for humans this is 3 billion paired bases (guanine, cytosine, adenine and thymine) organized into 23 pairs of chromosomes. Instead, these instruments generate ~1 billion short sequences, known as reads. Each read represents just 100 of the 3 billion bases, and per-base error rates range from 0.1-10%. Processing the HTS output into a single, accurate and complete genome sequence is a major outstanding challenge. The importance of this problem, for biomedical applications in particular, has motivated efforts such as the Genome in a Bottle Consortium (GIAB), which produces high confidence human reference genomes that can be used for validation and benchmarking, as well as the precisionFDA community challenges, which are designed to foster innovation that will improve the quality and accuracy of HTS-based genomic tests.
DeepVariant transforms the task of variant calling, as this reconstruction problem is known in genomics, into an image classification problem well-suited to Google's existing technology and expertise.
We started with GIAB reference genomes, for which there is high-quality ground truth (or the closest approximation currently possible). Using multiple replicates of these genomes, we produced tens of millions of training examples in the form of multi-channel tensors encoding the HTS instrument data, and then trained a TensorFlow-based image classification model to identify the true genome sequence from the experimental data produced by the instruments. Although the resulting deep learning model, DeepVariant, had no specialized knowledge about genomics or HTS, within a year it had won the the highest SNP accuracy award at the precisionFDA Truth Challenge, outperforming state-of-the-art methods. Since then, we've further reduced the error rate by more than 50%.
DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology to solve real world problems. To further this goal, we partnered with Google Cloud Platform (GCP) to deploy DeepVariant workflows on GCP, available today, in configurations optimized for low-cost and fast turnarounds using scalable GCP technologies like the Pipelines API. This paired set of releases provides a smooth ramp for users to explore and evaluate the capabilities of DeepVariant in their current compute environment while providing a scalable, cloud-based solution to satisfy the needs of even the largest genomics datasets.
DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community. This is all part of a broader goal to apply Google technologies to healthcare and other scientific applications, and to make the results of these efforts broadly accessible.