How Broad and Verily are creating tools together to enable data sharing and analysis

Since we started working on Google Genomics in 2013, I’ve been excited about the potential for cloud-scale data platforms to accelerate population-scale life science. A key collaborator in that effort has been the team at the Broad Institute, who bring a shared vision and complementary strengths. At Verily, that collaboration continues -- see below for an update on how the vision is growing and becoming real.

The post below originally appeared on the Broad Institute blog.
Posted by Kristian Cibulskis, Director, Platform Engineering, Broad Institute of MIT and Harvard

Today, we are writing to share how the collaboration between the Broad Institute and Verily Life Sciences is expanding to create and disseminate open-source tools for storing, sharing, and analyzing life sciences data.

We have been working with Verily on several efforts for more than a year.

First, both of our organizations are among a group of collaborators funded by the National Institutes of Health (NIH), including Vanderbilt University, the University of Michigan, and Columbia, that is developing open-source computational tools for the All of Us Research Program, NIH’s effort to gather data from one million or more people living in the United States to accelerate research and improve health.

Second, we have worked with Verily to migrate our production data processing environment for genome sequencing to the cloud. This builds on earlier work with Google Genomics, in which we optimized the overall framework for running pipelines, moving from an environment that relied heavily on local computing and storage to one that operates in the cloud. That work produced two new components--the Cromwell workflow engine and the Pipelines API--that are now used to process all genomic data that the Broad generates (read more in this 2016 Google Research Blog post). Now, we are helping Verily adopt these same analytical tools, and incorporate the best practices of Broad’s sequencing center, for its own genome sequencing operations.

In the course of the collaboration, we saw firsthand that impactful science increasingly involves collaborations across institutions and countries, and that it no longer makes sense for every group to develop analytical tools and data environments in isolation. We both believe that tools should be interoperable and openly available: this avoids duplicated effort, maximizes the opportunities for data sharing and rapid adoption of tools, and reduces the needless sending, copying, and storing of vast amounts of information.

A key consideration is ensuring that the software is open source. The Broad Data Sciences Platform is already committed to making all of its software open source. Verily is also committed to contributing to the open-source community and has agreed to fund some software engineers at Broad to create open-source tools.

Both organizations have contributed to genomic variant callers that we share freely. Broad has long made its GATK software widely available, and recently changed the licensing so that it is now open source. Verily and the Google Brain Team recently released DeepVariant, a variant caller that uses deep neural networks, as open source on GitHub, and Verily will provide additional opportunities to support open-source software development for the life sciences through efforts like Summer of Code.

Our collaboration follows the Data Biosphere framework, a set of principles recently described by authors in academia and industry. These principles promote open-source sharing of tools and standards-based interoperability in computational biology.

In addition to requiring that the data itself always be secure, the Data Biosphere framework includes a stewardship principle: biomedical data environments should act as data custodians, not data owners. That is, software services should not presume to have the right to use, sell, or control access to third-party data. Compared with the data handled by consumer technology products, medical data entails greater responsibilities to patients and participants, including protecting patient privacy and ensuring that only appropriate secondary use of data is permitted.

We recently released the first two components created from our collaboration--a user interface for monitoring batch processing jobs, and a service to facilitate launching Jupyter notebooks. We plan to add new components in the coming months and years.

Importantly, this collaboration is just one component of an ecosystem of collaborations. We are part of a network of research groups that is working together, in various combinations, on a number of flagship scientific projects, including All of Us, the Human Cell Atlas, the NCI Genomic Data Commons and Cloud Resources, and the NIH Data Commons. All of these groups have embraced the Data Biosphere principles of open-source licensing, modularity, standardization, and community engagement.

We hope researchers find these open-source components useful.

by David Glazer, Engineering Director, Verily