Enhancing Data Access and Analysis in the Cloud Advances NIH-Supported Discovery

Guest blog post authored by:

  • Andrea Norris, MBA, Director, Center for Information Technology (CIT), and Chief Information Officer (CIO), NIH
  • Patricia Flatley Brennan, RN, PhD, FAAN, Director, National Library of Medicine, NIH
  • Susan Gregurick, PhD, Associate Director for Data Science, NIH; Director of the NIH Office of Data Science Strategy

To fully benefit from the exponentially growing body of biomedical data, we need cutting-edge approaches that foster data access, analysis, sharing, and collaboration so novel scientific questions can be pursued. But the sheer volume of these data, their sometimes siloed nature, and the cost and time required to analyze large datasets can pose difficulties for some researchers. Recognizing these concerns, NIH is helping by hosting large datasets and bringing together computational tools and cloud technologies in ways that support open access, interoperability, and collaborative analyses. We encourage you to explore how these resources may help accelerate your research in ways not possible before.

In recent years, NIH has invested in cloud computing and other platforms to create shared environments enabling opportunities for better data access and novel analyses. These investments are positively shifting how researchers interact with COVID-19, genomic, imaging, proteomic, and other NIH-supported large datasets.

The NIH Strategic Plan for Data Science suggests that fostering data access and use by investing in cloud computing will have many benefits. For one, researchers can retrieve data more quickly than before, without needing to copy and store any data locally. They can also leverage existing high-performance compute environments to scale up analyses and improve efficiency, without needing to build or maintain the underlying infrastructure. And working in the cloud may ease collaboration among government, academia, industry, and other partners to facilitate the translation of basic discoveries into novel treatments.

One NIH initiative that brings together computational tools and cloud technologies for our recipient institutions and supported investigators is the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. Launched in 2018, the STRIDES Initiative is a vehicle for NIH and NIH-funded researchers to access cloud resources through partnerships with commercial cloud providers, making it easier and more cost-effective to access large datasets and computing power. More than 60 NIH research institutions are currently taking advantage of the favorable pricing, training, and technical support available. We encourage you to learn how others are leveraging these services and to explore the available opportunities too.

What does the STRIDES Initiative offer? Researchers supported by NIH awards can receive cost-effective access to state-of-the-art, cloud-based data storage and computational capabilities, tools, and expertise. Shared tools, data hosting, and curation help reduce costs for participants. The initiative also offers learning and development opportunities and customized guidance from experts, all to help you better connect with biomedical datasets, tools, resources, and fellow scientists in new ways.

STRIDES also partnered with the National Library of Medicine’s (NLM’s) Sequence Read Archive (SRA) in a continued effort to ease data access. The partnership made over 36 petabytes of “next generation” (raw and SRA-formatted) sequencing data accessible to anyone via two cloud service providers. Now, you can search SRA’s entire catalog of genomic data in the cloud, and even use the computational tools available there for your analyses.

Before, researchers who wanted to use this archive effectively needed an efficient means to search and retrieve large datasets, sometimes on the scale of 6 petabytes (or 6,000,000 gigabytes), which could take days to download and was only possible for those with access to large-scale storage systems. These obstacles caused substantial delays at times when timely results could help address public health emergencies (see COVID-19 datasets publicly available here).
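
To give a rough sense of that scale, here is a minimal Python sketch of the unit conversion and of how download time depends on network speed. It assumes decimal (SI) units, which is what the 6,000,000-gigabyte figure implies, and an idealized, fully sustained connection. The function names and the link speeds are hypothetical choices for illustration and are not taken from NIH or SRA documentation.

```python
# Decimal (SI) units: 1 GB = 10**9 bytes, 1 PB = 10**15 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_PB = 10**15
SECONDS_PER_DAY = 86_400


def petabytes_to_gigabytes(petabytes: float) -> float:
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * BYTES_PER_PB / BYTES_PER_GB


def transfer_days(petabytes: float, link_gbps: float) -> float:
    """Idealized days needed to move `petabytes` over a link that
    sustains `link_gbps` gigabits per second (no overhead, no retries)."""
    total_bits = petabytes * BYTES_PER_PB * 8
    seconds = total_bits / (link_gbps * 10**9)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"6 PB = {petabytes_to_gigabytes(6):,.0f} GB")
    # Hypothetical sustained link speeds, chosen only for illustration.
    for gbps in (1, 10, 100):
        print(f"{gbps:>3} Gbit/s: {transfer_days(6, gbps):.1f} days")
```

Real transfers also depend on storage throughput, protocol overhead, and shared network load, so these idealized figures are best read as lower bounds.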

These cloud computing services should make data from NIH-funded research accessible to researchers in a more timely and cost-effective way. We also hope that, working together with the research community, we can create a robust, interconnected ecosystem that reduces barriers to generating, analyzing, and sharing research data. By moving data to the cloud, we can maximize our investment in research while also strengthening the transparency, rigor, and reproducibility of our supported science.
