Cancer research is a complex and data-intensive field. Cloud computing offers a powerful solution for researchers to store, analyze, and share large datasets efficiently. In this month’s topic spotlight, we will explore cloud resources available to NCI researchers.
First, let’s take a step back and define cloud computing. Cloud computing is essentially access to IT services over the internet. IT resources are available on demand under a pay-as-you-go model, meaning you pay only for what you use. Other advantages of cloud computing include:
- Flexibility and scalability – the ability to scale computational resources up or down based on need.
- Access to pre-configured ecosystems – these may include data and specialized software and tools to conduct different types of analyses.
- Ease of collaboration – cloud computing provides access to computational resources wherever an internet connection is available, which can facilitate collaboration with extramural researchers.
- Security – compliance with government regulations to protect sensitive data.
Cloud computing provides access to massive datasets like those found in cloud-based data repositories such as the Cancer Research Data Commons (CRDC) and All of Us Research Hub. Additionally, cloud computing empowers researchers to scale computational resources dynamically, enabling them to tackle the analysis of millions of data points from thousands of patients. However, many researchers may be reluctant to adopt cloud-based resources due to various challenges including information overload, inadequate training, and uncertainties regarding costs. The lack of coordination and integration among NIH Institutes and Centers can hinder the discovery of available resources. Even when resources are identified, their usage often remains unclear due to insufficient training. Furthermore, while cloud-based resources often follow a pay-as-you-go model, the underlying cost calculation models can be obscure, leading to the circulation of alarming tales of exorbitant bills.
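To make the pay-as-you-go idea a little more concrete, here is a minimal sketch of how a monthly bill might be estimated from three common cost drivers: compute time, storage, and data egress. The rates, the `Rates` class, and the `estimate_monthly_cost` function are all hypothetical and chosen only for illustration; they are not real prices from any cloud provider or from STRIDES, and actual billing models typically include more line items.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices in USD. These are NOT real provider prices."""
    compute_per_vcpu_hour: float = 0.05
    storage_per_gb_month: float = 0.02
    egress_per_gb: float = 0.10


def estimate_monthly_cost(vcpus, hours, storage_gb, egress_gb, rates=Rates()):
    """Return a rough monthly cost breakdown under the assumed rates.

    vcpus      -- number of virtual CPUs in use
    hours      -- hours those vCPUs run during the month
    storage_gb -- gigabytes stored for the whole month
    egress_gb  -- gigabytes downloaded out of the cloud
    """
    compute = vcpus * hours * rates.compute_per_vcpu_hour
    storage = storage_gb * rates.storage_per_gb_month
    egress = egress_gb * rates.egress_per_gb
    return {
        "compute": round(compute, 2),
        "storage": round(storage, 2),
        "egress": round(egress, 2),
        "total": round(compute + storage + egress, 2),
    }


if __name__ == "__main__":
    # Example: a 16-vCPU machine used for 40 hours, 2 TB stored, 100 GB downloaded.
    estimate = estimate_monthly_cost(vcpus=16, hours=40, storage_gb=2000, egress_gb=100)
    for item, cost in estimate.items():
        print(f"{item:>8}: ${cost:,.2f}")
```

Even in this simplified model, storage and egress can rival or exceed compute, which is one reason unexpected bills tend to come from data left in place or downloaded in bulk rather than from the analysis itself.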
We hope to alleviate some of these concerns by highlighting available cloud resources and complementary training. This is not a comprehensive list, but every resource listed here is open to NCI researchers.
NIH-wide resources:
STRIDES
One of the most notable resources for access to cloud resources and services is the NIH STRIDES (Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability) Initiative. With the vision of connecting the NIH research community with technologies that accelerate discovery, STRIDES is a partnership with commercial Cloud Service Providers (CSPs) that gives NIH-supported researchers (intramural and extramural) affordable access to cloud services and environments. The STRIDES Initiative accelerates biomedical research in the cloud by simplifying access, reducing costs, lowering technological barriers, and improving processes.
If you are currently evaluating whether the cloud is right for your project or use case, STRIDES provides access to a cloud-based testing and learning environment called NIH Cloud Lab. Cloud Lab is a no-cost, 90-day program for NIH-affiliated researchers to try Amazon Web Services, Google Cloud, or Microsoft Azure in an NIH-approved environment. Users of any skill level can explore cloud computing capabilities and have access to a wide range of tutorials. For more classroom-style training, STRIDES offers discounted rates on Instructor-Led Training (ILT) via its CSP learning partners. For more information on how to purchase training or explore free self-paced training options, visit the STRIDES Training Resources site.
Once you are ready to transition to the cloud, the STRIDES team can help you make the switch to an enterprise environment with the CSP of your choice. To facilitate this and for any other questions concerning the STRIDES Initiative, reach out to the STRIDES team at STRIDES@nih.gov.
NCI Resources:
NCI’s Cloud Resources offer convenient platforms for accessing cancer-specific datasets, publicly available tools, and collaborative workspaces. These platforms require varying levels of experience with the command line and bioinformatics.
Note: Regardless of your experience level, understanding the tools and parameters used for any given analysis is helpful for customizing your workflow. That said, pre-built workflows are available for specific data types to help you get started quickly.
Seven Bridges Cancer Genomics Cloud (SB-CGC)
- Great for users with or without command line experience.
- Can add custom tools and workflows using a GUI; uses Common Workflow Language (CWL).
- Over 1,000 tools and workflows ready to use.
- Includes interactive web apps for visualizing data such as OmicsCircos.
- Integration with JupyterLab, RStudio, SAS, and Galaxy.
- Supports analysis on Amazon Web Services (AWS), Google Cloud Platform (GCP), and Azure.
- Training:
  - CGC documentation
  - CGC Onboarding part 1 and part 2
  - Monthly webinar series
  - Check out this webinar on building your own tools with SB-CGC.
  - Bi-weekly office hours are Tuesday at 10:00 AM and Thursday at 2:00 PM (Eastern US Time).
  - The next iteration of the BTEP course Bioinformatics for Beginners (coming January 2025) will walk researchers through all steps of an RNA-Seq analysis using the SB-CGC platform.
Broad Institute’s FireCloud
- Powered by Terra and includes integration with CRDC projects and datasets and others within the Terra ecosystem (e.g., Human Cell Atlas, the All of Us Research Program, and AnVIL).
- Includes production-ready pipelines but also facilitates interactive analysis and visualization through Jupyter Notebooks, RStudio, Galaxy, and IGV.
- Uses Google Cloud Platform or Azure.
- Training:
ISB’s Cancer Gateway in the Cloud (ISB-CGC)
- Requires greater experience with the command line or a willingness to learn. Users can write custom scripts in R, Python, and SQL. There is a web application for those with less computational experience.
- Greater flexibility regarding the workflow language (CWL, WDL, Snakemake, Nextflow, etc.).
- Includes Google Cloud Platform native tools and technologies including Google BigQuery for big data analytics and Google Compute Engine for complex workflow execution.
- Training:
  - ISB-CGC Documentation
  - Introductory videos
  - Tutorials and How-To Guides
  - Analyzing Cancer Data from the CRDC in the Google Cloud with the ISB-CGC Cancer Gateway in the Cloud (hosted by BTEP)
  - Using Google BigQuery and R to Analyze TCGA Data from the NCI Genomic and Proteomic Data Commons (hosted by BTEP)
Cloud platforms and resources have the potential to revolutionize cancer research. By actively exploring and adopting these resources, NCI researchers can streamline their workflows, enhance data analysis, and foster collaborative efforts. If you have not yet done so, we encourage you to check out these platforms and consider how they could advance your research. For bioinformatics-related questions, questions concerning the content of this spotlight, or requests for training on a particular bioinformatics topic, please email us at ncibtep@nih.gov. Also, check out the NIH Bioinformatics Calendar for upcoming trainings and the BTEP Video Archive for video recordings of past events.
– Alex Emmons
*STRIDES content provided by the STRIDES Team