Gene ontology and pathway analysis
Objectives
- Determine potential next steps following differential expression analysis.
- Tour geneontology.org and understand the three main ontologies.
- Learn about different methods and tools related to functional enrichment and pathway analysis.
- Get familiar with databases commonly used by popular functional enrichment tools.
Where have we been and where are we going?
Thus far we have:
- Downloaded raw RNA-Seq data (.fastq files).
- Examined raw data quality using fastqc and multiqc.
- Performed adapter and quality trimming using Trimmomatic.
- Aligned the raw sequences to a reference genome (human chromosome 22 from the GRCh38 version of the human reference genome) using HISAT2.
- Viewed and compared alignments using IGV.
- Generated a gene count matrix using featureCounts.
- Performed differential expression analysis (DESeq2).
- Generated a heatmap of differentially expressed genes.
Heatmap of differentially expressed genes (Normal vs Tumor).
You now have a potentially large list of differentially expressed genes. Now what? If you are like most biologists, you are interested in understanding these genes within their biological context.
To do that, we can examine gene ontology and perform some type of functional enrichment analysis or pathway analysis.
These types of analyses rely on gene sets, and not all gene sets represent a pathway. Gene sets are collections of genes "formed on the basis of shared biological or functional properties as defined by a reference knowledge base. Knowledge bases are database collections of molecular knowledge which may include molecular interactions, regulation, molecular product(s) and even phenotype associations" (Mathur et al. 2018).
A pathway, by contrast, is not a simple list of genes. It also includes an interaction component, usually related to a specific mechanism or process.
What is gene ontology?
Many of the tools used to understand functional enrichment will use sets of GO terms, examining GO enrichment. What do we mean by GO?
The Gene Ontology (GO) provides a framework and set of concepts for describing the functions of gene products from all organisms. --- https://www.ebi.ac.uk/ols/ontologies/go.
This is manually curated by team members of the GO consortium.
There are two parts to the gene ontology: (Check out https://www.youtube.com/watch?v=6Am2VMbyTm4 for a more detailed overview)
- the ontology (the GO terms and their hierarchical relationships), which forms a directed acyclic graph (nodes = GO terms, edges = relationships)
- the annotations (the annotated genes linked to various GO terms)
Image from https://www.ebi.ac.uk/QuickGO/GTerm?id=GO:0031436 .
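To make the two parts concrete, here is a minimal sketch of an ontology stored as a directed acyclic graph plus a separate annotation table. The term names and gene names are made up for illustration (real GO terms use identifiers such as GO:0031436), and the rule that a gene annotated to a term is also implicitly annotated to that term's ancestors is applied in its simplest form.

```python
# Toy ontology: each term maps to its parent terms (the edges of the graph).
# Names are hypothetical; they are not real GO terms.
PARENTS = {
    "root_process": set(),
    "localization": {"root_process"},
    "transport": {"localization"},
    "ion_transport": {"transport"},
    "signaling": {"root_process"},
    # A term with two parents: this is why GO is a graph and not a tree.
    "ion_signaling": {"ion_transport", "signaling"},
}

# Toy annotations: each gene maps to the terms it is directly annotated to.
ANNOTATIONS = {
    "GENE_A": {"ion_signaling"},
    "GENE_B": {"transport"},
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent edges from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagated_terms(gene, annotations=ANNOTATIONS):
    """Direct annotations of `gene` plus all ancestors of those terms."""
    terms = set(annotations[gene])
    for term in annotations[gene]:
        terms |= ancestors(term)
    return terms


print(sorted(propagated_terms("GENE_A")))
```

Keeping the graph and the annotations in separate structures mirrors the split described above: either one can be updated without touching the other.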
What is a GO term?
- GO terms provide information about a gene product
- GO terms as a vocabulary are species agnostic, but there are species constraints
- ontology and annotations are updated regularly
- computer readable - suitable for bioinformatics
GO integrates information about gene product function in the context of three domains:
- molecular function (F) - "the molecular activities of individual gene products" (e.g., kinase)
- cellular component (C) - "where the gene products are active" (e.g., mitochondria)
- biological process (P)* - "the pathways and larger processes to which that gene product’s activity contributes" (e.g., transport)
*Commonly used for pathway enrichment analysis.
What is GO Slim? Reducing Semantic Similarity
There is a lot of redundancy in pathway analysis results, and there are different tools that can be used for reducing this redundancy (e.g., DAVID's Functional Annotation Clustering, REVIGO, GOSemSim). Alternatively, users can focus on sets of terms at different levels in the GO hierarchy, which range from broad to more specific. There are also GO Slim terms, which are simplified subsets from GO. GO slim terms are useful for providing a high level summary of functions.
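As a rough illustration of how a slim gives a high-level summary, the sketch below climbs from each detailed term toward the root and stops at the first slim term it meets on each path. The ontology, the slim subset, and the list of enriched terms are all hypothetical, and this is a simplified stand-in for what dedicated slim-mapping tools do.

```python
from collections import Counter

# Hypothetical ontology: term -> list of parent terms.
PARENTS = {
    "root_process": [],
    "transport": ["root_process"],
    "signaling": ["root_process"],
    "ion_transport": ["transport"],
    "calcium_ion_transport": ["ion_transport"],
    "calcium_signaling": ["calcium_ion_transport", "signaling"],
}

# Hypothetical slim: a small, hand-picked subset of broad terms.
SLIM_TERMS = {"transport", "signaling"}


def map_to_slim(term, parents=PARENTS, slim=SLIM_TERMS):
    """Return the nearest slim term(s) at or above `term` on each path."""
    found, seen, stack = set(), set(), [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing along this path
        else:
            stack.extend(parents[current])
    return found


# Summarize a (made-up) list of detailed enriched terms by slim category.
detailed_hits = ["calcium_ion_transport", "ion_transport", "calcium_signaling"]
summary = Counter(s for term in detailed_hits for s in map_to_slim(term))
print(summary)
```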
Check out geneontology.org to learn more about GO.
Approaches to gene set analysis / pathway analysis?
Functional enrichment and pathway analysis have broad and varying definitions. For our purposes, there are three general approaches: 1. Over-Representation Analysis (ORA), 2. Functional Class Scoring (FCS), and 3. Pathway Topology (PT) (Khatri et al. 2012).
Examining genes in a set allows us to:
- increase the statistical power in our analysis
- ease interpretation
- predict new roles for genes
- better integrate data from different methods
Over-representation analysis (ORA)
statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression --- Khatri et al. 2012
From this, ORA determines which pathways are over- or under-represented by asking "are there more annotations in the gene list than expected?"
Things to know:
- Uses tests such as the hypergeometric test, Fisher's exact test, the chi-square test, or the binomial test; these estimate the probability that the observed overlap between our gene list and a given gene set would arise by chance.
- Assumes independence between genes.
- Requires an appropriate background gene set for comparison. This could be:
  - all genes in the organism of interest
  - all protein coding genes
  - only the genes that are measured (microarray)
  - only the genes that are expressed (RNA-Seq)
  - genes in a gene set collection
- Prioritizes a subset of genes using an arbitrary, user determined threshold.
- Doesn't require the data, just the gene identifiers.
- Example tools include DAVID and Qiagen IPA.
To understand the statistics behind ORA, see this guide from Pathway Commons and this video from Biostatsquid.
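The over-representation question can be sketched with a one-sided hypergeometric test, one of the tests listed above. The counts below are invented for illustration: N is the size of the background, K the number of background genes in the gene set, n the size of the differentially expressed gene list, and k the overlap between the list and the set.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """P(overlap >= k) when n genes are drawn at random from a background
    of N genes, K of which belong to the gene set."""
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("counts are inconsistent with each other")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Hypothetical numbers: 12,000 expressed genes as background, a gene set
# with 150 members, 400 DE genes, 15 of which fall in the gene set.
N, K, n, k = 12000, 150, 400, 15
expected_overlap = n * K / N  # overlap expected by chance alone
p_value = hypergeom_enrichment_p(N, K, n, k)
print(f"expected overlap: {expected_overlap:.1f}, observed: {k}, p = {p_value:.3g}")
```

Changing N (the background) changes the expected overlap and therefore the p-value, which is why the choice of background matters so much for ORA.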
Functional Class Scoring (FCS)
Includes ‘gene set scoring’ methods such as GSEA, which first compute DE scores for all genes measured, and subsequently compute gene set scores by aggregating the scores of contained genes. --- Geistlinger et al. 2021
These methods are more sensitive than ORA methods, but are more challenging to implement.
GSEA
- Ignores gene position and role
- Does not pre-select genes; considers all gene expression in the form of a ranked list. You must include data with gene identifiers for ranking.
- Ranking by magnitude of change in gene expression between conditions and p-value.
  - Top = upregulated + significant
  - Bottom = down-regulated + significant
  - Middle = non-significant
- Determines where genes from a gene set fall in the ranking (i.e., at the top of the list or the bottom of the list).
- Creates a running-sum statistic (a weighted Kolmogorov-Smirnov-like statistic).
- Uses a permutation approach to determine significance.
- Broad Institute software but also available using web-based tools, R (See fgsea for a modified approach), and Qlucore (proprietary).
- also considered a strategy encompassing a range of methods
- self-contained methods vs competitive methods
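One way to build a ranked list with the top/middle/bottom layout described above is to combine the direction of change with the p-value. The metric used here, sign(log2 fold change) × −log10(p), is an assumption chosen for illustration; it is one of several ranking metrics in use and not necessarily what a given tool applies. Gene names and values are made up.

```python
import math


def rank_score(log2_fold_change, p_value, min_p=1e-300):
    """Signed significance: large positive for significant up-regulation,
    large negative for significant down-regulation, near zero otherwise."""
    p = max(p_value, min_p)  # avoid log10(0) for p-values reported as 0
    sign = 1.0 if log2_fold_change >= 0 else -1.0
    return sign * -math.log10(p)


# Hypothetical DE results: gene -> (log2 fold change, p-value).
de_results = {
    "GENE_UP": (2.0, 1e-10),
    "GENE_NS": (0.1, 0.8),
    "GENE_DOWN": (-3.0, 1e-8),
}

ranked = sorted(de_results, key=lambda g: rank_score(*de_results[g]), reverse=True)
print(ranked)
```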
Check out this video for a nice overview of GSEA.
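The running-sum idea can be sketched in a few lines. This is a simplified, unweighted version under stated assumptions: every gene-set member adds an equal step up and every other gene an equal step down, and significance is estimated by drawing random gene sets of the same size. The Broad software differs in both respects (it weights steps by the ranking metric and, by default, permutes sample labels), so treat this as an illustration of the concept only.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Walk down the ranked list and return the largest deviation of the
    running sum from zero (positive = enriched at the top of the list)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap part, not all, of the list")
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Fraction of random same-size gene sets scoring at least as extreme."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(gene in gene_set for gene in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Hypothetical ranked list (most up-regulated first) and gene set.
ranked = [f"g{i}" for i in range(1, 11)]
top_set = {"g1", "g2"}
print(enrichment_score(ranked, top_set), permutation_p_value(ranked, top_set))
```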
What is MSigDB and how does it relate to GSEA?
The Molecular Signatures Database (MSigDB) is a curated resource of thousands of gene sets by the Broad Institute. These sets were curated for use with GSEA software but are used with other tools as well.
- Includes both human and mouse collections.
- There are 34,837 gene sets in the Human Molecular Signatures Database (MSigDB) (not all gene sets are related to pathways).
- Includes 9 larger, themed collections. Highlighted examples:
  - C5 - the gene ontology (GO) collection
  - C2 - curated gene sets from publications and pathway databases (e.g., KEGG and REACTOME).
  - Hallmark collection - a summary list of well-defined gene sets with decreased redundancy. A good choice for many studies.
  - C7 - great for immunological research.
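Gene set collections are commonly distributed as GMT files, a tab-delimited text format with one gene set per line: set name, a description field, then the member genes. The format is not covered in this lesson, so the parser below is a minimal sketch under that assumption, and the two gene sets in the example text are invented rather than taken from MSigDB.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {gene_set_name: set_of_genes}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected name, description, genes: {line!r}")
        name, _description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Made-up example content; real files would be read from disk instead.
EXAMPLE_GMT = (
    "TOY_SET_ONE\thttps://example.org/toy1\tTP53\tMDM2\tCDKN1A\n"
    "TOY_SET_TWO\tNA\tBRCA1\tTP53\n"
)

gene_sets = parse_gmt(EXAMPLE_GMT)
for name, genes in gene_sets.items():
    print(name, len(genes))
```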
Pathway Topology
ORA and FCS discard a large amount of information. These methods use gene sets, and even if the gene sets represent specific pathways, structural information such as gene product interactions, positions of genes, and types of genes is completely ignored. Pathway topology methods seek to rectify this problem.
PT methods are mostly considered network based.
Some examples:
Impact analysis (iPathwayGuide)
constructs a mathematical model that captures the entire topology of the pathway and uses it to calculate a perturbation for each gene. Then, these gene perturbations are combined into a total perturbation for the entire pathway and a p-value is calculated by comparing the observed value with what is expected by chance. (https://advaitabio.com/ipathwayguide/more-accurate-pathway-rankings-using-impact-analysis-instead-of-enrichment/)
Other tools include Pathway-Express, SPIA, NetGSA, TPEA, etc. (See Nguyen et al. 2019 for a review of PT methods.)
These methods have been shown to produce more accurate results when the user wishes to understand the types and directions of gene interactions and underlying mechanisms. However, such methods often “require experimental evidence for pathway structures and gene–gene interactions, which is largely unavailable for many organisms.”
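To see why topology can change the picture, consider a toy propagation rule: each gene's perturbation is its own log fold change plus the signed perturbation passed down from its regulators, with each regulator's signal split evenly across its targets. The pathway, the fold changes, and the rule itself are assumptions made for illustration; this is not the algorithm used by iPathwayGuide or any other tool named above, and it only handles pathways without feedback loops.

```python
# Hypothetical pathway: regulator -> list of (target, beta), where
# beta = +1.0 for activation and -1.0 for inhibition.
EDGES = {
    "RECEPTOR": [("KINASE", 1.0)],
    "KINASE": [("TF", 1.0), ("INHIBITOR", 1.0)],
    "INHIBITOR": [("TF", -1.0)],
    "TF": [],
}

# Hypothetical measured log2 fold changes.
LOG_FC = {"RECEPTOR": 2.0, "KINASE": 0.0, "INHIBITOR": 0.0, "TF": 1.0}


def perturbations(edges, log_fc):
    """Per-gene perturbation on an acyclic pathway (toy rule, see above)."""
    upstream = {gene: [] for gene in edges}
    for source, targets in edges.items():
        for target, beta in targets:
            upstream[target].append((source, beta))

    cache = {}

    def perturbation(gene):
        if gene not in cache:
            inherited = sum(
                beta * perturbation(source) / len(edges[source])
                for source, beta in upstream[gene]
            )
            cache[gene] = log_fc.get(gene, 0.0) + inherited
        return cache[gene]

    return {gene: perturbation(gene) for gene in edges}


per_gene = perturbations(EDGES, LOG_FC)
total_with_topology = sum(abs(value) for value in per_gene.values())
total_without_topology = sum(abs(value) for value in LOG_FC.values())
print(per_gene, total_with_topology, total_without_topology)
```

In this toy case KINASE shows no expression change of its own but still receives a perturbation from RECEPTOR, which a plain gene list would miss.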
What tools are available?
This is neither a comprehensive list of tools nor an endorsement of certain tools, but rather a list of semi-popular tools with different approaches.
Note
There are a ton of tools out there. Be aware of the background methods used and the quality of results returned. Also, check to see when the tool was last updated.
Table from Geistlinger et al. 2020
Other and related tools
- Gene Set Enrichment Analysis (GSEA)
- EnrichmentMap
- REVIGO (reducing and visualizing gene ontology)
- Pathview
- iPathwayGuide (proprietary)
- Qiagen IPA (proprietary, CCR license)
- Qlucore (proprietary, CCR license)
- GeneMANIA
- CellNetAnalyzer
- PARADIGM
Other databases
There are many databases devoted to relating genes and gene products to pathways, processes, and other phenomena. Again, the following is not meant to be a comprehensive list.
Kyoto Encyclopedia of Genes and Genomes (KEGG)
- Curated database
- Biological pathways
- Molecular interaction networks
- Includes very nice pathway maps - metabolic pathways, disease pathways, drug-target interactions, etc.
- System-level integration
- Restricted licenses
The Reactome Knowledgebase systematically links human proteins to their molecular functions, providing a resource that functions both as an archive of biological processes and as a tool for discovering novel functional relationships in data such as gene expression studies or catalogs of somatic mutations in tumor cells. --- Jassal et al. 2019
- Curated database including metabolism, signaling, and other biological processes
- Human specific
- Also includes disease super-pathways
- Several built-in pathway analysis tools
- A meta-database of pathways from other pathway databases
- Standardized format
The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. The core of PANTHER is a comprehensive, annotated “library” of gene family phylogenetic trees. --- pantherdb.org/about.jsp
- Especially useful when considering evolutionary relationships
- Community driven meta-database of pathways
- A great source of less well-established pathways
The NDEx Project provides an open-source framework where scientists and organizations can store, share, manipulate, and publish biological network knowledge. --- ndexbio.org
HumanCyc (See BioCyc)
HumanCyc provides an encyclopedic reference on human metabolic pathways, the human genome, and human metabolites. --- humancyc.org
- A knowledgebase of genes and associated diseases.
- Looking for a specific database? Pathguide contains a resource list of pathways searchable by organism and resource type.
- Most recent update was 2017
Importance of Gene IDs
To use various tools for functional analysis, you will need a list of annotated genes. Gene annotations come in a variety of flavors and not all flavors are compatible with every tool. For example, Gene Ontology (GO) is associated with Entrez, Ensembl, and official gene symbols (assigned by the HUGO Gene Nomenclature Committee (HGNC)).
Note: Genome builds will have differences in the names and coordinates of genomic features, which will impact gene ID conversions. See this tutorial from the Harvard Chan Bioinformatics Core.
Some tools to help with annotation / conversion:
- g:Convert
- Ensembl Biomart
- AnnotationHub
- DAVID Gene ID Conversion
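Whichever tool you use, it helps to track which identifiers failed to convert instead of silently dropping them. The sketch below assumes a lookup table has already been exported from one of the tools above; here it is replaced by a tiny hand-written dictionary. It also strips the version suffix that Ensembl IDs often carry (for example, ".18"), since versioned and unversioned IDs will not match otherwise. The function names are hypothetical.

```python
def strip_version(ensembl_id):
    """Drop a trailing version suffix, e.g. 'ENSG00000141510.18'."""
    return ensembl_id.split(".", 1)[0]


def convert_ids(ids, mapping):
    """Return (converted, unmapped) for a list of Ensembl gene IDs."""
    converted, unmapped = {}, []
    for raw_id in ids:
        key = strip_version(raw_id)
        if key in mapping:
            converted[raw_id] = mapping[key]
        else:
            unmapped.append(raw_id)
    return converted, unmapped


# Tiny stand-in for a real conversion table (Ensembl gene ID -> symbol).
ENSEMBL_TO_SYMBOL = {
    "ENSG00000141510": "TP53",
    "ENSG00000012048": "BRCA1",
}

# The last ID is a deliberately fake placeholder that should not map.
query = ["ENSG00000141510.18", "ENSG00000012048", "ENSG_NOT_A_REAL_ID"]
converted, unmapped = convert_ids(query, ENSEMBL_TO_SYMBOL)
print(converted)
print("unmapped:", unmapped)
```

Reporting the unmapped IDs makes it easier to spot problems caused by mismatched genome builds or annotation versions.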
Other Considerations
- Describe what method(s) you use for functional enrichment or pathway analysis clearly in any resulting publications. This should include parameters used and software versions.
- For ORA, select appropriate background genes.
- Use p-values corrected for multiple comparisons. Whenever more than one gene set is tested, the p-values should be corrected to limit false positives.
- Many of the classic methods developed within the above approaches were designed specifically for transcriptomic data. Other types of -omic data (e.g., proteomics, metabolomics, scRNA-seq, GWAS) differ intrinsically from transcriptomic data, and these differences should be considered when selecting a method. For methods/tools specific to proteomics, GWAS, epigenomics, and multi-omics, see Zhao and Rhee 2023.
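As an example of the multiple-comparison correction mentioned above, here is a minimal sketch of the Benjamini-Hochberg procedure, which controls the false discovery rate. It is one common choice among several (Bonferroni is another), and the p-values are made up. Most enrichment tools apply a correction for you, so this is mainly to show what the adjusted values mean.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values, one per tested gene set.
raw = [0.01, 0.04, 0.03, 0.5]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```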
Resources:
- ClusterProfiler tutorial
- Functional enrichment and comparison with R
- ClusterProfiler, pathview, and good introductory information
- Article on the impact of the evolving GO
- Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges, PLOS Computational Biology, 2012
- Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap
- Toward a gold standard for benchmarking gene set enrichment analysis, Briefings in Bioinformatics, 2021
- Introductory lectures on functional enrichment and R from DIY Transcriptomics