Bioinformatics Training and Education Program

Pathways and gene sets: What is functional enrichment analysis?

Whether you are measuring mRNA expression, protein expression, DNA methylation, expressed miRNAs, protein binding to DNA or RNA, etc., you will likely end up with a list of genes or gene products from which you would like to derive functional relationships. In the -omics world, functional enrichment analysis is an umbrella term encompassing approaches used to derive biological / functional meaning from gene lists. The purpose of this spotlight is to shed light on what we mean by functional enrichment analysis and point out some methodological limitations and other considerations.

Functional enrichment analysis can easily be used to integrate different types of data and identify mechanisms, functions, processes, targets, or regulators of cancer and disease. However, results are sensitive to the quality of the data going into an analysis, the method(s) used, the background genes selected for specific methods, and the knowledge base(s) used to inform these methods. See Zhao and Rhee 2023 and Geistlinger et al. 2021 for a more in-depth discussion. There are three general approaches to functional enrichment analysis: 1. Over-representation analysis (ORA), 2. Functional class scoring (FCS), and 3. Pathway topology (PT).

Over-representation analysis compares the proportion of genes associated with a gene set found in an input list versus the proportion of genes in a background gene list to determine whether a gene set is over- or under-represented; a p-value is generally assigned using Fisher’s exact test, a chi-squared test, or a similar statistical method. Over-representation analysis is conceptually easy to understand but is limited by the arbitrary thresholds used to define input lists and by statistical assumptions such as gene independence, which rarely holds true. These methods are also sensitive to the size of gene lists, performing much better when gene lists exceed a size of 50. In a review comparing 13 methods, ORA methods performed the worst, producing a greater degree of false positives.
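As a minimal sketch of the ORA calculation described above, the following Python computes a one-sided hypergeometric p-value, which is equivalent to a one-sided Fisher’s exact test for over-representation. The function name `ora_p_value` and the gene identifiers are hypothetical and chosen only for illustration; real tools additionally correct for multiple testing across many gene sets.

```python
from math import comb


def ora_p_value(input_genes, gene_set, background):
    """Return (overlap, p) for over-representation of gene_set in input_genes.

    p is the one-sided hypergeometric tail P(X >= overlap), i.e. the chance
    of seeing at least this many gene-set members when drawing the input
    list at random from the background.
    """
    background = set(background)
    # Restrict both lists to the background so the proportions are comparable.
    input_genes = set(input_genes) & background
    gene_set = set(gene_set) & background

    n_background = len(background)
    n_set = len(gene_set)
    n_input = len(input_genes)
    overlap = len(input_genes & gene_set)

    total = comb(n_background, n_input)
    tail = sum(
        comb(n_set, i) * comb(n_background - n_set, n_input - i)
        for i in range(overlap, min(n_set, n_input) + 1)
    )
    return overlap, tail / total


# Toy data (hypothetical identifiers): 20 background genes, a 5-gene set,
# and a 4-gene input list that shares 3 genes with the set.
background = [f"g{i}" for i in range(20)]
gene_set = ["g0", "g1", "g2", "g3", "g4"]
input_list = ["g0", "g1", "g2", "g10"]

overlap, p = ora_p_value(input_list, gene_set, background)
print(f"overlap = {overlap}, p = {p:.3f}")  # overlap = 3, p = 0.032
```

Note how the result depends directly on the choice of background: the same overlap yields a different p-value if the background list is larger or smaller.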

Functional class scoring methods, which include rank-based methods (e.g., GSEA, GSA), are more sensitive than ORA methods. Rather than supplying a gene list based on an arbitrary threshold (e.g., fold change, p-value), FCS methods consider the entire data set. In a rank-based approach, the distribution of genes from a particular gene set at the top or bottom of the ranked list ultimately determines the significance of a pathway or gene set. However, if the user only has a list of genes with no apparent ranking order, an FCS approach is inappropriate.
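To illustrate the rank-based idea, here is a simplified sketch of a GSEA-style running-sum statistic. It is not the GSEA implementation itself: the function name `enrichment_score`, the `weight` parameter, and the toy genes are assumptions made for this example, and significance (typically assessed by permutation) is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return the largest deviation from zero of a running sum.

    ranked_genes: gene identifiers ordered from most up- to most down-regulated.
    scores: ranking metric for each gene, in the same order.
    weight: exponent applied to |score| for genes in the set (0 = unweighted).
    """
    gene_set = set(gene_set)
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for w, is_hit in zip(hit_weights, hits):
        # Step up at genes in the set, step down at all other genes.
        running += w / total_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (hypothetical identifiers and scores).
genes = [f"g{i}" for i in range(10)]
metric = [5, 4, 3, 2, 1, -1, -2, -3, -4, -5]

top_set = ["g0", "g1", "g2"]      # clustered at the top of the ranking
bottom_set = ["g7", "g8", "g9"]   # clustered at the bottom of the ranking

print(enrichment_score(genes, metric, top_set, weight=0))     # close to +1
print(enrichment_score(genes, metric, bottom_set, weight=0))  # close to -1
```

A gene set whose members are scattered evenly through the ranking yields a score near zero, which is why an unranked gene list gives this kind of method nothing to work with.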

Both ORA and FCS discard a large amount of information about a pathway. These methods can more aptly be described as gene set approaches. Gene sets are collections of genes “formed on the basis of shared biological or functional properties as defined by a reference knowledge base. Knowledge bases are database collections of molecular knowledge which may include molecular interactions, regulation, molecular product(s) and even phenotype associations.” Some examples of knowledge bases include GO, KEGG, Reactome, BioCarta, Pathway-Commons, WikiPathways, and PANTHER. In contrast, a pathway is not a list of genes but rather includes an interaction component usually related to a specific mechanism or process.

On the other hand, topology-based methods consider structural information such as gene product interactions, positions of genes, and the types of genes within a pathway. These methods (e.g., impact analysis, topology-based pathway enrichment analysis (TPEA)) have been shown to produce more accurate results when the user wishes to understand the types and directions of gene interactions and underlying mechanisms. However, such methods often “require experimental evidence for pathway structures and gene–gene interactions, which is largely unavailable for many organisms.”
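The toy sketch below shows, in the simplest possible terms, why position within a pathway can matter. It is not impact analysis or TPEA: the scoring rule (one point per changed gene plus one per gene downstream of it), the function names, and the five-node pathway are all hypothetical and serve only to contrast a directed pathway graph with a flat gene list.

```python
def downstream_counts(edges):
    """Count the genes reachable downstream of each gene in a directed graph."""
    graph = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())

    counts = {}
    for gene in graph:
        seen, stack = set(), list(graph[gene])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node])
        seen.discard(gene)  # do not count the gene itself if a cycle leads back
        counts[gene] = len(seen)
    return counts


def topology_weighted_score(changed_genes, edges):
    """Sum (1 + number of downstream genes) over changed genes in the pathway."""
    counts = downstream_counts(edges)
    return sum(1 + counts[gene] for gene in changed_genes if gene in counts)


# Hypothetical signalling cascade: receptor -> kinase -> TF -> two targets.
pathway_edges = [
    ("RECEPTOR", "KINASE"),
    ("KINASE", "TF"),
    ("TF", "TARGET1"),
    ("TF", "TARGET2"),
]

print(topology_weighted_score({"RECEPTOR"}, pathway_edges))  # 5
print(topology_weighted_score({"TARGET1"}, pathway_edges))   # 1
```

A gene set approach would treat both single-gene inputs identically (one of five pathway members changed), whereas even this crude rule separates an upstream regulator from a terminal target.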

Lastly, it is important to note that many of the classic methods developed within the above approaches were designed specifically for transcriptomic data. Other types of -omic data (e.g., proteomics, metabolomics, scRNA-seq, GWAS) differ intrinsically from transcriptomic data, and these differences should be considered when selecting a method. For methods/tools specific to proteomics, GWAS, epigenomics, and multi-omics, see Zhao and Rhee 2023.

Popular tools:

  1. Gene Set Enrichment Analysis (GSEA)
  2. iPathwayGuide (proprietary) (PT)
  3. ROntoTools (R package) (PT)
  4. Qiagen IPA (proprietary, CCR license) (ORA)
  5. Qlucore (proprietary, CCR license) (GSEA)
  6. WebGestalt (web-based) (ORA, GSEA)
  7. clusterProfiler (R package) (ORA, GSEA)
  8. DAVID (web-based) (ORA)
  9. g:Profiler (R and web-based) (ORA)
  10. Enrichr (web-based) (ORA)
  11. iDEA (R package) (scRNA-seq)
  12. TPEA (R package) (PT)

Note: We are not endorsing any one specific tool. Please be cautious in choosing a method for your data analysis, and make sure you understand the method behind, and the limitations of, any tool you use.

BTEP Training materials:

  1. Bioinformatics for Beginners Module 3: Pathway Analysis
  2. Functional Enrichment Analysis with clusterProfiler
  3. Training materials related to QIAGEN IPA and Qlucore Omics Explorer can be found in the BTEP Video Archive.

— Alex Emmons (BTEP)