Skip to content

Functional Enrichment Analysis with clusterProfiler

Learning Objectives

  1. Understand the capabilities of clusterProfiler in the context of functional enrichment analysis
  2. Learn about related packages to improve results reporting

What is Functional Enrichment Analysis?

Functional enrichment and pathway analysis have broad and varying definitions. For our purposes, there are three general approaches (Khatri et al. 2012):

  1. Over-Representation Analysis (ORA)
  2. Functional Class Scoring (FCS)
  3. Pathway Topology (PT)

Over-representation analysis (ORA)

statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression --- Khatri et al. 2012

From this, ORA determines which pathways are over or under represented by asking "are there more annotations in the gene list than expected?"

  • tests based on hypergeometric, chi-square, or binomial distribution
  • includes GO enrichment methods
  • prioritizes a subset of genes using an arbitrary, user determined threshold (e.g., p-value)
  • doesn't require the data, just the gene identifiers
  • not great for gene lists with less than 50 genes

Functional Class Scoring (FCS)

Includes ‘gene set scoring’ methods such as GSEA, which first compute DE scores for all genes measured, and subsequently compute gene set scores by aggregating the scores of contained genes. --- Geistlinger et al. 2021

GSEA

  • ignore gene position and role
  • do not pre-select genes (considers all gene expression) and so you must include data with gene identifiers for ranking
    • ranking by magnitude of change in gene expression between conditions
    • determines where genes from a gene set fall in the ranking
    • creates a running sum statistic
    • uses a permutational approach to determine significance
  • Broad Institute software but also available using web-based tools, R, and Qlucore (proprietary).
  • also considered a strategy encompassing a range of methods
    - self-contained methods vs competitive methods

Pathway Topology

ORA and FCS discard a large amount of information. These methods use gene sets, and even if the gene sets represent specific pathways, structural information such as gene product interactions, positions of genes, and types of genes is completely ignored. Pathway topology methods seek to rectify this problem.

PT methods are mostly considered network based.

Some examples:

Impact analysis (iPathwayGuide), Pathway-Express, SPIA, NetGSA, etc. (See Nguyen et al. 2019 for a review of PT methods.)

What is clusterProfiler?

clusterProfiler is an R package used for functional enrichment analysis (either ORA or GSEA). See the latest publication here. The primary documentation for clusterProfiler can be found here.

Advantages of clusterProfiler

  1. ORA and GSEA support
  2. Functionality for multiple ontology and pathway databases
  3. Allows user defined databases and annotations, which is particularly useful for non-model organisms
  4. Easily compare functional profiles between treatments
  5. integrates the tidy philosophy
  6. complementary packages (ChIPseeker,GOSemSim, enrichplot, ReactomePA, DOSE) provide a comprehensive tool suite for understanding functional enrichment.


Figure from T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141

Load dependencies

library(clusterProfiler)  
library(org.Hs.eg.db)
library(tidyverse)
library(DOSE)
library(ReactomePA)
library(enrichplot)

The Dataset

For today's lesson, we will use data from the Bioconductor package airway. The airway data is from Himes et al. (2014). These data, which are contained within a RangedSummarizedExperiment, object are from a bulk RNAseq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell samples. Each cell line includes a treated and untreated negative control. Differential expression testing using DESeq2 was applied to these data.

Load the differential expression results:

dexp<-read.csv("../data/S3_clusterProfiler/deseq2_DEGs.csv")
colnames(dexp)[1]<-"Ensembl"

Create gene lists:

# Extract all significant results for ORA
sigDEGs <- dplyr::filter(dexp, padj < 0.05)
# limiting results by log2FoldChange; may want to use log2FoldChange<0
sig_down<- dplyr::filter(dexp, padj < 0.05 & log2FoldChange < -1)
# limiting results by log2FoldChange; may want to use log2FoldChange>0
sig_up<- dplyr::filter(dexp, padj < 0.05 & log2FoldChange > 1)

# Rank results for GSEA; put in decreasing order.
GSEA_data<-dexp %>% filter(!is.na(padj)) %>%
  arrange(desc(log2FoldChange)) %>% 
  # create a named vector with log fold changes
  pull(log2FoldChange,name=symbol)

Annotations

We have already mapped gene symbols and Entrez IDs to our data using AnnotationDbi and EnsDb.Hsapiens.v75. However, clusterProfiler does offer support for gene ID conversion using the functions bitr() and bitr_kegg(). bitr() uses the OrgDb packages, of which there are 19 databases. bitr_kegg() unsurprisingly uses KEGG organism annotations. Use search_kegg_organism() to find the appropriate kegg code.

For example:

#get keytype info
keytypes(org.Hs.eg.db) 
 [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
 [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
[11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
[16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
[21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
[26] "UNIPROT"     
#need entrez IDs?
ids<-bitr(sigDEGs$Ensembl, fromType="ENSEMBL", 
          toType="ENTREZID", 
          OrgDb="org.Hs.eg.db") #NA values drop by default
'select()' returned 1:many mapping between keys and columns
Warning in bitr(sigDEGs$Ensembl, fromType = "ENSEMBL", toType = "ENTREZID", :
5.31% of input gene IDs are fail to map...
#view 
head(ids)
          ENSEMBL ENTREZID
1 ENSG00000152583     8404
2 ENSG00000165995      783
3 ENSG00000120129     1843
4 ENSG00000101347    25939
5 ENSG00000189221     4128
6 ENSG00000211445     2878

Note

clusterProfiler functions are fully supported by OrgDb and you can provide a Keytype argument within functions.

Over-representation analysis with clusterProfiler

clusterProfiler, along with complementary packages, can easily be used to generate functional enrichment results using over-representation analysis from the following databases: GO, KEGG, DOSE, REACTOME, Wikipathways, DisGeNET, network of cancer genes.

Using ?enrichGO()

GO enrichment refers specifically to gene ontology.

The Gene Ontology (GO) provides a framework and set of concepts for describing the functions of gene products from all organisms. --- https://www.ebi.ac.uk/ols/ontologies/go.
GO integrates information about gene product function in the context of three domains:

  1. molecular function (F) - "the molecular activities of individual gene products" (e.g., kinase)
  2. cellular component (C) - "where the gene products are active" (e.g., mitochondria)
  3. *biological process (P) - "the pathways and larger processes to which that gene product’s activity contributes" (e.g., transport)

*Commonly used for pathway enrichment analysis

GO terms can be highly redundant, and GO enrichment analysis often requires us to deal with this redundancy (See below).

Checkout geneontology.org.

Read more about GO enrichment analysis with clusterProfiler here.

Let's perform GO enrichment on our up-regulated genes. Notice that OrgDB databases and keyType can be specified directly in the function. In addition, you can specify a list of background genes or use the default which will use all genes in the queried database (e.g., GO-BP). Here we are specifying the genes that underwent differential expression testing, but there is currently an open issue on the way the universe is affecting results, which can impact significance values.

ego <- enrichGO(gene= sig_up$symbol,
                OrgDb= org.Hs.eg.db,
                keyType = "SYMBOL",
                ont= "BP",
                universe=dexp$symbol)

Note

If we want to included all three ontologies (i.e., MF, CC, and BP), we can do so with ont= "ALL".

Let's take a peek at the results:

ego
#
# over-representation test
#
#...@organism    Homo sapiens 
#...@ontology    BP 
#...@keytype     SYMBOL 
#...@gene    chr [1:451] "SPARCL1" "CACNB2" "DUSP1" "SAMHD1" "MAOA" "GPX3" "STEAP2" ...
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...53 enriched terms found
'data.frame':   53 obs. of  9 variables:
 $ ID         : chr  "GO:0003012" "GO:0045444" "GO:0030198" "GO:0043062" ...
 $ Description: chr  "muscle system process" "fat cell differentiation" "extracellular matrix organization" "extracellular structure organization" ...
 $ GeneRatio  : chr  "27/351" "18/351" "21/351" "21/351" ...
 $ BgRatio    : chr  "374/14114" "207/14114" "277/14114" "278/14114" ...
 $ pvalue     : num  7.27e-07 4.25e-06 6.02e-06 6.37e-06 6.73e-06 ...
 $ p.adjust   : num  0.00266 0.00493 0.00493 0.00493 0.00493 ...
 $ qvalue     : num  0.00238 0.0044 0.0044 0.0044 0.0044 ...
 $ geneID     : chr  "CACNB2/KLF15/ERRFI1/FOXO1/PDLIM5/LMCD1/LMOD1/NR4A3/SCN7A/ABCC9/OXTR/APBB2/ABAT/FOXO3/CFLAR/ACTA2/LEP/ATP2A2/GDN"| __truncated__ "SORT1/INHBB/ZBTB16/FOXO1/CEBPD/NR4A3/STEAP4/KLF5/LEP/CCDC71L/ALDH6A1/AXIN2/FRZB/NR4A1/LMO3/DIO2/FABP4/ENPP1" "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ ...
 $ Count      : int  27 18 21 21 21 8 23 27 8 10 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

We have 53 significantly enriched terms (p.adjust < 0.05, log2foldchange > 1) that are up-regulated in response to dexamethasone treatment.

What do we need to know about the output from enrichGO()?

Some of the columns are self-explanatory, while others are not:

GeneRatio (k/n): "ratio of input genes that are annotated in a term." (# of genes in the input list that are annotated to the corresponding term (function) / # of genes in input list).

BgRatio (M/N): "ratio of all genes that are annotated in this term" (The number of genes that are annotated to the term / The total number of genes in the background gene list ("universe")).

pvalue: over-representation assessed using hypogeometric distribution, which is equivalent to a one sided Fisher's exact test.

p.adjust: Hypergeometric p-value after correction for multiple testing ("BH adjusted p-values"); see ?p.adjust.

qvalue: another take on FDR adjusted p-values; See ?qvalue::qvalue.

geneID: The gene IDs that overlap with the functional term.

Count: Total number of genes from the input gene list that match the functional term.

Info

p.adjust may potentially be more conservative and robust than q-value. See the following discussions:
- conceptual question about FDR, FDR adjusted p-value and q-value
- Why use the qvalue package instead of p.adjust in R stats?

Reduce redundancy

Because of the nature of GO, organized as a directed acyclic graph, parent terms can show enrichment due to over represented child terms. To reduce redundancy among enriched terms, clusterProfiler uses a simplify() function that uses the "GOSemSim package to calculate semantic similarities among enriched GO terms using multiple methods based on information content or graph structure." GO terms with >0.7 similarity are removed and represented by the most significant term (Wu et al. 2021).

Let's reduce our GO terms using simplify().

s_ego<-clusterProfiler::simplify(ego)
s_ego
#
# over-representation test
#
#...@organism    Homo sapiens 
#...@ontology    BP 
#...@keytype     SYMBOL 
#...@gene    chr [1:451] "SPARCL1" "CACNB2" "DUSP1" "SAMHD1" "MAOA" "GPX3" "STEAP2" ...
#...pvalues adjusted by 'BH' with cutoff <0.05 
#...35 enriched terms found
'data.frame':   35 obs. of  9 variables:
 $ ID         : chr  "GO:0003012" "GO:0045444" "GO:0030198" "GO:0043062" ...
 $ Description: chr  "muscle system process" "fat cell differentiation" "extracellular matrix organization" "extracellular structure organization" ...
 $ GeneRatio  : chr  "27/351" "18/351" "21/351" "21/351" ...
 $ BgRatio    : chr  "374/14114" "207/14114" "277/14114" "278/14114" ...
 $ pvalue     : num  7.27e-07 4.25e-06 6.02e-06 6.37e-06 6.73e-06 ...
 $ p.adjust   : num  0.00266 0.00493 0.00493 0.00493 0.00493 ...
 $ qvalue     : num  0.00238 0.0044 0.0044 0.0044 0.0044 ...
 $ geneID     : chr  "CACNB2/KLF15/ERRFI1/FOXO1/PDLIM5/LMCD1/LMOD1/NR4A3/SCN7A/ABCC9/OXTR/APBB2/ABAT/FOXO3/CFLAR/ACTA2/LEP/ATP2A2/GDN"| __truncated__ "SORT1/INHBB/ZBTB16/FOXO1/CEBPD/NR4A3/STEAP4/KLF5/LEP/CCDC71L/ALDH6A1/AXIN2/FRZB/NR4A1/LMO3/DIO2/FABP4/ENPP1" "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ ...
 $ Count      : int  27 18 21 21 21 8 23 27 10 8 ...
#...Citation
 T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
 clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
 The Innovation. 2021, 2(3):100141 

We are now left with 35 enriched terms.

Visualize results using enrichplot a complementary package of clusterProfiler

enrichplot is based on ggplot2 implementation. Therefore, all plots can be customized with ggplot2 functions.

Dotplot

Plot results using a dotplot.

dotplot(s_ego,showCategory=20,font.size=10,label_format=70)+
  scale_size_continuous(range=c(1, 7))+
  theme_minimal() +
  ggtitle("GO Enrichment of up-regulated genes")
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.

Lollipop plot

These are the same results exhibited differently. Notice the seamless integration with tidyverse functionality. clusterProfiler extends ggplot2 functionality to accept enrichment results directly.

s_ego %>% filter(p.adjust < 0.03) %>%
ggplot(showCategory = 20,
  aes(GeneRatio, forcats::fct_reorder(Description, GeneRatio))) + 
  geom_segment(aes(xend=0, yend = Description)) +
  geom_point(aes(color=p.adjust, size = Count)) +
  scale_color_viridis_c(guide=guide_colorbar(reverse=TRUE)) +
  scale_size_continuous(range=c(1, 7)) +
  theme_minimal() + 
  xlab("Gene Ratio") +
  ylab(NULL) + 
  ggtitle("GO Enrichment of up-regulated genes")

cnetplot

The category netplot (cnetplot) allows us to visualize the relationship and overlap of genes associated with our top terms (or terms that we specify). This is a nice way of visualizing genes that are important for several enriched terms.

# color genes by log2 fold changes; create named vector
foldchanges <- sig_up$log2FoldChange
names(foldchanges) <- sig_up$symbol

## by default cnetplot gives the top 5 significant terms 
#if we want to focus on specific terms we can subset our results
s_ego2 <- s_ego
s_ego2@result<-s_ego[c(6,15,32),]

cnetplot(s_ego2, 
         foldChange=foldchanges, 
         shadowtext='gene',
         cex_label_gene=0.25,
         cex_label_category=0.5,
         color_category="purple")         
Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'.
 The foldChange parameter will be removed in the next version.
Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(category = your_value)' instead of 'color_category'.
 The color_category parameter will be removed in the next version.
Warning in cnetplot.enrichResult(x, ...): Use 'cex.params = list(category_label = your_value)' instead of 'cex_label_category'.
 The cex_label_category parameter will be removed in the next version.
Warning in cnetplot.enrichResult(x, ...): Use 'cex.params = list(gene_label = your_value)' instead of 'cex_label_gene'.
 The cex_label_gene parameter will be removed in the next version.
Scale for size is already present.
Adding another scale for size, which will replace the existing scale.

Note

See how the enrichResult was subset above. These are S4 objects that can be subset using $ and []. To get a quick understanding of the enriched results, we can also use nrow, ncol, and dim.

Other plots with enrichplot

Other plot types possible with enrichplot include barplots, heatmaps (as alternatives to cnetplots), tree plots, enrichment maps, upset plots (alternative to venn diagram), ridgeline plots, and GSEA plots.

Over representation analysis using other databases

Using clusterProfiler you can conduct ORA with GO, KEGG, MKEGG (KEGG modules), and WikiPathways. Using complementary packages (i.e., DOSE, ReactomePA), ORA can also be conducted with Disease Ontology (DO), Reactome, network of cancer genes, and DisGeNET.

#Example with KEGG pathways
kgo <- enrichKEGG(gene = sig_up$entrez,
organism = 'hsa',
pvalueCutoff  = 0.05,
universe = as.character(dexp$entrez)
)
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
kgo_df<-data.frame(kgo)

Note

clusterProfiler uses the latest version of the KEGG database and supports species with KEGG annotation ( > 6000).

Merge enrichResult

It is possible to merge multiple enrichResult objects; this could be up-regulated merged with down-regulated or different functional repositories.

mlist<-list(s_ego,kgo)
names(mlist)<-c("GO-enrich","KEGG-enrich")
mresult<-merge_result(mlist)
dotplot(mresult,showCategory=10) 

Custom ORA using enricher()

There are many databases devoted to relating genes and gene products to pathways, processes, and other phenomenon and a multitude of organisms that may not have annotations in widely used databases (e.g., KEGG, GO). clusterProfiler has generalizable functions that allow users to customize the knowledge database and annotations used for functional enrichment.

Because the gene matrix transposed (GMT) format is often used to share gene set annotations, clusterProfiler provides the parser functions read.gmt() and read.gmt.wp(), the latter of which is specific to Wikipathways.

For example, let's download and use the Molecular Signatures Database (MSigDB) for ORA. In this case, we are interested in the canonical pathways from C2.

C2_cp<-read.gmt("../data/S3_clusterProfiler/c2.cp.v2023.1.Hs.symbols.gmt")

Note

We could also use the package msigdbr to obtain MSigDB gene sets in a tidy format without downloading directly from MSigDB.

Use enricher

Ex <- enricher(sig_up$symbol, 
               TERM2GENE = C2_cp,
               universe=dexp$symbol)
head(Ex)
                                                         ID
NABA_CORE_MATRISOME                     NABA_CORE_MATRISOME
PID_INTEGRIN1_PATHWAY                 PID_INTEGRIN1_PATHWAY
WP_COPPER_HOMEOSTASIS                 WP_COPPER_HOMEOSTASIS
BIOCARTA_NPP1_PATHWAY                 BIOCARTA_NPP1_PATHWAY
WP_GABA_METABOLISM_AKA_GHB       WP_GABA_METABOLISM_AKA_GHB
REACTOME_LAMININ_INTERACTIONS REACTOME_LAMININ_INTERACTIONS
                                                Description GeneRatio   BgRatio
NABA_CORE_MATRISOME                     NABA_CORE_MATRISOME    23/296 235/10641
PID_INTEGRIN1_PATHWAY                 PID_INTEGRIN1_PATHWAY    10/296  61/10641
WP_COPPER_HOMEOSTASIS                 WP_COPPER_HOMEOSTASIS     8/296  51/10641
BIOCARTA_NPP1_PATHWAY                 BIOCARTA_NPP1_PATHWAY     4/296  10/10641
WP_GABA_METABOLISM_AKA_GHB       WP_GABA_METABOLISM_AKA_GHB     4/296  10/10641
REACTOME_LAMININ_INTERACTIONS REACTOME_LAMININ_INTERACTIONS     6/296  30/10641
                                    pvalue     p.adjust      qvalue
NABA_CORE_MATRISOME           1.516106e-07 0.0001952744 0.000179858
PID_INTEGRIN1_PATHWAY         6.120069e-06 0.0039413246 0.003630167
WP_COPPER_HOMEOSTASIS         7.335030e-05 0.0269492209 0.024821651
BIOCARTA_NPP1_PATHWAY         1.079261e-04 0.0269492209 0.024821651
WP_GABA_METABOLISM_AKA_GHB    1.079261e-04 0.0269492209 0.024821651
REACTOME_LAMININ_INTERACTIONS 1.488267e-04 0.0269492209 0.024821651
                                                                                                                                                                       geneID
NABA_CORE_MATRISOME           SPARCL1/LAMA2/COL11A1/CRISPLD2/SPON1/IGFBP2/COL7A1/FBN2/MFGE8/COL4A1/OMD/COL4A4/RSPO1/NID1/CILP/CRIM1/DPT/THBS1/LGI3/COL4A3/HAPLN3/PRG4/COL28A1
PID_INTEGRIN1_PATHWAY                                                                                       LAMA2/COL11A1/COL7A1/COL4A1/COL4A4/NID1/ITGA10/ITGA8/THBS1/COL4A3
WP_COPPER_HOMEOSTASIS                                                                                                         STEAP2/MT2A/STEAP1/FOXO1/MT1E/STEAP4/FOXO3/MT1X
BIOCARTA_NPP1_PATHWAY                                                                                                                              COL4A1/COL4A4/COL4A3/ENPP1
WP_GABA_METABOLISM_AKA_GHB                                                                                                                               ABAT/GLS/ADHFE1/MAOB
REACTOME_LAMININ_INTERACTIONS                                                                                                          LAMA2/COL7A1/COL4A1/COL4A4/NID1/COL4A3
                              Count
NABA_CORE_MATRISOME              23
PID_INTEGRIN1_PATHWAY            10
WP_COPPER_HOMEOSTASIS             8
BIOCARTA_NPP1_PATHWAY             4
WP_GABA_METABOLISM_AKA_GHB        4
REACTOME_LAMININ_INTERACTIONS     6

This requires a TERM2GENE argument. This argument accepts a data frame with two columns: "term" and "gene". "term" includes the term ID, while "gene" includes the mapped gene associated with that term. You may also want to include a TERM2NAME argument, which is a dataframe with a column for the term ID and the term name.

GSEA

The complementary function for a customized GSEA, rather than ORA, is GSEA().

Gene set files

Enrichr has a fantastic library of potential gene sets of interest.

Compare functional profiles between conditions

clusterProfiler is great at biological theme comparison. The function, compareCluster, allows us to compare multiple conditions, time points, experiments, etc. "Comparing functional profiles can reveal functional consensus and differences among different experiments and helps in identifying differential functional modules in omics datasets" (Wu et al. 2021)

This function takes a list of gene lists from the different conditions we want to compare, creates functional enrichment profiles for each list, and combines the results in a single file that can easily be visualized. Alternatively, you can provide a formula (see here).

comparelist<-list(sig_down$Ensembl,sig_up$Ensembl)
names(comparelist)<-c("down-regulated","up-regulated") 
cclust<-compareCluster(geneCluster = comparelist, 
               fun = enrichGO,
               OrgDb= org.Hs.eg.db,
               keyType = "ENSEMBL",
               ont= "BP",
               universe=dexp$Ensembl)

Visualize results with a dotplot.

dotplot(cclust,showCategory=10)  

GSEA with clusterProfiler

clusterProfiler also supports the Broad Institute software method of gene set enrichment analysis (GSEA) developed by Subramanian et al. 2005. Because this method uses all of the data (complete ranked gene list), this method is able to unveil situations where genes within a gene set "change in a small but coordinated way" (i.e., all of the data is used regardless of arbitrary cut-offs like p-values).

GSEA requires an ordered ranked gene list as input.

View GSEA_data

head(GSEA_data)
      ALOX15B        ZBTB16        GUCY2D RP11-357D18.1        STEAP4 
    10.970146      7.273192      6.449094      6.184841      5.112808 
      ANGPTL7 
     4.823511 

Method outputs

  1. Enrichment Score and normalized enrichment score.

    Enrichment Score - represents the degree to which a set is over-represented at the top or bottom of the ranked list.

    normalized enrichment score (NES) - allows comparability across gene sets by accounting for differences in gene set size and in correlations between gene sets and the expression dataset. A positive NES is associated with pathway activation, and a negative NES is associated with pathway suppression.

  2. Significance calculation (pvalue) using a permutation test on gene labels.

  3. Adjustment of multiple hypothesis testing (p.adjust) and FDR control (qvalue).

Leading edge analysis outputs:

The leading edge subset of a gene set is the subset of members that contribute most to the ES. For a positive ES, the leading edge subset is the set of members that appear in the ranked list prior to the peak score. For a negative ES, it is the set of members that appear subsequent to the peak score. --- GSEAUserGuide

Leading_edge:

tags - the percentage of genes before or after the peak in the running enrichment score, which is an indication of the percentage of genes contributing to the enrichment score.

list - where in the list the enrichment score is attained.

signal - enrichment signal strength.

core_enrichment:

Returns the core enriched genes, those that contribute the most to a given gene set.

Note

The gene label permutation step uses random reordering. Results may differ between runs if you do not use a random seed generator (e.g., set.seed()).

Similar to ORA, GSEA can be performed using GO (gseGO), KEGG pathways (gseKEGG), KEGG modules (gseMKEGG), and WikiPathways (gseWP). With the DOSE package, you can also use gseDGN, gseDO, and gseNCG, and with ReactomePA (gsePathway).

There is also a customizable function similar to enricher, which is GSEA().

GSEA using GO (gseGO())

set.seed(123)
gsea_go <- gseGO(geneList = GSEA_data,
              OrgDb = org.Hs.eg.db,
              ont = "BP",
              keyType = "SYMBOL",
              seed=TRUE)
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.05% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
leading edge analysis...
done...

Plot GSEA results

enrichplot offers several functions specific to GSEA output, especially in regard to visualizing the running score and preranked list (gseaplot() and gseaplot2()). gseaplot2() allows us to visualize several enriched gene sets at once.

enrichplot::gseaplot2(gsea_go, geneSetID = c(1:2))

enrichplot::gseaplot2(gsea_go, geneSetID = c(29,35))

We can use an upset plot to look at fold distributions over gene sets. This requries the installation of the ggupset package.

enrichplot::upsetplot(gsea_go) 

Note

Other enrichplot functions can also be applied to GSEA outputs (e.g., dotplot, cnetplot, enrichmap, etc.) or we can use the ggplot2 functionality.

Using clusterProfiler dplyr verb extensions

We can use clusterProfiler extensions of dplyr verbs (mutate(),filter(),select(),summarize(),slice(),group_by(),arrange()) for easy manipulation of enrichResult, gseaResult, and compareClusterResult objects. mutate() can be particularly powerful and allow us to use different enrichment result metrics (e.g., rich factor and fold enrichment for ORA).

Let's filter the gseaResult to only include NES results greater than 1.95 or less than -2.

#filter
gsea_go2 <- filter(gsea_go,NES > 1.95 | NES < -2)
dim(gsea_go2)
[1] 10 11
#replot upset plot
enrichplot::upsetplot(gsea_go2,n=10) 

#plot using a barplot
ggplot(gsea_go2, showCategory=10, aes(NES, fct_reorder(Description, NES),
fill=qvalue)) +
  geom_col() +
  geom_vline(xintercept=0, linetype="dashed", color="blue",size=1)+
  scale_fill_gradientn(colours=c("#b3eebe","#46bac2", "#371ea3"),
                       guide=guide_colorbar(reverse=TRUE))+
  scale_x_continuous(expand=c(0,0))+
  theme_bw() + 
  #theme(text=element_text(size=8))+
  xlab("Normalized Enrichment Score") +
  ylab(NULL) +
  ggtitle("GSEA with GO")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

ChIPseeker

Following ChIP-seq, ChIPseeker paired with clusterProfiler can be used to connect functional analysis with genomic ROIs. See the vignette here.

Other packages to check out

GSVA package - useful for clinical data (large patient cohort).
pathview.
gProfileR.

Additional Resources

Readings

  1. Chicco D, Agapito G (2022) Nine quick tips for pathway enrichment analysis. PLoS Comput Biol 18(8): e1010348. https://doi.org/10.1371/journal.pcbi.1010348.
  2. Wijesooriya K, Jadaan SA, Perera KL, Kaur T, Ziemann M. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput Biol. 2022 Mar 9;18(3):e1009935. doi: 10.1371/journal.pcbi.1009935. PMID: 35263338; PMCID: PMC8936487..
  3. Ludwig Geistlinger, Gergely Csaba, Mara Santarelli, Marcel Ramos, Lucas Schiffer, Nitesh Turaga, Charity Law, Sean Davis, Vincent Carey, Martin Morgan, Ralf Zimmer, Levi Waldron, Toward a gold standard for benchmarking gene set enrichment analysis, Briefings in Bioinformatics, Volume 22, Issue 1, January 2021, Pages 545–556, https://doi.org/10.1093/bib/bbz158

Useful tutorials

  1. Functional Analysis for RNA-seq: Over-representation Analysis
  2. Functional Analysis for RNA-seq: Functional Class Scoring
  3. Functional Enrichment Analysis.

R Session info

sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] enrichplot_1.18.4     ReactomePA_1.42.0     DOSE_3.24.2          
 [4] lubridate_1.9.2       forcats_1.0.0         stringr_1.5.0        
 [7] dplyr_1.1.2           purrr_1.0.1           readr_2.1.4          
[10] tidyr_1.3.0           tibble_3.2.1          ggplot2_3.4.2        
[13] tidyverse_2.0.0       org.Hs.eg.db_3.16.0   AnnotationDbi_1.60.2 
[16] IRanges_2.32.0        S4Vectors_0.36.2      Biobase_2.58.0       
[19] BiocGenerics_0.44.0   clusterProfiler_4.6.2

loaded via a namespace (and not attached):
  [1] ggnewscale_0.4.8       fgsea_1.24.0           colorspace_2.1-0      
  [4] ggtree_3.6.2           gson_0.1.0             qvalue_2.30.0         
  [7] XVector_0.38.0         aplot_0.1.10           rstudioapi_0.14       
 [10] farver_2.1.1           graphlayouts_1.0.0     ggrepel_0.9.3         
 [13] bit64_4.0.5            fansi_1.0.4            scatterpie_0.1.9      
 [16] codetools_0.2-19       splines_4.2.3          cachem_1.0.7          
 [19] GOSemSim_2.24.0        knitr_1.42             polyclip_1.10-4       
 [22] jsonlite_1.8.4         GO.db_3.16.0           png_0.1-8             
 [25] graph_1.76.0           graphite_1.44.0        ggforce_0.4.1         
 [28] compiler_4.2.3         httr_1.4.5             Matrix_1.5-4          
 [31] fastmap_1.1.1          lazyeval_0.2.2         cli_3.6.1             
 [34] tweenr_2.0.2           htmltools_0.5.5        tools_4.2.3           
 [37] igraph_1.4.2           gtable_0.3.3           glue_1.6.2            
 [40] GenomeInfoDbData_1.2.9 reshape2_1.4.4         rappdirs_0.3.3        
 [43] fastmatch_1.1-3        Rcpp_1.0.10            vctrs_0.6.2           
 [46] Biostrings_2.66.0      ape_5.7-1              nlme_3.1-162          
 [49] ggraph_2.1.0           xfun_0.39              timechange_0.2.0      
 [52] lifecycle_1.0.3        zlibbioc_1.44.0        MASS_7.3-59           
 [55] scales_1.2.1           tidygraph_1.2.3        reactome.db_1.82.0    
 [58] hms_1.1.3              ggupset_0.3.0          parallel_4.2.3        
 [61] RColorBrewer_1.1-3     yaml_2.3.7             memoise_2.0.1         
 [64] gridExtra_2.3          downloader_0.4         ggfun_0.0.9           
 [67] HDO.db_0.99.1          yulab.utils_0.0.6      stringi_1.7.12        
 [70] RSQLite_2.3.1          tidytree_0.4.2         BiocParallel_1.32.6   
 [73] GenomeInfoDb_1.34.9    rlang_1.1.0            pkgconfig_2.0.3       
 [76] bitops_1.0-7           evaluate_0.20          lattice_0.21-8        
 [79] labeling_0.4.2         treeio_1.22.0          patchwork_1.1.2       
 [82] htmlwidgets_1.6.2      cowplot_1.1.1          shadowtext_0.1.2      
 [85] bit_4.0.5              tidyselect_1.2.0       plyr_1.8.8            
 [88] magrittr_2.0.3         R6_2.5.1               generics_0.1.3        
 [91] DBI_1.1.3              pillar_1.9.0           withr_2.5.0           
 [94] KEGGREST_1.38.0        RCurl_1.98-1.12        crayon_1.5.2          
 [97] utf8_1.2.3             tzdb_0.3.0             rmarkdown_2.21        
[100] viridis_0.6.2          grid_4.2.3             data.table_1.14.8     
[103] blob_1.2.4             digest_0.6.31          gridGraphics_0.5-1    
[106] munsell_0.5.0          viridisLite_0.4.1      ggplotify_0.1.0