Functional Enrichment Analysis with clusterProfiler
- Understand the capabilities of
clusterProfilerin the context of functional enrichment analysis
- Learn about related packages to improve results reporting
What is Functional Enrichment Analysis?
Functional enrichment and pathway analysis have broad and varying definitions. For our purposes, there are three general approaches (Khatri et al. 2012):
- Over-Representation Analysis (ORA)
- Functional Class Scoring (FCS)
- Pathway Topology (PT)
Over-representation analysis (ORA)
statistically evaluates the fraction of genes in a particular pathway found among the set of genes showing changes in expression --- Khatri et al. 2012
From this, ORA determines which pathways are over or under represented by asking "are there more annotations in the gene list than expected?"
- tests based on hypergeometric, chi-square, or binomial distribution
- includes GO enrichment methods
- prioritizes a subset of genes using an arbitrary, user determined threshold (e.g., p-value)
- doesn't require the data, just the gene identifiers
- not great for gene lists with less than 50 genes
Functional Class Scoring (FCS)
Includes ‘gene set scoring’ methods such as GSEA, which first compute DE scores for all genes measured, and subsequently compute gene set scores by aggregating the scores of contained genes. --- Geistlinger et al. 2021
- ignore gene position and role
- do not pre-select genes (considers all gene expression) and so you must include data with gene identifiers for ranking
- ranking by magnitude of change in gene expression between conditions
- determines where genes from a gene set fall in the ranking
- creates a running sum statistic
- uses a permutational approach to determine significance
- Broad Institute software but also available using web-based tools, R, and Qlucore (proprietary).
- also considered a strategy encompassing a range of methods
- self-contained methods vs competitive methods
ORA and FCS discard a large amount of information. These methods use gene sets, and even if the gene sets represent specific pathways, structural information such as gene product interactions, positions of genes, and types of genes is completely ignored. Pathway topology methods seek to rectify this problem.
PT methods are mostly considered network based.
Impact analysis (iPathwayGuide), Pathway-Express, SPIA, NetGSA, etc. (See Nguyen et al. 2019 for a review of PT methods.)
What is clusterProfiler?
Advantages of clusterProfiler
- ORA and GSEA support
- Functionality for multiple ontology and pathway databases
- Allows user defined databases and annotations, which is particularly useful for non-model organisms
- Easily compare functional profiles between treatments
- integrates the tidy philosophy
- complementary packages (
DOSE) provide a comprehensive tool suite for understanding functional enrichment.
Figure from T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
library(clusterProfiler) library(org.Hs.eg.db) library(tidyverse) library(DOSE) library(ReactomePA) library(enrichplot)
For today's lesson, we will use data from the Bioconductor package
airway data is from Himes et al. (2014). These data, which are contained within a
RangedSummarizedExperiment, object are from a bulk RNAseq experiment. In the experiment, the authors "characterized transcriptomic changes in four primary human ASM cell lines that were treated with dexamethasone," a common therapy for asthma. The airway package includes RNAseq count data from 8 airway smooth muscle cell samples. Each cell line includes a treated and untreated negative control. Differential expression testing using
DESeq2 was applied to these data.
Load the differential expression results:
Create gene lists:
# Extract all significant results for ORA sigDEGs <- dplyr::filter(dexp, padj < 0.05) # limiting results by log2FoldChange; may want to use log2FoldChange<0 sig_down<- dplyr::filter(dexp, padj < 0.05 & log2FoldChange < -1) # limiting results by log2FoldChange; may want to use log2FoldChange>0 sig_up<- dplyr::filter(dexp, padj < 0.05 & log2FoldChange > 1) # Rank results for GSEA; put in decreasing order. GSEA_data<-dexp %>% filter(!is.na(padj)) %>% arrange(desc(log2FoldChange)) %>% # create a named vector with log fold changes pull(log2FoldChange,name=symbol)
We have already mapped gene symbols and Entrez IDs to our data using
clusterProfiler does offer support for gene ID conversion using the functions
bitr() uses the
OrgDb packages, of which there are 19 databases.
bitr_kegg() unsurprisingly uses KEGG organism annotations. Use
search_kegg_organism() to find the appropriate kegg code.
#get keytype info keytypes(org.Hs.eg.db)
 "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"  "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"  "GENETYPE" "GO" "GOALL" "IPI" "MAP"  "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"  "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"  "UNIPROT"
#need entrez IDs? ids<-bitr(sigDEGs$Ensembl, fromType="ENSEMBL", toType="ENTREZID", OrgDb="org.Hs.eg.db") #NA values drop by default
'select()' returned 1:many mapping between keys and columns
Warning in bitr(sigDEGs$Ensembl, fromType = "ENSEMBL", toType = "ENTREZID", : 5.31% of input gene IDs are fail to map...
ENSEMBL ENTREZID 1 ENSG00000152583 8404 2 ENSG00000165995 783 3 ENSG00000120129 1843 4 ENSG00000101347 25939 5 ENSG00000189221 4128 6 ENSG00000211445 2878
clusterProfiler functions are fully supported by OrgDb and you can provide a Keytype argument within functions.
Over-representation analysis with clusterProfiler
clusterProfiler, along with complementary packages, can easily be used to generate functional enrichment results using over-representation analysis from the following databases: GO, KEGG, DOSE, REACTOME, Wikipathways, DisGeNET, network of cancer genes.
GO enrichment refers specifically to gene ontology.
The Gene Ontology (GO) provides a framework and set of concepts for describing the functions of gene products from all organisms. --- https://www.ebi.ac.uk/ols/ontologies/go.
GO integrates information about gene product function in the context of three domains:
- molecular function (F) - "the molecular activities of individual gene products" (e.g., kinase)
- cellular component (C) - "where the gene products are active" (e.g., mitochondria)
- *biological process (P) - "the pathways and larger processes to which that gene product’s activity contributes" (e.g., transport)
*Commonly used for pathway enrichment analysis
GO terms can be highly redundant, and GO enrichment analysis often requires us to deal with this redundancy (See below).
Read more about GO enrichment analysis with
Let's perform GO enrichment on our up-regulated genes. Notice that
OrgDB databases and
keyType can be specified directly in the function. In addition, you can specify a list of background genes or use the default which will use all genes in the queried database (e.g., GO-BP). Here we are specifying the genes that underwent differential expression testing, but there is currently an open issue on the way the universe is affecting results, which can impact significance values.
ego <- enrichGO(gene= sig_up$symbol, OrgDb= org.Hs.eg.db, keyType = "SYMBOL", ont= "BP", universe=dexp$symbol)
If we want to included all three ontologies (i.e., MF, CC, and BP), we can do so with
Let's take a peek at the results:
# # over-representation test # #...@organism Homo sapiens #...@ontology BP #...@keytype SYMBOL #...@gene chr [1:451] "SPARCL1" "CACNB2" "DUSP1" "SAMHD1" "MAOA" "GPX3" "STEAP2" ... #...pvalues adjusted by 'BH' with cutoff <0.05 #...53 enriched terms found 'data.frame': 53 obs. of 9 variables: $ ID : chr "GO:0003012" "GO:0045444" "GO:0030198" "GO:0043062" ... $ Description: chr "muscle system process" "fat cell differentiation" "extracellular matrix organization" "extracellular structure organization" ... $ GeneRatio : chr "27/351" "18/351" "21/351" "21/351" ... $ BgRatio : chr "374/14114" "207/14114" "277/14114" "278/14114" ... $ pvalue : num 7.27e-07 4.25e-06 6.02e-06 6.37e-06 6.73e-06 ... $ p.adjust : num 0.00266 0.00493 0.00493 0.00493 0.00493 ... $ qvalue : num 0.00238 0.0044 0.0044 0.0044 0.0044 ... $ geneID : chr "CACNB2/KLF15/ERRFI1/FOXO1/PDLIM5/LMCD1/LMOD1/NR4A3/SCN7A/ABCC9/OXTR/APBB2/ABAT/FOXO3/CFLAR/ACTA2/LEP/ATP2A2/GDN"| __truncated__ "SORT1/INHBB/ZBTB16/FOXO1/CEBPD/NR4A3/STEAP4/KLF5/LEP/CCDC71L/ALDH6A1/AXIN2/FRZB/NR4A1/LMO3/DIO2/FABP4/ENPP1" "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ ... $ Count : int 27 18 21 21 21 8 23 27 8 10 ... #...Citation T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
We have 53 significantly enriched terms (p.adjust < 0.05, log2foldchange > 1) that are up-regulated in response to dexamethasone treatment.
What do we need to know about the output from
Some of the columns are self-explanatory, while others are not:
GeneRatio (k/n): "ratio of input genes that are annotated in a term." (# of genes in the input list that are annotated to the corresponding term (function) / # of genes in input list).
BgRatio (M/N): "ratio of all genes that are annotated in this term" (The number of genes that are annotated to the term / The total number of genes in the background gene list ("universe")).
pvalue: over-representation assessed using hypogeometric distribution, which is equivalent to a one sided Fisher's exact test.
p.adjust: Hypergeometric p-value after correction for multiple testing ("BH adjusted p-values"); see
qvalue: another take on FDR adjusted p-values; See
geneID: The gene IDs that overlap with the functional term.
Count: Total number of genes from the input gene list that match the functional term.
p.adjust may potentially be more conservative and robust than q-value. See the following discussions:
- conceptual question about FDR, FDR adjusted p-value and q-value
- Why use the qvalue package instead of p.adjust in R stats?
Because of the nature of GO, organized as a directed acyclic graph, parent terms can show enrichment due to over represented child terms. To reduce redundancy among enriched terms,
clusterProfiler uses a
simplify() function that uses the "
GOSemSim package to calculate semantic similarities among enriched GO terms using multiple methods based on information content or graph structure." GO terms with >0.7 similarity are removed and represented by the most significant term (Wu et al. 2021).
Let's reduce our GO terms using
# # over-representation test # #...@organism Homo sapiens #...@ontology BP #...@keytype SYMBOL #...@gene chr [1:451] "SPARCL1" "CACNB2" "DUSP1" "SAMHD1" "MAOA" "GPX3" "STEAP2" ... #...pvalues adjusted by 'BH' with cutoff <0.05 #...35 enriched terms found 'data.frame': 35 obs. of 9 variables: $ ID : chr "GO:0003012" "GO:0045444" "GO:0030198" "GO:0043062" ... $ Description: chr "muscle system process" "fat cell differentiation" "extracellular matrix organization" "extracellular structure organization" ... $ GeneRatio : chr "27/351" "18/351" "21/351" "21/351" ... $ BgRatio : chr "374/14114" "207/14114" "277/14114" "278/14114" ... $ pvalue : num 7.27e-07 4.25e-06 6.02e-06 6.37e-06 6.73e-06 ... $ p.adjust : num 0.00266 0.00493 0.00493 0.00493 0.00493 ... $ qvalue : num 0.00238 0.0044 0.0044 0.0044 0.0044 ... $ geneID : chr "CACNB2/KLF15/ERRFI1/FOXO1/PDLIM5/LMCD1/LMOD1/NR4A3/SCN7A/ABCC9/OXTR/APBB2/ABAT/FOXO3/CFLAR/ACTA2/LEP/ATP2A2/GDN"| __truncated__ "SORT1/INHBB/ZBTB16/FOXO1/CEBPD/NR4A3/STEAP4/KLF5/LEP/CCDC71L/ALDH6A1/AXIN2/FRZB/NR4A1/LMO3/DIO2/FABP4/ENPP1" "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ "ADAMTS1/PDPN/PTX3/LAMA2/COL11A1/CRISPLD2/GPM6B/COL4A1/COL4A4/CFLAR/NID1/ITGA8/DPT/SPINT2/ANGPTL7/MMP15/COL4A3/A"| __truncated__ ... $ Count : int 27 18 21 21 21 8 23 27 10 8 ... #...Citation T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
We are now left with 35 enriched terms.
Visualize results using
enrichplot a complementary package of
enrichplot is based on
ggplot2 implementation. Therefore, all plots can be customized with
Plot results using a dotplot.
dotplot(s_ego,showCategory=20,font.size=10,label_format=70)+ scale_size_continuous(range=c(1, 7))+ theme_minimal() + ggtitle("GO Enrichment of up-regulated genes")
Scale for size is already present. Adding another scale for size, which will replace the existing scale.
These are the same results exhibited differently. Notice the seamless integration with tidyverse functionality.
clusterProfiler extends ggplot2 functionality to accept enrichment results directly.
s_ego %>% filter(p.adjust < 0.03) %>% ggplot(showCategory = 20, aes(GeneRatio, forcats::fct_reorder(Description, GeneRatio))) + geom_segment(aes(xend=0, yend = Description)) + geom_point(aes(color=p.adjust, size = Count)) + scale_color_viridis_c(guide=guide_colorbar(reverse=TRUE)) + scale_size_continuous(range=c(1, 7)) + theme_minimal() + xlab("Gene Ratio") + ylab(NULL) + ggtitle("GO Enrichment of up-regulated genes")
The category netplot (cnetplot) allows us to visualize the relationship and overlap of genes associated with our top terms (or terms that we specify). This is a nice way of visualizing genes that are important for several enriched terms.
# color genes by log2 fold changes; create named vector foldchanges <- sig_up$log2FoldChange names(foldchanges) <- sig_up$symbol ## by default cnetplot gives the top 5 significant terms #if we want to focus on specific terms we can subset our results s_ego2 <- s_ego s_ego2@result<-s_ego[c(6,15,32),] cnetplot(s_ego2, foldChange=foldchanges, shadowtext='gene', cex_label_gene=0.25, cex_label_category=0.5, color_category="purple")
Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(foldChange = your_value)' instead of 'foldChange'. The foldChange parameter will be removed in the next version.
Warning in cnetplot.enrichResult(x, ...): Use 'color.params = list(category = your_value)' instead of 'color_category'. The color_category parameter will be removed in the next version.
Warning in cnetplot.enrichResult(x, ...): Use 'cex.params = list(category_label = your_value)' instead of 'cex_label_category'. The cex_label_category parameter will be removed in the next version.
Warning in cnetplot.enrichResult(x, ...): Use 'cex.params = list(gene_label = your_value)' instead of 'cex_label_gene'. The cex_label_gene parameter will be removed in the next version.
Scale for size is already present. Adding another scale for size, which will replace the existing scale.
See how the
enrichResult was subset above. These are S4 objects that can be subset using
. To get a quick understanding of the enriched results, we can also use
Other plots with
Other plot types possible with
enrichplot include barplots, heatmaps (as alternatives to cnetplots), tree plots, enrichment maps, upset plots (alternative to venn diagram), ridgeline plots, and GSEA plots.
Over representation analysis using other databases
clusterProfiler you can conduct ORA with GO, KEGG, MKEGG (KEGG modules), and WikiPathways. Using complementary packages (i.e., DOSE, ReactomePA), ORA can also be conducted with Disease Ontology (DO), Reactome, network of cancer genes, and DisGeNET.
#Example with KEGG pathways kgo <- enrichKEGG(gene = sig_up$entrez, organism = 'hsa', pvalueCutoff = 0.05, universe = as.character(dexp$entrez) )
Reading KEGG annotation online: "https://rest.kegg.jp/link/hsa/pathway"...
Reading KEGG annotation online: "https://rest.kegg.jp/list/pathway/hsa"...
clusterProfiler uses the latest version of the KEGG database and supports species with KEGG annotation ( > 6000).
It is possible to merge multiple enrichResult objects; this could be up-regulated merged with down-regulated or different functional repositories.
mlist<-list(s_ego,kgo) names(mlist)<-c("GO-enrich","KEGG-enrich") mresult<-merge_result(mlist) dotplot(mresult,showCategory=10)
Custom ORA using
There are many databases devoted to relating genes and gene products to pathways, processes, and other phenomenon and a multitude of organisms that may not have annotations in widely used databases (e.g., KEGG, GO).
clusterProfiler has generalizable functions that allow users to customize the knowledge database and annotations used for functional enrichment.
Because the gene matrix transposed (GMT) format is often used to share gene set annotations,
clusterProfiler provides the parser functions
read.gmt.wp(), the latter of which is specific to Wikipathways.
We could also use the package
msigdbr to obtain MSigDB gene sets in a tidy format without downloading directly from MSigDB.
Ex <- enricher(sig_up$symbol, TERM2GENE = C2_cp, universe=dexp$symbol) head(Ex)
ID NABA_CORE_MATRISOME NABA_CORE_MATRISOME PID_INTEGRIN1_PATHWAY PID_INTEGRIN1_PATHWAY WP_COPPER_HOMEOSTASIS WP_COPPER_HOMEOSTASIS BIOCARTA_NPP1_PATHWAY BIOCARTA_NPP1_PATHWAY WP_GABA_METABOLISM_AKA_GHB WP_GABA_METABOLISM_AKA_GHB REACTOME_LAMININ_INTERACTIONS REACTOME_LAMININ_INTERACTIONS Description GeneRatio BgRatio NABA_CORE_MATRISOME NABA_CORE_MATRISOME 23/296 235/10641 PID_INTEGRIN1_PATHWAY PID_INTEGRIN1_PATHWAY 10/296 61/10641 WP_COPPER_HOMEOSTASIS WP_COPPER_HOMEOSTASIS 8/296 51/10641 BIOCARTA_NPP1_PATHWAY BIOCARTA_NPP1_PATHWAY 4/296 10/10641 WP_GABA_METABOLISM_AKA_GHB WP_GABA_METABOLISM_AKA_GHB 4/296 10/10641 REACTOME_LAMININ_INTERACTIONS REACTOME_LAMININ_INTERACTIONS 6/296 30/10641 pvalue p.adjust qvalue NABA_CORE_MATRISOME 1.516106e-07 0.0001952744 0.000179858 PID_INTEGRIN1_PATHWAY 6.120069e-06 0.0039413246 0.003630167 WP_COPPER_HOMEOSTASIS 7.335030e-05 0.0269492209 0.024821651 BIOCARTA_NPP1_PATHWAY 1.079261e-04 0.0269492209 0.024821651 WP_GABA_METABOLISM_AKA_GHB 1.079261e-04 0.0269492209 0.024821651 REACTOME_LAMININ_INTERACTIONS 1.488267e-04 0.0269492209 0.024821651 geneID NABA_CORE_MATRISOME SPARCL1/LAMA2/COL11A1/CRISPLD2/SPON1/IGFBP2/COL7A1/FBN2/MFGE8/COL4A1/OMD/COL4A4/RSPO1/NID1/CILP/CRIM1/DPT/THBS1/LGI3/COL4A3/HAPLN3/PRG4/COL28A1 PID_INTEGRIN1_PATHWAY LAMA2/COL11A1/COL7A1/COL4A1/COL4A4/NID1/ITGA10/ITGA8/THBS1/COL4A3 WP_COPPER_HOMEOSTASIS STEAP2/MT2A/STEAP1/FOXO1/MT1E/STEAP4/FOXO3/MT1X BIOCARTA_NPP1_PATHWAY COL4A1/COL4A4/COL4A3/ENPP1 WP_GABA_METABOLISM_AKA_GHB ABAT/GLS/ADHFE1/MAOB REACTOME_LAMININ_INTERACTIONS LAMA2/COL7A1/COL4A1/COL4A4/NID1/COL4A3 Count NABA_CORE_MATRISOME 23 PID_INTEGRIN1_PATHWAY 10 WP_COPPER_HOMEOSTASIS 8 BIOCARTA_NPP1_PATHWAY 4 WP_GABA_METABOLISM_AKA_GHB 4 REACTOME_LAMININ_INTERACTIONS 6
This requires a
TERM2GENE argument. This argument accepts a data frame with two columns: "term" and "gene". "term" includes the term ID, while "gene" includes the mapped gene associated with that term. You may also want to include a
TERM2NAME argument, which is a dataframe with a column for the term ID and the term name.
The complementary function for a customized GSEA, rather than ORA, is
Gene set files
Enrichr has a fantastic library of potential gene sets of interest.
Compare functional profiles between conditions
clusterProfiler is great at biological theme comparison. The function,
compareCluster, allows us to compare multiple conditions, time points, experiments, etc. "Comparing functional profiles can reveal functional consensus and differences among different experiments and helps in identifying differential functional modules in omics datasets" (Wu et al. 2021)
This function takes a list of gene lists from the different conditions we want to compare, creates functional enrichment profiles for each list, and combines the results in a single file that can easily be visualized. Alternatively, you can provide a formula (see here).
comparelist<-list(sig_down$Ensembl,sig_up$Ensembl) names(comparelist)<-c("down-regulated","up-regulated") cclust<-compareCluster(geneCluster = comparelist, fun = enrichGO, OrgDb= org.Hs.eg.db, keyType = "ENSEMBL", ont= "BP", universe=dexp$Ensembl)
Visualize results with a dotplot.
clusterProfiler also supports the Broad Institute software method of gene set enrichment analysis (GSEA) developed by Subramanian et al. 2005. Because this method uses all of the data (complete ranked gene list), this method is able to unveil situations where genes within a gene set "change in a small but coordinated way" (i.e., all of the data is used regardless of arbitrary cut-offs like p-values).
GSEA requires an ordered ranked gene list as input.
ALOX15B ZBTB16 GUCY2D RP11-357D18.1 STEAP4 10.970146 7.273192 6.449094 6.184841 5.112808 ANGPTL7 4.823511
Enrichment Score and normalized enrichment score.
Enrichment Score - represents the degree to which a set is over-represented at the top or bottom of the ranked list.
normalized enrichment score (NES) - allows comparability across gene sets by accounting for differences in gene set size and in correlations between gene sets and the expression dataset. A positive NES is associated with pathway activation, and a negative NES is associated with pathway suppression.
Significance calculation (pvalue) using a permutation test on gene labels.
- Adjustment of multiple hypothesis testing (p.adjust) and FDR control (qvalue).
Leading edge analysis outputs:
The leading edge subset of a gene set is the subset of members that contribute most to the ES. For a positive ES, the leading edge subset is the set of members that appear in the ranked list prior to the peak score. For a negative ES, it is the set of members that appear subsequent to the peak score. --- GSEAUserGuide
tags - the percentage of genes before or after the peak in the running enrichment score, which is an indication of the percentage of genes contributing to the enrichment score.
list - where in the list the enrichment score is attained.
signal - enrichment signal strength.
Returns the core enriched genes, those that contribute the most to a given gene set.
The gene label permutation step uses random reordering. Results may differ between runs if you do not use a random seed generator (e.g.,
Similar to ORA, GSEA can be performed using GO (
gseGO), KEGG pathways (
gseKEGG), KEGG modules (
gseMKEGG), and WikiPathways (
gseWP). With the
DOSE package, you can also use
gseNCG, and with ReactomePA (
There is also a customizable function similar to
enricher, which is
GSEA using GO (
set.seed(123) gsea_go <- gseGO(geneList = GSEA_data, OrgDb = org.Hs.eg.db, ont = "BP", keyType = "SYMBOL", seed=TRUE)
preparing geneSet collections...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.05% of the list). The order of those tied genes will be arbitrary, which may produce unexpected results.
leading edge analysis...
Plot GSEA results
enrichplot offers several functions specific to GSEA output, especially in regard to visualizing the running score and preranked list (
gseaplot2() allows us to visualize several enriched gene sets at once.
enrichplot::gseaplot2(gsea_go, geneSetID = c(1:2))
enrichplot::gseaplot2(gsea_go, geneSetID = c(29,35))
We can use an upset plot to look at fold distributions over gene sets. This requries the installation of the
enrichplot functions can also be applied to GSEA outputs (e.g., dotplot, cnetplot, enrichmap, etc.) or we can use the ggplot2 functionality.
Using clusterProfiler dplyr verb extensions
We can use
clusterProfiler extensions of
dplyr verbs (
arrange()) for easy manipulation of
mutate() can be particularly powerful and allow us to use different enrichment result metrics (e.g., rich factor and fold enrichment for ORA).
Let's filter the
gseaResult to only include NES results greater than 1.95 or less than -2.
#filter gsea_go2 <- filter(gsea_go,NES > 1.95 | NES < -2) dim(gsea_go2)
 10 11
#replot upset plot enrichplot::upsetplot(gsea_go2,n=10)
#plot using a barplot ggplot(gsea_go2, showCategory=10, aes(NES, fct_reorder(Description, NES), fill=qvalue)) + geom_col() + geom_vline(xintercept=0, linetype="dashed", color="blue",size=1)+ scale_fill_gradientn(colours=c("#b3eebe","#46bac2", "#371ea3"), guide=guide_colorbar(reverse=TRUE))+ scale_x_continuous(expand=c(0,0))+ theme_bw() + #theme(text=element_text(size=8))+ xlab("Normalized Enrichment Score") + ylab(NULL) + ggtitle("GSEA with GO")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0. ℹ Please use `linewidth` instead.
Other packages to check out
- Chicco D, Agapito G (2022) Nine quick tips for pathway enrichment analysis. PLoS Comput Biol 18(8): e1010348. https://doi.org/10.1371/journal.pcbi.1010348.
- Wijesooriya K, Jadaan SA, Perera KL, Kaur T, Ziemann M. Urgent need for consistent standards in functional enrichment analysis. PLoS Comput Biol. 2022 Mar 9;18(3):e1009935. doi: 10.1371/journal.pcbi.1009935. PMID: 35263338; PMCID: PMC8936487..
- Ludwig Geistlinger, Gergely Csaba, Mara Santarelli, Marcel Ramos, Lucas Schiffer, Nitesh Turaga, Charity Law, Sean Davis, Vincent Carey, Martin Morgan, Ralf Zimmer, Levi Waldron, Toward a gold standard for benchmarking gene set enrichment analysis, Briefings in Bioinformatics, Volume 22, Issue 1, January 2021, Pages 545–556, https://doi.org/10.1093/bib/bbz158
- Functional Analysis for RNA-seq: Over-representation Analysis
- Functional Analysis for RNA-seq: Functional Class Scoring
- Functional Enrichment Analysis.
R Session info
R version 4.2.3 (2023-03-15) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur ... 10.16 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale:  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages:  stats4 stats graphics grDevices utils datasets methods  base other attached packages:  enrichplot_1.18.4 ReactomePA_1.42.0 DOSE_3.24.2  lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0  dplyr_1.1.2 purrr_1.0.1 readr_2.1.4  tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2  tidyverse_2.0.0 org.Hs.eg.db_3.16.0 AnnotationDbi_1.60.2  IRanges_2.32.0 S4Vectors_0.36.2 Biobase_2.58.0  BiocGenerics_0.44.0 clusterProfiler_4.6.2 loaded via a namespace (and not attached):  ggnewscale_0.4.8 fgsea_1.24.0 colorspace_2.1-0  ggtree_3.6.2 gson_0.1.0 qvalue_2.30.0  XVector_0.38.0 aplot_0.1.10 rstudioapi_0.14  farver_2.1.1 graphlayouts_1.0.0 ggrepel_0.9.3  bit64_4.0.5 fansi_1.0.4 scatterpie_0.1.9  codetools_0.2-19 splines_4.2.3 cachem_1.0.7  GOSemSim_2.24.0 knitr_1.42 polyclip_1.10-4  jsonlite_1.8.4 GO.db_3.16.0 png_0.1-8  graph_1.76.0 graphite_1.44.0 ggforce_0.4.1  compiler_4.2.3 httr_1.4.5 Matrix_1.5-4  fastmap_1.1.1 lazyeval_0.2.2 cli_3.6.1  tweenr_2.0.2 htmltools_0.5.5 tools_4.2.3  igraph_1.4.2 gtable_0.3.3 glue_1.6.2  GenomeInfoDbData_1.2.9 reshape2_1.4.4 rappdirs_0.3.3  fastmatch_1.1-3 Rcpp_1.0.10 vctrs_0.6.2  Biostrings_2.66.0 ape_5.7-1 nlme_3.1-162  ggraph_2.1.0 xfun_0.39 timechange_0.2.0  lifecycle_1.0.3 zlibbioc_1.44.0 MASS_7.3-59  scales_1.2.1 tidygraph_1.2.3 reactome.db_1.82.0  hms_1.1.3 ggupset_0.3.0 parallel_4.2.3  RColorBrewer_1.1-3 yaml_2.3.7 memoise_2.0.1  gridExtra_2.3 downloader_0.4 ggfun_0.0.9  HDO.db_0.99.1 yulab.utils_0.0.6 stringi_1.7.12  RSQLite_2.3.1 tidytree_0.4.2 BiocParallel_1.32.6  GenomeInfoDb_1.34.9 rlang_1.1.0 pkgconfig_2.0.3  bitops_1.0-7 evaluate_0.20 lattice_0.21-8  labeling_0.4.2 treeio_1.22.0 patchwork_1.1.2  htmlwidgets_1.6.2 cowplot_1.1.1 shadowtext_0.1.2  bit_4.0.5 tidyselect_1.2.0 plyr_1.8.8  magrittr_2.0.3 R6_2.5.1 generics_0.1.3  DBI_1.1.3 pillar_1.9.0 withr_2.5.0  KEGGREST_1.38.0 RCurl_1.98-1.12 crayon_1.5.2  utf8_1.2.3 tzdb_0.3.0 rmarkdown_2.21  viridis_0.6.2 grid_4.2.3 data.table_1.14.8  blob_1.2.4 digest_0.6.31 gridGraphics_0.5-1  munsell_0.5.0 viridisLite_0.4.1 ggplotify_0.1.0