Gene Set Enrichment Analysis (GSEA) with R
Lesson Objectives
- Introduce GSEA
- Discuss options for GSEA in R
- Demo GSEA in R
What is GSEA?
Gene Set Enrichment Analysis (GSEA) is a popular and heavily cited method used for functional enrichment / pathway analysis that "determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes)."
By looking at the expression patterns across all genes in a data set, GSEA allows researchers to identify biological pathways or processes that are significantly associated with a particular phenotype or condition, even when individual genes may not show significant changes in expression. Unlike over-representation analysis, it is not sensitive to arbitrary user-defined choices like gene selection.
Like other functional enrichment tools, GSEA can help prioritize candidate genes, proteins, or metabolites for further investigation and provide insights into the underlying biology of complex diseases.
Compare with DAVID
In a past Coding Club, we looked at ORA with DAVID. You can check out that lesson here: (An Introduction to DAVID for Functional Enrichment Analysis).
Steps in a GSEA Analysis
1) Create a ranked list of genes. More on ranking metrics below.
2) Select gene sets to test. There are many available databases with pre-defined gene sets. You should choose gene sets relevant to your experimental conditions / hypotheses.
Gene set does NOT equal pathway
Gene sets are collections of genes formed on the basis of shared biological or functional properties including molecular interactions, regulation, molecular product(s), and even phenotype associations. A gene set can represent a pathway, but not all gene sets are pathways.
3) Calculate the enrichment score (ES). The ES reflects the degree to which the genes in the set are over-represented at the top or bottom of the ranked list.
4) Calculate significance. A modified Komologorov-Smirnov (KS) test is used to determine whether the distribution of genes in a gene set within the ranked list is non-random. In a pre-ranked test, a null distribution of ES is generated by permuting the gene labels, and the observed ES is compared to this null distribution to determine statistical significance. Statistical procedures for p-value generation vary.
5) Apply multiple testing correction. Correct nominal p-value for multiple testing of many gene sets often using false discovery rate (FDR) or Benjamin-Hochberg.
6) Visualize results.
Figure 1: Understanding GSEA.
An enrichment score is calculated by walking down the list of features, increasing a running-sum statistic when a feature in the target feature set is encountered and decreasing it when it is not. The final score is the maximum deviation from zero encountered in the random walk. A normalized score can be obtained by computing the z-score of the estimate compared to a null distribution obtained from N random permutations. The normalized score accounts for differences in gene set size and can be used to compare the enrichment of different gene sets. The genes in the leading edge (see red box) are the genes driving the enrichment (Figure 1).
Things to note
1) The original implementation of GSEA is the BROAD's GSEA Software. The original publication is here. This software has implementations for self-contained tests (using sample permutations) and competitive tests (using gene label permutations). The original self-contained method is more robust but requires larger sample sizes. There is a modified version of this in R, but it is not well documented (See fgsea::fgseaLabel()
).
Self-contained vs Competitive
Self-contained hypothesis:
- Null hypothesis: "no genes in a given gene set are differentially expressed.
- Permutes phenotype labels
- "Focuses on the genes in a given gene set and ignores genes outside the set, providing strong statistical power and rejecting more null hypotheses".
- Requires at least 7 samples per condition
Competitive hypothesis: (pre-ranked GSEA by default)
- Null hypothesis: "genes in a given gene set are at most as often differentially expressed as the genes not in the set."
- Permutes gene labels.
- "Compares genes within a set to genes outside the set."
- Ignores gene-gene correlations leading to more false positives.
Details from Powers et al. 2018
2) Here, we are using a pre-ranked implementation of GSEA known as FGSEA, which is also a competitive method. This method is fast, scalable, and more appropriate for data sets with smaller numbers of replicates (due to gene sampling). Read more about FGSEA and differences from the original pre-ranked GSEA here. In Particular, in FGSEA the p-value estimation is based on an adaptive multi-level split Monte-Carlo scheme, while the multiple comparison correction uses standard BH-adjusted p-values, which makes this method less conservative than the Broad GSEA.
Tools to perform GSEA
There are several tools and packages available for performing GSEA, including but not limited to the following:
- GSEA software: The original GSEA software developed by the Broad Institute, which provides a user-friendly interface for performing GSEA and visualizing results.
- clusterProfiler: An R package that provides functions for GSEA and visualization of results. It supports various gene set databases (e.g., GO, Reactome, KEGG, etc.) and allows for easy integration with other R packages.
- fgsea: An R package that implements a fast version of GSEA, allowing for large-scale analysis of gene sets. Used by clusterProfiler.
- WebGestalt: A web-based tool that provides a comprehensive platform for gene set analysis, including GSEA, pathway analysis, and functional enrichment analysis.
- Qlucore: A web-based tool that provides a user-friendly interface for performing GSEA and visualizing results. It supports various gene set databases and allows for easy integration with other tools. CCR licenses available.
- PartekFlow: A commercial software package that provides a comprehensive platform for gene expression analysis, including GSEA and pathway analysis. It offers a user-friendly interface and supports various gene set databases. NCI licenses available.
Why Use GSEA in R?
-
The Broad version of GSEA minimally requires normalized expression data from RNA-Seq Experiments. For example, you can use:
- normalised RNA-seq counts via DESeq2's 'geometric' normalization, EdgeR's TMM method, etc.
- normalised + transformed RNA-seq expression levels, such as variance-stabilized (vst) or regularized log (rlog) expression levels from DESeq2, or log2 CPMs from EdgeR --- Biostars.
Many of the packages used for normalization and differential expression analysis are already available in R, making it a convenient environment for performing GSEA.
-
R facilitates programmatic implementation of GSEA via a pipeline, and FGSEA enables fast and scalable analysis.
- With R, you can more easily visualize and customize GSEA results.
clusterProfiler
While the fast pre-ranked GSEA (FGSEA) is available through Bioconductor in the package fgsea, it is also implemented in clusterProfiler and other packages.
clusterProfiler is an R package used for functional enrichment analysis (either ORA or GSEA). See the latest publication here. The primary documentation for clusterProfiler can be found here.
Being a Bioconductor package, clusterProfiler is a good choice because
- It has ORA and GSEA support,
- Functionality for multiple ontology and pathway databases, and
- It allows user defined databases and annotations, which is particularly useful for non-model organisms.
If you are interested in the general capabilities of clusterProfiler, see this 2023 tutorial from the BTEP Coding Club.
Load the Packages
library(clusterProfiler) #Bioconductor
library(tidyverse) #CRAN
library(org.Hs.eg.db) #Bioconductor
library(msigdbr) #CRAN
The Data
GSEA analysis requires two inputs, a ranked gene list and gene set collections.
For this example, we will create a ranked gene list using differential expression results from DESeq2 from a tumor-normal experiment. To obtain the data download here.
Load the differential expression results:
degs <- read_csv("deg_NormVTumor.csv") |>
relocate(GeneName, .after=EnsemblID)
Rows: 24456 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): EnsemblID, GeneName
dbl (7): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, posteriorLFC
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
degs
# A tibble: 24,456 × 9
EnsemblID GeneName baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000175899 A2M 19058. 7.45 0.0708 105. 0 0
2 ENSG00000205002 AARD 11578. 10.2 0.232 44.0 0 0
3 ENSG00000090861 AARS1 37863. 2.30 0.0150 153. 0 0
4 ENSG00000165029 ABCA1 2354. 4.14 0.0786 52.7 0 0
5 ENSG00000144452 ABCA12 19414. -9.79 0.154 -63.7 0 0
6 ENSG00000107331 ABCA2 5039. 3.18 0.0443 71.8 0 0
7 ENSG00000103222 ABCC1 36135. 2.08 0.0140 148. 0 0
8 ENSG00000108846 ABCC3 16944. 10.5 0.209 50.1 0 0
9 ENSG00000125257 ABCC4 8719. -2.17 0.0274 -79.3 0 0
10 ENSG00000143322 ABL2 14963. 2.30 0.0214 108. 0 0
# ℹ 24,446 more rows
# ℹ 1 more variable: posteriorLFC <dbl>
Ranking
We are using a pre-ranked version of GSEA (FGSEA). Therefore, the first step is to rank our genes. This can be done in many different ways, and the ranking metric chosen can bias results. You should choose an appropriate ranking metric based on your experimental design and how the data was processed.
Choosing a ranking metric
Commonly used ranking metrics include:
-
Fold Change (preferably shrunken values if DESeq2)
- Genes are ranked by the magnitude of change between conditions, but inflated fold changes are often observed for genes with small counts or large standard errors. "To generate more accurate log2 foldchange estimates, DESeq2 allows for the shrinkage of the LFC estimates toward zero when the information for a gene is low." Shrunken values are often suggested for ranking purposes.
-
Combination of p-value and fold change (e.g., rank by p-value, then by log-fold change)
-
Signed -log10(p-value) (e.g.,
sign(df$log2fc)*(-log10(df$pval))
) - combines significance with the direction of change. However, 0 p-values will result inInf
/-Inf
and require correction (See here). -
Test Statistic (e.g., t-test statistic, Wald test statistic from DESeq2) - This includes information about the effect size and variability. It is conceptually similar to using the p-value.
-
Signal-to_Noise Ratio -
(mean(expression in group 1) - mean(expression in group 2)) / (sd(group 1) + sd(group 2))
- Used by the Broad Institute GSEA software but requires a minimum of 7 samples per group to be valid.
Create the ranked list
We will use the shrunken log fold change values for this example, which are denoted as "posteriorLFC" in degs
. We need to create a named vector to hold our ranked data.
ranked<-degs %>%
filter(!is.na(padj)) %>% #filter any NA values
arrange(desc(posteriorLFC)) %>% #arrange the data by ranking metric
pull(posteriorLFC,name=EnsemblID) #create named vector
head(ranked)
ENSG00000169247 ENSG00000130881 ENSG00000087116 ENSG00000128045 ENSG00000187627
14.86425 14.20553 13.96218 13.74109 13.67227
ENSG00000182013
13.16344
Gene IDs
There are many options for gene IDs. Many tools prefer Entrez IDs or official gene symbols. However, you can convert between IDs fairly easily with clusterProfiler. Keep in mind that gene annotations may differ between genome builds, so take care when converting IDs or using different databases. Here, I am sticking with Ensembl IDs to avoid issues with duplicated gene symbols. Regardless of the ID that you choose, it is important that the gene IDs in your ranked list match the gene IDs in the gene sets tested.
Running GSEA
GSEA can be performed using GO (gseGO
), KEGG pathways (gseKEGG
), KEGG modules (gseMKEGG
), and WikiPathways (gseWP
). With the DOSE package, you can use DisGeNET (DOSE::gseDGN
), Disease Ontology (DOSE::gseDO
), and Network of Cancer Genes and Healthy Drivers (DOSE::gseNCG
), and with ReactomePA (ReactomePA::gsePathway
). Additionally, there is a function (GSEA()
) that can be used in the event that a user is working with "unsupported organisms, slim versions of GO, novel functional annotation (e.g. GO via BlastGO or KEGG via KAAS), unsupported ontologies/pathways or customized annotations."
Obtain the gene sets
What is MSigDB and how does it relate to GSEA?
The Molecular Signatures Database (MSigDB) is a curated resource of thousands of gene sets by the Broad Institute. These sets were curated for use with GSEA software but are used with other tools as well.
MSigDB includes both human and mouse collections, and there are 35,134 gene sets in the most recent release of the Human Molecular Signatures Database (MSigDB v2025.1.Hs updated June 2025).
There are 9 larger themed collections:
Figure 2: The MSigDB Collections
Highlighted examples:
C5 - the gene ontology (GO) collection.
C2 - curated gene sets from publications and pathway databases (e.g., KEGG and REACTOME).
Hallmark collection - a summary list of well-defined gene sets with decreased redundancy. A good choice for many studies.
C7 - great for immunological research.
There is not a GSEA specific function from clusterProfiler incorporating the MSigDB gene sets, but we can use GSEA()
with gene sets obtained from MSigDB. Signatures can be obtained by:
-
Downloading directly from MSigDB and using
read.gmt()
from clusterProfiler to parse the gmt files. Only official gene symbols and Entrez IDs are available in the gmt files. However, you can also access the chip files for Ensembl IDs and other annotations. -
Using
library(msigdbr)
.
The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:
- in an R-friendly “tidy” format with one gene pair per row.
- for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
- as gene symbols as well as NCBI Entrez and Ensembl IDs without accessing external resources requiring an active internet connection ---igordot.github.io/msigdbr/.
We can view the supported species and available collections using
msigdbr_species() # to see the supported species
# A tibble: 20 × 2
species_name species_common_name
<chr> <chr>
1 Anolis carolinensis Carolina anole, green anole
2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
3 Caenorhabditis elegans <NA>
4 Canis lupus familiaris dog, dogs
5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
6 Drosophila melanogaster fruit fly
7 Equus caballus domestic horse, equine, horse
8 Felis catus cat, cats, domestic cat
9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
10 Homo sapiens human
11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
12 Monodelphis domestica gray short-tailed opossum
13 Mus musculus house mouse, mouse
14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
15 Pan troglodytes chimpanzee
16 Rattus norvegicus brown rat, Norway rat, rat, rats
17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
18 Schizosaccharomyces pombe 972h- <NA>
19 Sus scrofa pig, pigs, swine, wild boar
20 Xenopus tropicalis tropical clawed frog, western clawed frog
msigdbr_collections() # to see available collections / subcollections
# A tibble: 25 × 4
gs_collection gs_subcollection gs_collection_name num_genesets
<chr> <chr> <chr> <int>
1 C1 "" Positional 302
2 C2 "CGP" Chemical and Genetic Perturbati… 3494
3 C2 "CP" Canonical Pathways 19
4 C2 "CP:BIOCARTA" BioCarta Pathways 292
5 C2 "CP:KEGG_LEGACY" KEGG Legacy Pathways 186
6 C2 "CP:KEGG_MEDICUS" KEGG Medicus Pathways 658
7 C2 "CP:PID" PID Pathways 196
8 C2 "CP:REACTOME" Reactome Pathways 1736
9 C2 "CP:WIKIPATHWAYS" WikiPathways 830
10 C3 "MIR:MIRDB" miRDB 2377
# ℹ 15 more rows
Note
The version of msigdbr
now matches the version of msigdb queried. As of June 2025, the current version of MSigDB is v2025.1.Hs for human, so msigdbr is not using the most up-to-date version.
To retrieve all human gene sets from msigdb we use
#default is Homo sapiens
msigdbr(species = "Homo sapiens", db_species = "HS")
Choosing your gene sets
You should use your hypotheses to determine what gene sets to include in your analysis. As you increase the number of gene sets included in your analysis, the chances of false negatives will increase. The increased number of conducted tests increases the effect of multiple testing correction and reduces the multiple-test adjusted significance of individual pathways.
You can obtain specific gene sets using the collection
and subcollection
arguments of msigdbr. Here, we will explore the Hallmark collection.
hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes. - MSigDB
hallmark<- msigdbr(collection = "H") %>%
dplyr::select(gs_name, ensembl_gene)
head(hallmark)
# A tibble: 6 × 2
gs_name ensembl_gene
<chr> <chr>
1 HALLMARK_ADIPOGENESIS ENSG00000165029
2 HALLMARK_ADIPOGENESIS ENSG00000197150
3 HALLMARK_ADIPOGENESIS ENSG00000167315
4 HALLMARK_ADIPOGENESIS ENSG00000115361
5 HALLMARK_ADIPOGENESIS ENSG00000117054
6 HALLMARK_ADIPOGENESIS ENSG00000122971
h2 <- msigdbr(collection = "H") %>%
dplyr::select(gs_name, gs_description)
Alternatively, you could filter a data frame of all signatures using dplyr
or base R syntax.
Run GSEA
Using MSigDB, we need to use the generic GSEA()
from clusterProfiler.
By default, GSEA()
will use FGSEA to run the analysis, which defaults to fgsea::fgsea()
and in turn to fgsea::fgseaMultilevel()
. Use ?GSEA()
for the function documentation.
Arguments
Required arguments:
geneList
- the ordered ranked gene list.
TERM2GENE
- a data frame including the terms and genes (the custom gene sets, which in this case were from MSigDB).
Optional arguments:
minGSSize
- the minimum size of a gene set to be analyzed (default = 10).
maxGSSize
- the maximum size of a gene set to be analyzed (default = 500).
TERM2NAME
- a data frame with the user input of TERM TO NAME mapping. You can include additional descriptions here.
pvalueCutoff
- this controls the output in the resulting object. (default = 0.05).
seed
- there is a random component to GSEA due to the gene permutations, and results will differ slightly with each run. To obtain consistent results, you can use set.seed
outside of GSEA()
and seed=TRUE
within GSEA()
.
minGSSize and maxGSSize
The normalization of enrichment scores is "not very accurate for extremely small or extremely large gene sets. For example, for gene sets with fewer than 10 genes, just 2 or 3 genes can generate significant results.".
Other reasons for limiting large gene sets include:
- The run time of the analysis will increase with larger gene sets.
- Large gene sets are more general and therefore, more difficult to interpret.
- "The GSEA null hypothesis is that the set is distributed uniformly at random along the gene ranking. For larger, structured gene sets that tends to be the case, even when there is no biological significance. In particular, artifacts from gene-gene expression correlation start to play an important role. --- Alexey Sergushichev, FGSEA author
by="DOSE
There is an alternative analysis option for GSEA (by="DOSE
). This is the original implementation of GSEA. It is very slow and poorly documented. Use with caution.
#set seed
set.seed(123)
#run GSEA
eh <- GSEA(ranked, TERM2GENE = hallmark, TERM2NAME = h2, seed=TRUE)
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.95% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some of the pathways the P-values were likely overestimated. For
such pathways log2err is set to NA.
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some pathways, in reality P-values are less than 1e-10. You can
set the `eps` argument to zero for better estimation.
leading edge analysis...
done...
head(eh)
ID
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION
HALLMARK_ALLOGRAFT_REJECTION HALLMARK_ALLOGRAFT_REJECTION
HALLMARK_UV_RESPONSE_DN HALLMARK_UV_RESPONSE_DN
HALLMARK_INTERFERON_GAMMA_RESPONSE HALLMARK_INTERFERON_GAMMA_RESPONSE
HALLMARK_ESTROGEN_RESPONSE_EARLY HALLMARK_ESTROGEN_RESPONSE_EARLY
HALLMARK_IL2_STAT5_SIGNALING HALLMARK_IL2_STAT5_SIGNALING
Description
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION Genes defining epithelial-mesenchymal transition, as in wound healing, fibrosis and metastasis.
HALLMARK_ALLOGRAFT_REJECTION Genes up-regulated during transplant rejection.
HALLMARK_UV_RESPONSE_DN Genes down-regulated in response to ultraviolet (UV) radiation.
HALLMARK_INTERFERON_GAMMA_RESPONSE Genes up-regulated in response to IFNG [GeneID=3458].
HALLMARK_ESTROGEN_RESPONSE_EARLY Genes defining early response to estrogen.
HALLMARK_IL2_STAT5_SIGNALING Genes up-regulated by STAT5 in response to IL2 stimulation.
setSize enrichmentScore NES
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 191 0.5970040 2.256140
HALLMARK_ALLOGRAFT_REJECTION 174 -0.6376018 -2.140916
HALLMARK_UV_RESPONSE_DN 143 0.5191196 1.916448
HALLMARK_INTERFERON_GAMMA_RESPONSE 191 -0.5078924 -1.715399
HALLMARK_ESTROGEN_RESPONSE_EARLY 186 0.4543630 1.711498
HALLMARK_IL2_STAT5_SIGNALING 194 -0.4900388 -1.656697
pvalue p.adjust
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 1.000000e-10 2.500000e-09
HALLMARK_ALLOGRAFT_REJECTION 1.000000e-10 2.500000e-09
HALLMARK_UV_RESPONSE_DN 1.444408e-07 2.407347e-06
HALLMARK_INTERFERON_GAMMA_RESPONSE 2.022417e-06 2.112346e-05
HALLMARK_ESTROGEN_RESPONSE_EARLY 2.112346e-06 2.112346e-05
HALLMARK_IL2_STAT5_SIGNALING 1.639905e-05 1.366587e-04
qvalue rank
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 1.526316e-09 3277
HALLMARK_ALLOGRAFT_REJECTION 1.526316e-09 2156
HALLMARK_UV_RESPONSE_DN 1.469749e-06 3044
HALLMARK_INTERFERON_GAMMA_RESPONSE 1.289643e-05 5160
HALLMARK_ESTROGEN_RESPONSE_EARLY 1.289643e-05 2832
HALLMARK_IL2_STAT5_SIGNALING 8.343375e-05 3073
leading_edge
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION tags=50%, list=13%, signal=43%
HALLMARK_ALLOGRAFT_REJECTION tags=36%, list=9%, signal=33%
HALLMARK_UV_RESPONSE_DN tags=31%, list=12%, signal=28%
HALLMARK_INTERFERON_GAMMA_RESPONSE tags=48%, list=21%, signal=38%
HALLMARK_ESTROGEN_RESPONSE_EARLY tags=28%, list=12%, signal=25%
HALLMARK_IL2_STAT5_SIGNALING tags=28%, list=13%, signal=25%
core_enrichment
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION ENSG00000011465/ENSG00000182492/ENSG00000111341/ENSG00000166033/ENSG00000139329/ENSG00000142173/ENSG00000087245/ENSG00000180447/ENSG00000129038/ENSG00000109099/ENSG00000153071/ENSG00000142871/ENSG00000137801/ENSG00000166741/ENSG00000149968/ENSG00000133110/ENSG00000100234/ENSG00000135919/ENSG00000163359/ENSG00000130635/ENSG00000077942/ENSG00000019549/ENSG00000166923/ENSG00000171951/ENSG00000120708/ENSG00000118523/ENSG00000163430/ENSG00000111799/ENSG00000134871/ENSG00000176692/ENSG00000134013/ENSG00000132561/ENSG00000150630/ENSG00000122641/ENSG00000101825/ENSG00000168542/ENSG00000184347/ENSG00000112186/ENSG00000087303/ENSG00000106333/ENSG00000087494/ENSG00000065534/ENSG00000157227/ENSG00000164692/ENSG00000187498/ENSG00000006016/ENSG00000073712/ENSG00000124225/ENSG00000164932/ENSG00000115414/ENSG00000063660/ENSG00000140416/ENSG00000006327/ENSG00000172061/ENSG00000084636/ENSG00000106823/ENSG00000145147/ENSG00000152661/ENSG00000041982/ENSG00000150093/ENSG00000122786/ENSG00000171812/ENSG00000163520/ENSG00000198467/ENSG00000148848/ENSG00000170558/ENSG00000140092/ENSG00000124145/ENSG00000113083/ENSG00000163710/ENSG00000114251/ENSG00000169429/ENSG00000186340/ENSG00000163739/ENSG00000149257/ENSG00000107984/ENSG00000082781/ENSG00000073008/ENSG00000123384/ENSG00000143369/ENSG00000138685/ENSG00000112715/ENSG00000105290/ENSG00000146674/ENSG00000161638/ENSG00000128595/ENSG00000049540/ENSG00000080573/ENSG00000164176/ENSG00000134250/ENSG00000106397/ENSG00000113140/ENSG00000101335/ENSG00000164949/ENSG00000140937
HALLMARK_ALLOGRAFT_REJECTION ENSG00000113520/ENSG00000254087/ENSG00000232810/ENSG00000015285/ENSG00000109943/ENSG00000125538/ENSG00000126759/ENSG00000204287/ENSG00000162711/ENSG00000115085/ENSG00000163519/ENSG00000169194/ENSG00000043462/ENSG00000172724/ENSG00000136634/ENSG00000140749/ENSG00000066336/ENSG00000104814/ENSG00000124882/ENSG00000100365/ENSG00000116824/ENSG00000118971/ENSG00000072694/ENSG00000147889/ENSG00000096996/ENSG00000054219/ENSG00000160791/ENSG00000019582/ENSG00000113263/ENSG00000111537/ENSG00000140030/ENSG00000060982/ENSG00000180644/ENSG00000153283/ENSG00000000938/ENSG00000100385/ENSG00000090339/ENSG00000137265/ENSG00000122862/ENSG00000113302/ENSG00000005844/ENSG00000154096/ENSG00000196735/ENSG00000185507/ENSG00000081237/ENSG00000163131/ENSG00000180353/ENSG00000182866/ENSG00000147168/ENSG00000121594/ENSG00000101017/ENSG00000105369/ENSG00000140968/ENSG00000137078/ENSG00000113532/ENSG00000204252/ENSG00000102962/ENSG00000275302/ENSG00000166501/ENSG00000114013/ENSG00000112799/ENSG00000134460/ENSG00000186810
HALLMARK_UV_RESPONSE_DN ENSG00000115461/ENSG00000170430/ENSG00000109099/ENSG00000153071/ENSG00000142871/ENSG00000169439/ENSG00000122420/ENSG00000019549/ENSG00000170961/ENSG00000166949/ENSG00000168542/ENSG00000112186/ENSG00000105976/ENSG00000164741/ENSG00000170425/ENSG00000147862/ENSG00000169047/ENSG00000164692/ENSG00000070778/ENSG00000125968/ENSG00000140443/ENSG00000049323/ENSG00000152661/ENSG00000117525/ENSG00000064309/ENSG00000196876/ENSG00000140092/ENSG00000058668/ENSG00000171132/ENSG00000010810/ENSG00000147852/ENSG00000182718/ENSG00000163513/ENSG00000105974/ENSG00000132170/ENSG00000180340/ENSG00000104067/ENSG00000173482/ENSG00000099250/ENSG00000005249/ENSG00000174738/ENSG00000115380/ENSG00000078053/ENSG00000134250/ENSG00000157110
HALLMARK_INTERFERON_GAMMA_RESPONSE ENSG00000115415/ENSG00000158321/ENSG00000068079/ENSG00000104312/ENSG00000156587/ENSG00000119922/ENSG00000067057/ENSG00000169245/ENSG00000205220/ENSG00000133639/ENSG00000197329/ENSG00000184588/ENSG00000002549/ENSG00000077238/ENSG00000138378/ENSG00000131979/ENSG00000140853/ENSG00000164136/ENSG00000164305/ENSG00000100906/ENSG00000166710/ENSG00000174175/ENSG00000111335/ENSG00000135148/ENSG00000204257/ENSG00000168310/ENSG00000137959/ENSG00000111331/ENSG00000026103/ENSG00000177409/ENSG00000206503/ENSG00000184979/ENSG00000066583/ENSG00000157601/ENSG00000173193/ENSG00000124256/ENSG00000198805/ENSG00000119917/ENSG00000196116/ENSG00000073756/ENSG00000271503/ENSG00000116711/ENSG00000134470/ENSG00000111679/ENSG00000118640/ENSG00000213809/ENSG00000115267/ENSG00000162692/ENSG00000124762/ENSG00000216490/ENSG00000140280/ENSG00000115525/ENSG00000104432/ENSG00000130303/ENSG00000183347/ENSG00000137752/ENSG00000185338/ENSG00000132530/ENSG00000118503/ENSG00000127951/ENSG00000185201/ENSG00000174600/ENSG00000185215/ENSG00000121858/ENSG00000179583/ENSG00000196126/ENSG00000168899/ENSG00000128604/ENSG00000143184/ENSG00000043462/ENSG00000120217/ENSG00000162654/ENSG00000183486/ENSG00000019582/ENSG00000136514/ENSG00000100385/ENSG00000090339/ENSG00000137265/ENSG00000133106/ENSG00000196735/ENSG00000185507/ENSG00000110848/ENSG00000004468/ENSG00000101017/ENSG00000139626/ENSG00000140968/ENSG00000113532/ENSG00000100368/ENSG00000114013/ENSG00000026751/ENSG00000110324
HALLMARK_ESTROGEN_RESPONSE_EARLY ENSG00000130558/ENSG00000163083/ENSG00000152137/ENSG00000087510/ENSG00000074410/ENSG00000111057/ENSG00000064205/ENSG00000197977/ENSG00000171346/ENSG00000003137/ENSG00000154153/ENSG00000155792/ENSG00000164741/ENSG00000170421/ENSG00000172137/ENSG00000054598/ENSG00000138119/ENSG00000150687/ENSG00000006534/ENSG00000125850/ENSG00000075275/ENSG00000185052/ENSG00000151892/ENSG00000140443/ENSG00000182704/ENSG00000164128/ENSG00000152661/ENSG00000168453/ENSG00000173156/ENSG00000147394/ENSG00000103257/ENSG00000196372/ENSG00000146242/ENSG00000113739/ENSG00000100003/ENSG00000071242/ENSG00000198682/ENSG00000170345/ENSG00000175793/ENSG00000068615/ENSG00000088002/ENSG00000187720/ENSG00000164742/ENSG00000171345/ENSG00000123066/ENSG00000152767/ENSG00000124839/ENSG00000136826/ENSG00000103067/ENSG00000162496/ENSG00000173227/ENSG00000100321/ENSG00000003989
HALLMARK_IL2_STAT5_SIGNALING ENSG00000107968/ENSG00000099860/ENSG00000042493/ENSG00000069667/ENSG00000198959/ENSG00000100473/ENSG00000159399/ENSG00000117595/ENSG00000061918/ENSG00000185291/ENSG00000185338/ENSG00000127951/ENSG00000028137/ENSG00000123243/ENSG00000121858/ENSG00000025039/ENSG00000056558/ENSG00000132718/ENSG00000027075/ENSG00000120833/ENSG00000101160/ENSG00000135318/ENSG00000169194/ENSG00000049249/ENSG00000136634/ENSG00000186827/ENSG00000171791/ENSG00000186891/ENSG00000163508/ENSG00000118495/ENSG00000162654/ENSG00000147872/ENSG00000118971/ENSG00000115590/ENSG00000085563/ENSG00000169432/ENSG00000140030/ENSG00000196664/ENSG00000030419/ENSG00000100385/ENSG00000007312/ENSG00000120659/ENSG00000137265/ENSG00000188404/ENSG00000169291/ENSG00000140968/ENSG00000168421/ENSG00000120949/ENSG00000143891/ENSG00000117091/ENSG00000114013/ENSG00000163600/ENSG00000134460/ENSG00000110324/ENSG00000156127
Results
The resulting object eh
is a gseaResult object. This object contains the results (eh@result
) and other information that went into the analysis, for example, @organism
type, @setType
, the @geneSets
used, the genes in our @geneList
, and the @params
used for the analysis. Note the use of @
, which is used to access various slots within the object.
glimpse(eh)
Formal class 'gseaResult' [package "DOSE"] with 13 slots
..@ result :'data.frame': 18 obs. of 11 variables:
.. ..$ ID : chr [1:18] "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION" "HALLMARK_ALLOGRAFT_REJECTION" "HALLMARK_UV_RESPONSE_DN" "HALLMARK_INTERFERON_GAMMA_RESPONSE" ...
.. ..$ Description : chr [1:18] "Genes defining epithelial-mesenchymal transition, as in wound healing, fibrosis and metastasis." "Genes up-regulated during transplant rejection." "Genes down-regulated in response to ultraviolet (UV) radiation." "Genes up-regulated in response to IFNG [GeneID=3458]." ...
.. ..$ setSize : int [1:18] 191 174 143 191 186 194 178 189 169 181 ...
.. ..$ enrichmentScore: num [1:18] 0.597 -0.638 0.519 -0.508 0.454 ...
.. ..$ NES : num [1:18] 2.26 -2.14 1.92 -1.72 1.71 ...
.. ..$ pvalue : num [1:18] 1.00e-10 1.00e-10 1.44e-07 2.02e-06 2.11e-06 ...
.. ..$ p.adjust : num [1:18] 2.50e-09 2.50e-09 2.41e-06 2.11e-05 2.11e-05 ...
.. ..$ qvalue : num [1:18] 1.53e-09 1.53e-09 1.47e-06 1.29e-05 1.29e-05 ...
.. ..$ rank : num [1:18] 3277 2156 3044 5160 2832 ...
.. ..$ leading_edge : chr [1:18] "tags=50%, list=13%, signal=43%" "tags=36%, list=9%, signal=33%" "tags=31%, list=12%, signal=28%" "tags=48%, list=21%, signal=38%" ...
.. ..$ core_enrichment: chr [1:18] "ENSG00000011465/ENSG00000182492/ENSG00000111341/ENSG00000166033/ENSG00000139329/ENSG00000142173/ENSG00000087245"| __truncated__ "ENSG00000113520/ENSG00000254087/ENSG00000232810/ENSG00000015285/ENSG00000109943/ENSG00000125538/ENSG00000126759"| __truncated__ "ENSG00000115461/ENSG00000170430/ENSG00000109099/ENSG00000153071/ENSG00000142871/ENSG00000169439/ENSG00000122420"| __truncated__ "ENSG00000115415/ENSG00000158321/ENSG00000068079/ENSG00000104312/ENSG00000156587/ENSG00000119922/ENSG00000067057"| __truncated__ ...
..@ organism : chr "UNKNOWN"
..@ setType : chr "UNKNOWN"
..@ geneSets :List of 50
.. ..$ HALLMARK_ADIPOGENESIS : chr [1:200] "ENSG00000165029" "ENSG00000197150" "ENSG00000167315" "ENSG00000115361" ...
.. ..$ HALLMARK_ALLOGRAFT_REJECTION : chr [1:200] "ENSG00000090861" "ENSG00000164163" "ENSG00000136754" "ENSG00000087085" ...
.. ..$ HALLMARK_ANDROGEN_RESPONSE : chr [1:101] "ENSG00000125257" "ENSG00000140526" "ENSG00000123983" "ENSG00000072110" ...
.. ..$ HALLMARK_ANGIOGENESIS : chr [1:36] "ENSG00000091583" "ENSG00000142192" "ENSG00000118971" "ENSG00000168542" ...
.. ..$ HALLMARK_APICAL_JUNCTION : chr [1:200] "ENSG00000143632" "ENSG00000075624" "ENSG00000159251" "ENSG00000184009" ...
.. ..$ HALLMARK_APICAL_SURFACE : chr [1:44] "ENSG00000137845" "ENSG00000006831" "ENSG00000169129" "ENSG00000118507" ...
.. ..$ HALLMARK_APOPTOSIS : chr [1:161] "ENSG00000087274" "ENSG00000183773" "ENSG00000154122" "ENSG00000135046" ...
.. ..$ HALLMARK_BILE_ACID_METABOLISM : chr [1:112] "ENSG00000165029" "ENSG00000107331" "ENSG00000167972" "ENSG00000198691" ...
.. ..$ HALLMARK_CHOLESTEROL_HOMEOSTASIS : chr [1:74] "ENSG00000107331" "ENSG00000120437" "ENSG00000131069" "ENSG00000184009" ...
.. ..$ HALLMARK_COAGULATION : chr [1:139] "ENSG00000175899" "ENSG00000168306" "ENSG00000168615" "ENSG00000214274" ...
.. ..$ HALLMARK_COMPLEMENT : chr [1:201] "ENSG00000077522" "ENSG00000168615" "ENSG00000274286" "ENSG00000108599" ...
.. ..$ HALLMARK_DNA_REPAIR : chr [1:150] "ENSG00000094914" "ENSG00000196839" "ENSG00000174233" "ENSG00000130706" ...
.. ..$ HALLMARK_E2F_TARGETS : chr [1:200] "ENSG00000004455" "ENSG00000143401" "ENSG00000111875" "ENSG00000105011" ...
.. ..$ HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION: chr [1:200] "ENSG00000154175" "ENSG00000107796" "ENSG00000148848" "ENSG00000166825" ...
.. ..$ HALLMARK_ESTROGEN_RESPONSE_EARLY : chr [1:200] "ENSG00000183044" "ENSG00000167972" "ENSG00000140526" "ENSG00000099204" ...
.. ..$ HALLMARK_ESTROGEN_RESPONSE_LATE : chr [1:200] "ENSG00000167972" "ENSG00000140526" "ENSG00000168306" "ENSG00000148700" ...
.. ..$ HALLMARK_FATTY_ACID_METABOLISM : chr [1:158] "ENSG00000109576" "ENSG00000060971" "ENSG00000167315" "ENSG00000115361" ...
.. ..$ HALLMARK_G2M_CHECKPOINT : chr [1:200] "ENSG00000097007" "ENSG00000123505" "ENSG00000032219" "ENSG00000169136" ...
.. ..$ HALLMARK_GLYCOLYSIS : chr [1:200] "ENSG00000115657" "ENSG00000170425" "ENSG00000162688" "ENSG00000188157" ...
.. ..$ HALLMARK_HEDGEHOG_SIGNALING : chr [1:36] "ENSG00000087085" "ENSG00000205336" "ENSG00000126016" "ENSG00000176749" ...
.. ..$ HALLMARK_HEME_METABOLISM : chr [1:200] "ENSG00000115657" "ENSG00000118777" "ENSG00000213088" "ENSG00000102575" ...
.. ..$ HALLMARK_HYPOXIA : chr [1:200] "ENSG00000144476" "ENSG00000148926" "ENSG00000170425" "ENSG00000162433" ...
.. ..$ HALLMARK_IL2_STAT5_SIGNALING : chr [1:199] "ENSG00000085563" "ENSG00000135074" "ENSG00000204305" "ENSG00000101444" ...
.. ..$ HALLMARK_IL6_JAK_STAT3_SIGNALING : chr [1:87] "ENSG00000175899" "ENSG00000135503" "ENSG00000139567" "ENSG00000030110" ...
.. ..$ HALLMARK_INFLAMMATORY_RESPONSE : chr [1:201] "ENSG00000165029" "ENSG00000136754" "ENSG00000135503" "ENSG00000121989" ...
.. ..$ HALLMARK_INTERFERON_ALPHA_RESPONSE : chr [1:97] "ENSG00000160710" "ENSG00000166710" "ENSG00000168062" "ENSG00000130303" ...
.. ..$ HALLMARK_INTERFERON_GAMMA_RESPONSE : chr [1:200] "ENSG00000160710" "ENSG00000221963" "ENSG00000150347" "ENSG00000122644" ...
.. ..$ HALLMARK_KRAS_SIGNALING_DN : chr [1:200] "ENSG00000073734" "ENSG00000172350" "ENSG00000159251" "ENSG00000184160" ...
.. ..$ HALLMARK_KRAS_SIGNALING_UP : chr [1:200] "ENSG00000085563" "ENSG00000159640" "ENSG00000151694" "ENSG00000151651" ...
.. ..$ HALLMARK_MITOTIC_SPINDLE : chr [1:199] "ENSG00000136754" "ENSG00000097007" "ENSG00000159842" "ENSG00000130402" ...
.. ..$ HALLMARK_MTORC1_SIGNALING : chr [1:200] "ENSG00000033050" "ENSG00000278540" "ENSG00000131473" "ENSG00000123983" ...
.. ..$ HALLMARK_MYC_TARGETS_V1 : chr [1:200] "ENSG00000164163" "ENSG00000143727" "ENSG00000106305" "ENSG00000177879" ...
.. ..$ HALLMARK_MYC_TARGETS_V2 : chr [1:58] "ENSG00000106305" "ENSG00000112578" "ENSG00000122565" "ENSG00000135446" ...
.. ..$ HALLMARK_MYOGENESIS : chr [1:200] "ENSG00000099204" "ENSG00000087085" "ENSG00000151726" "ENSG00000143632" ...
.. ..$ HALLMARK_NOTCH_SIGNALING : chr [1:32] "ENSG00000117362" "ENSG00000137486" "ENSG00000110092" "ENSG00000055130" ...
.. ..$ HALLMARK_OXIDATIVE_PHOSPHORYLATION : chr [1:200] "ENSG00000131269" "ENSG00000060971" "ENSG00000167315" "ENSG00000117054" ...
.. ..$ HALLMARK_P53_PATHWAY : chr [1:200] "ENSG00000183044" "ENSG00000114770" "ENSG00000100439" "ENSG00000135503" ...
.. ..$ HALLMARK_PANCREAS_BETA_CELLS : chr [1:40] "ENSG00000006071" "ENSG00000117020" "ENSG00000100604" "ENSG00000077279" ...
.. ..$ HALLMARK_PEROXISOME : chr [1:104] "ENSG00000085563" "ENSG00000005471" "ENSG00000150967" "ENSG00000114770" ...
.. ..$ HALLMARK_PI3K_AKT_MTOR_SIGNALING : chr [1:105] "ENSG00000278540" "ENSG00000138071" "ENSG00000115091" "ENSG00000078295" ...
.. ..$ HALLMARK_PROTEIN_SECRETION : chr [1:96] "ENSG00000165029" "ENSG00000137845" "ENSG00000143401" "ENSG00000166747" ...
.. ..$ HALLMARK_REACTIVE_OXYGEN_SPECIES_PATHWAY : chr [1:49] "ENSG00000103222" "ENSG00000177556" "ENSG00000121691" "ENSG00000129355" ...
.. ..$ HALLMARK_SPERMATOGENESIS : chr [1:135] "ENSG00000159640" "ENSG00000111644" "ENSG00000134940" "ENSG00000148156" ...
.. ..$ HALLMARK_TGF_BETA_SIGNALING : chr [1:54] "ENSG00000115170" "ENSG00000134982" "ENSG00000054267" "ENSG00000137936" ...
.. ..$ HALLMARK_TNFA_SIGNALING_VIA_NFKB : chr [1:200] "ENSG00000165029" "ENSG00000144476" "ENSG00000109321" "ENSG00000162772" ...
.. ..$ HALLMARK_UNFOLDED_PROTEIN_RESPONSE : chr [1:113] "ENSG00000059573" "ENSG00000101199" "ENSG00000070669" "ENSG00000162772" ...
.. ..$ HALLMARK_UV_RESPONSE_DN : chr [1:144] "ENSG00000103222" "ENSG00000121989" "ENSG00000148700" "ENSG00000117114" ...
.. ..$ HALLMARK_UV_RESPONSE_UP : chr [1:158] "ENSG00000085563" "ENSG00000060971" "ENSG00000123908" "ENSG00000023330" ...
.. ..$ HALLMARK_WNT_BETA_CATENIN_SIGNALING : chr [1:42] "ENSG00000151694" "ENSG00000103126" "ENSG00000168646" "ENSG00000118971" ...
.. ..$ HALLMARK_XENOBIOTIC_METABOLISM : chr [1:200] "ENSG00000023839" "ENSG00000108846" "ENSG00000173208" "ENSG00000163686" ...
..@ geneList : Named num [1:24456] 14.9 14.2 14 13.7 13.7 ...
.. ..- attr(*, "names")= chr [1:24456] "ENSG00000169247" "ENSG00000130881" "ENSG00000087116" "ENSG00000128045" ...
..@ keytype : chr "UNKNOWN"
..@ permScores : num[0 , 0 ]
..@ params :List of 6
.. ..$ pvalueCutoff : num 0.05
.. ..$ eps : num 1e-10
.. ..$ pAdjustMethod: chr "BH"
.. ..$ exponent : num 1
.. ..$ minGSSize : num 10
.. ..$ maxGSSize : num 500
..@ gene2Symbol: chr(0)
..@ readable : logi FALSE
..@ termsim : num[0 , 0 ]
..@ method : chr(0)
..@ dr : list()
Because we used Ensembl IDs, we may be interested in viewing the output as gene symbols. We can translate the gene IDs to gene symbols using setReadable()
.
eh_gs <- setReadable(eh, OrgDb = org.Hs.eg.db, keyType="ENSEMBL")
eh_gs
#
# Gene Set Enrichment Analysis
#
#...@organism UNKNOWN
#...@setType UNKNOWN
#...@keytype ENSEMBL
#...@geneList Named num [1:24456] 14.9 14.2 14 13.7 13.7 ...
- attr(*, "names")= chr [1:24456] "ENSG00000169247" "ENSG00000130881" "ENSG00000087116" "ENSG00000128045" ...
#...nPerm
#...pvalues adjusted by 'BH' with cutoff <0.05
#...18 enriched terms found
'data.frame': 18 obs. of 11 variables:
$ ID : chr "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION" "HALLMARK_ALLOGRAFT_REJECTION" "HALLMARK_UV_RESPONSE_DN" "HALLMARK_INTERFERON_GAMMA_RESPONSE" ...
$ Description : chr "Genes defining epithelial-mesenchymal transition, as in wound healing, fibrosis and metastasis." "Genes up-regulated during transplant rejection." "Genes down-regulated in response to ultraviolet (UV) radiation." "Genes up-regulated in response to IFNG [GeneID=3458]." ...
$ setSize : int 191 174 143 191 186 194 178 189 169 181 ...
$ enrichmentScore: num 0.597 -0.638 0.519 -0.508 0.454 ...
$ NES : num 2.26 -2.14 1.92 -1.72 1.71 ...
$ pvalue : num 1.00e-10 1.00e-10 1.44e-07 2.02e-06 2.11e-06 ...
$ p.adjust : num 2.50e-09 2.50e-09 2.41e-06 2.11e-05 2.11e-05 ...
$ qvalue : num 1.53e-09 1.53e-09 1.47e-06 1.29e-05 1.29e-05 ...
$ rank : num 3277 2156 3044 5160 2832 ...
$ leading_edge : chr "tags=50%, list=13%, signal=43%" "tags=36%, list=9%, signal=33%" "tags=31%, list=12%, signal=28%" "tags=48%, list=21%, signal=38%" ...
$ core_enrichment: chr "DCN/BGN/MGP/HTRA1/LUM/COL6A2/MMP2/GAS1/LOXL1/PMP22/DAB2/CCN1/THBS1/NNMT/MMP3/POSTN/TIMP3/SERPINE2/COL6A3/COL5A1"| __truncated__ "IL4/LYN/TNF/WAS/CRTAM/IL1B/CFP/HLA-DRA/NLRP3/ZAP70/TRAT1/IL13/LCP2/CCL19/IL10/IGSF6/SPI1/MAP4K1/EREG/NCF4/CD2/C"| __truncated__ "IGFBP5/MGMT/PMP22/DAB2/CCN1/SDC2/PTGFR/SNAI2/HAS2/SMAD3/COL3A1/CAP2/MET/DLC1/ADORA2B/NFIB/IRS1/COL1A2/PTPN21/ID"| __truncated__ "STAT1/AUTS2/IFI35/RIPK2/UBE2L6/IFIT2/PFKP/CXCL10/PSMB10/BTG1/PELI1/PDE4B/LAP3/IL4R/STAT4/GCH1/NLRC5/IL15/CASP3/"| __truncated__ ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
Within our @result
, we have:
ID
- The term ID (gene set ID from MSigDB).
Description
- A description of the term ID.
setSize
- size of the pathway after removing genes not present in names(GeneList)
.
enrichmentScore
- Same as in Broad GSEA implementation. The degree to which a set is over-represented at the top or bottom of the ranked list (negative = bottom, positive = top).
NES
- allows comparability across gene sets by accounting for differences in gene set size and in correlations between gene sets and the expression dataset.
pvalue
- Significance calculation (p-value) using a permutation test on gene labels.
p.adjust
- a BH-adjusted p-value.
qvalue
- another take on FDR adjusted p-values; See ?qvalue::qvalue
.
rank
- The position in the ranked list at which the maximum enrichment score occurred.
leading_edge
-
- tags - the percentage of genes contributing to the enrichment score
- list - where in the list the enrichment score is attained
- signal - enrichment signal strength
core_enrichment
- Genes that contribute to the leading-edge subset within the gene set. This is the subset of genes that contributes most to the enrichment result.
glimpse(eh_gs@result)
Rows: 18
Columns: 11
$ ID <chr> "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION", "HALLMAR…
$ Description <chr> "Genes defining epithelial-mesenchymal transition, as …
$ setSize <int> 191, 174, 143, 191, 186, 194, 178, 189, 169, 181, 106,…
$ enrichmentScore <dbl> 0.5970040, -0.6376018, 0.5191196, -0.5078924, 0.454363…
$ NES <dbl> 2.256140, -2.140916, 1.916448, -1.715399, 1.711498, -1…
$ pvalue <dbl> 1.000000e-10, 1.000000e-10, 1.444408e-07, 2.022417e-06…
$ p.adjust <dbl> 2.500000e-09, 2.500000e-09, 2.407347e-06, 2.112346e-05…
$ qvalue <dbl> 1.526316e-09, 1.526316e-09, 1.469749e-06, 1.289643e-05…
$ rank <dbl> 3277, 2156, 3044, 5160, 2832, 3073, 3683, 2769, 3261, …
$ leading_edge <chr> "tags=50%, list=13%, signal=43%", "tags=36%, list=9%, …
$ core_enrichment <chr> "DCN/BGN/MGP/HTRA1/LUM/COL6A2/MMP2/GAS1/LOXL1/PMP22/DA…
Check out more on the GSEA function used here.
GSEA with GO
As mentioned, clusterProfiler also includes GSEA functions for specific functional databases. For example, we can look at the enrichment of terms from the Gene Ontology Consortium (GO) using gseGO()
.
For this function, we need to provide our ranked genes (geneList
), the organism (OrgDb
) object, the type of identifier used for the genes (keyType
), and the GO namespace (one of "BP", "CC" and "MF"). keyType
and the OrgDb
object allow gseGO()
to convert IDs as needed. Learn more about the organism annotation packages here.
#get keytype info for org.Hs.eg.db
keytypes(org.Hs.eg.db)
[1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
[6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
[11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
[16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
[21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[26] "UNIPROT"
#run gsea
gsea_go <- gseGO(geneList = ranked,
OrgDb = org.Hs.eg.db,
ont = "BP",
keyType = "ENSEMBL",
by="fgsea")
using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).
preparing geneSet collections...
GSEA analysis...
Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (1.95% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some of the pathways the P-values were likely overestimated. For
such pathways log2err is set to NA.
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some pathways, in reality P-values are less than 1e-10. You can
set the `eps` argument to zero for better estimation.
leading edge analysis...
done...
gsea_go
#
# Gene Set Enrichment Analysis
#
#...@organism Homo sapiens
#...@setType BP
#...@keytype ENSEMBL
#...@geneList Named num [1:24456] 14.9 14.2 14 13.7 13.7 ...
- attr(*, "names")= chr [1:24456] "ENSG00000169247" "ENSG00000130881" "ENSG00000087116" "ENSG00000128045" ...
#...nPerm
#...pvalues adjusted by 'BH' with cutoff <0.05
#...649 enriched terms found
'data.frame': 649 obs. of 11 variables:
$ ID : chr "GO:0050853" "GO:0035107" "GO:0035108" "GO:0050851" ...
$ Description : chr "B cell receptor signaling pathway" "appendage morphogenesis" "limb morphogenesis" "antigen receptor-mediated signaling pathway" ...
$ setSize : int 70 132 132 185 107 107 418 165 165 205 ...
$ enrichmentScore: num -0.794 0.64 0.64 -0.684 0.638 ...
$ NES : num -2.38 2.33 2.33 -2.3 2.26 ...
$ pvalue : num 1.00e-10 1.00e-10 1.00e-10 1.00e-10 1.02e-10 ...
$ p.adjust : num 1.03e-08 1.03e-08 1.03e-08 1.03e-08 1.03e-08 ...
$ qvalue : num 8.52e-09 8.52e-09 8.52e-09 8.52e-09 8.52e-09 ...
$ rank : num 3377 3042 3042 3436 4025 ...
$ leading_edge : chr "tags=67%, list=14%, signal=58%" "tags=42%, list=12%, signal=37%" "tags=42%, list=12%, signal=37%" "tags=42%, list=14%, signal=36%" ...
$ core_enrichment: chr "ENSG00000111679/ENSG00000171608/ENSG00000110031/ENSG00000087088/ENSG00000197943/ENSG00000137101/ENSG00000081189"| __truncated__ "ENSG00000163064/ENSG00000006377/ENSG00000120149/ENSG00000105880/ENSG00000162878/ENSG00000066468/ENSG00000151067"| __truncated__ "ENSG00000163064/ENSG00000006377/ENSG00000120149/ENSG00000105880/ENSG00000162878/ENSG00000066468/ENSG00000151067"| __truncated__ "ENSG00000182389/ENSG00000213658/ENSG00000111679/ENSG00000107485/ENSG00000171608/ENSG00000110031/ENSG00000131981"| __truncated__ ...
#...Citation
S Xu, E Hu, Y Cai, Z Xie, X Luo, L Zhan, W Tang, Q Wang, B Liu, R Wang, W Xie, T Wu, L Xie, G Yu. Using clusterProfiler to characterize multiomics data. Nature Protocols. 2024, 19(11):3292-3320
Plot GSEA results
enrichplot
offers several functions specific to GSEA output, especially in regard to visualizing the running score and pre-ranked list (gseaplot()
and gseaplot2()
). gseaplot2()
allows us to visualize several enriched gene sets at once.
enrichplot::gseaplot2(eh_gs, geneSetID = c(1,3,5))
enrichplot::gseaplot2(eh_gs, geneSetID = c(2,4))
We can also use ggplot2 helpers to easily customize figures, or you can save the results to a data frame and build any custom figure.
#plot using a barplot
ggplot(eh_gs, showCategory=10, aes(NES, fct_reorder(Description, NES),
fill=p.adjust)) +
geom_col() +
geom_vline(xintercept=0, linetype="dashed", color="blue",size=1)+
scale_fill_gradientn(colours=c("#b3eebe","#46bac2", "#371ea3"),
guide=guide_colorbar(reverse=TRUE))+
scale_x_continuous(expand=c(0,0))+
theme_bw() +
xlab("Normalized Enrichment Score") +
ylab(NULL) +
ggtitle("GSEA (Hallmark)")
See further options for visualization here and here.