Secondary and Downstream Analysis with Multiple Samples in Seurat

Modified

March 26, 2026

1. Introduction and Learning Objectives

In previous lessons, we explored analyzing single samples and merged samples in Seurat. In this lesson, we will focus on performing secondary and downstream analyses with multiple samples in Seurat.

Through the course of this tutorial, we will explore:

Extend preliminary annotation of cell types and how to use them in evaluating secondary clustering
Combine samples and explore how to use and evaluate different methods of batch correction.
Perform preliminary differential expressionanalysis of conditions, based on cell types and clusters
- This will also explore how to identify and prepare subsets of cells
Additional tools that can be explored at your leisure, including pathway analysis, trajectory analysis, CNV inference, and more.

1.1 Data

We will be using a subset of the data published by Morrisey et al. in 2025 that explored lung tissue in mice recovering from viral infection similar to the human SARS-CoV-2 virus. The data we will examine will look at 2 replicates of mice at homeostasis in early infection and 2 replicates of mice 6 weeks after infections that were treated with tamoxifen at 2 weeks post-infection.

The original datasets have been included in .h5 counts formats, which can be processed for filtering and count normalization from scratch. This tutorial will be using the processed datasets, which have been filtered, normalized, and clustered. These are present with the .rds extension.

1.2 References and Packages

Reference vignettes: The underlying code of this tutorial is adapted from the following vignettes:

Packages: The following R packages will need to be installed to run the code in this tutorial:

Seurat (version 5.0 or higher)
SeuratWrappers
ggplot2
harmony
SingleR
celldex
MAST
presto

# Base packages for Seurat and visualization
library(Seurat)
library(SeuratWrappers)
library(ggplot2)

# One option for batch correction
library(harmony)

# Packages for cell type annotation
library(SingleR)
library(celldex)

# Packages for differential expression
library(MAST)
library(presto) 

#Packages for secondary/tertiary analysis
library(GSEABase)
library(fgsea)
library(monocle3)

Optional packages that can be explored for additional analyses include:

GSEABase
monocle3
infercnv
CopyKAT
SCEVAN

Tutorial Scope and Limitations

This tutorial is a demonstration of a potential series of analysis steps that can be performed with single cell RNASeq data. It is not intended to be a comprehensive or definitive guide and should not be considered an endorsement of the tools being utilized, as many other tools can be used to perform similar analyses and may be more appropriate for different datasets that may be encountered. The code provided is intended to be a starting point for users to explore these tools and analyses, and should be adapted and modified as needed for specific datasets and research questions.

2. Preliminary Annotation of Cell Types and Evaluating Secondary Clustering

Cell type annotation is a common step in single-cell RNA-seq analysis that can assist in interpreting the biological significance of clusters, as well as providing a window into the distribution of cell types collected in the analysis. In this section, SingleR will be used as the primary tool, but we will also explore how manual annotation can be performed should the annotation references be insufficient.

SingleR is a tool that uses correlation to determine the most likely cell type for each cell or cluster based on reference datasets. It is most commonly used with its partner package celldex, which has a variety of mouse and human reference datasets. SingleR operates most efficiently when used on smaller datasets, but it can be completed on larger datasets, given enough time. This demonstration will be performed on a single sample, but all the samples provided for this tutorial have undergone similar analyses.

2.1 Using `SingleR` for Cell Type Annotation with a Standard Reference Dataset.

Start by importing one of the sample datasets:

seurat_obj <- readRDS("../data/homeostasis_d03_rep1_processed.rds")

SingleR is built to use a different data structure that has been optimized for single cell RNASeq analysis, called SingleCellExperiment. To use SingleR, we will need to convert our Seurat object into a SingleCellExperiment object. This can be done with the as.SingleCellExperiment() function from the SeuratWrappers package.

sce_obj = as.SingleCellExperiment(seurat_obj, assay = "SCT")

Here, we will use the MouseRNAseqData reference dataset from the celldex package, which is a general reference derived bulk RNA-seq obtained from multiple tissue sites. This reference dataset is not ideal for our lung tissue samples, but it will serve as a demonstration of how to use SingleR with a standard reference dataset. We will run the SingleR function in order to annotate the cell types, and then append the results to the metadata of our Seurat object for visualization.

#Load the reference dataset
ref <- celldex::MouseRNAseqData()
#Run SingleR
annotations <- SingleR::SingleR(test = sce_obj, ref = ref, labels = ref$label.main)
#Append the annotations to the metadata of the Seurat object
seurat_obj$mouseRNASeq_main <- annotations$pruned.labels

Direct Outputs of SingleR

SingleR produces a data frame containing the scores of each cell against each potential reference cell type, as well as the preliminary annnotation and a “pruned” annotation that is the result of a pruning step that removes annotations that are not significantly different from the next best annotation. The pruned.labels are the most commonly used output, but the other outputs can be useful for evaluating the results of the annotation and identifying potential misannotations. More details are available in the vignette.

Reference Labels in celldex

The celldex references include a variety of labels, including label.main, label.fine, and label.ont. The label.main labels are the most general, while the label.fine includes specific details, including the cell subtype labels (e.g. “Cd8+ effector T cells”) and occastionally the experiment from which the reference was derived. The label.ont labels split the difference by using cell ontology to define specific cell subtypes without differentiating between reference sources. However, label.ont will require a “translation” from ontology labels to cell type labels, which can be performed with packages such as ontoProc.

We can now visualize the results of the SingleR annotation by plotting with the DimPlot function from Seurat:

DimPlot(seurat_obj, group.by = "mouseRNASeq_main", label= TRUE, repel = TRUE) + 
  ggtitle("SingleR Annotation with MouseRNAseq Reference")

As you can see, the SingleR annotation with the MouseRNAseqData reference has some clear mislabeling, such as the inclusion of hepatocytes and kidney cells, which are not expected in lung tissue. This highlights the importance of using an appropriate reference dataset for annotation, as well as the need to evaluate the results of annotation critically. In the next section, we will explore how to use SingleR with a custom reference dataset that is more appropriate for our lung tissue samples.

2.2 Using `SingleR` for Cell Type Annotation with Custom Reference Datasets

As the MouseRNASeqData annotation demonstrated, a more precise dataset may be required for prroper cell type labeling. Here, we have included a reference obtained from the Tabula Muris database that is lung-specific, which should provide a more accurate annotation of cell types in our lung tissue samples.

Preparing Custom References for SingleR

When preparing a custom reference for SingleR, the reference will need to be in the form of a SingleCellExperiment object, with the counts log-normalized and scaled, as indicated in the vignette. The reference used in this example has already been prepared for usage. A miniature script has been included in the R folder specifically to convert a downloaded Tabula Muris reference to a SingleCellExperiment object that can be used with SingleR.

#Load the custom reference dataset
lung_ref <- readRDS("../data/tabulaMuris_lung_sce.rds")

#Peek at the labels of lung_ref and check that they are sufficient
names(lung_ref@colData) #Note that the metadata is under a different label

 [1] "orig.ident"          "channel"             "tissue"             
 [4] "subtissue"           "mouse.sex"           "mouse.id"           
 [7] "percent.ercc"        "percent.ribo"        "free_annotation"    
[10] "cell_ontology_class" "res.3"               "cluster.ids"        
[13] "cell_ontology_id"    "nCount_RNA"          "nFeature_RNA"       
[16] "ident"

table(lung_ref$cell_ontology_class)


                            alveolar macrophage 
                                            345 
                                         B cell 
                                            204 
ciliated columnar cell of tracheobronchial tree 
                                             49 
                             classical monocyte 
                                            161 
                                      leukocyte 
                                            152 
                          lung endothelial cell 
                                            462 
                                      mast cell 
                                             24 
                                   myeloid cell 
                                             87 
                            natural killer cell 
                                            824 
                         non-classical monocyte 
                                            220 
                                   stromal cell 
                                           2540 
                                         T cell 
                                            247 
                             type II pneumocyte 
                                             89

#Run SingleR with the custom reference
annotations_lung <- SingleR::SingleR(test = sce_obj, ref = lung_ref, labels = lung_ref$cell_ontology_class)

#Append the annotations to the metadata of the Seurat object  
seurat_obj$lung_ref_annotation <- annotations_lung$pruned.labels

In comparing the results of the SingleR annotation with the MouseRNAseqData reference and the Tabula Muris Lung reference, we can see that the Tabula Muris Lung reference provides a much more accurate annotation of cell types in our lung tissue samples, with labels that are more consistent with what we would expect to find in lung tissue.

DimPlot(seurat_obj, group.by = "mouseRNASeq_main", label= TRUE, repel = TRUE) + 
  ggtitle("SingleR Annotation with MouseRNAseq Reference")

DimPlot(seurat_obj, group.by = "lung_ref_annotation", label= TRUE, repel = TRUE) + 
  ggtitle("SingleR Annotation with Tabula Muris Lung Reference")

datatable(as.data.frame.matrix(table(seurat_obj$mouseRNASeq_main,seurat_obj$lung_ref_annotation)))

2.3 Using `SingleR` for Cluster-Based Annotation

In the event that a larger dataset is being analyzed, or if the cell-level annotation is producing more variety in labels within clusters than expected, it might be useful to perform a cluster-based annotation. In this procedure, the average expression profileof the cluster is used as the input, and the annotation is then assigned to all cells within the cluster. This will also be demonstrated in section 2.5 with the manual annotation of cell types, but here we will demonstrate how to perform this with SingleR and the Tabula Muris Lung reference.

cluster_annotations <- SingleR(test = sce_obj, ref = lung_ref, labels = lung_ref$cell_ontology_class, clusters = seurat_obj$seurat_clusters)

#Explore the results of the cluster-based annotation
print(cluster_annotations)

DataFrame with 20 rows and 4 columns
                            scores                labels delta.next
                          <matrix>           <character>  <numeric>
0   0.391418:0.368112:0.398721:... lung endothelial cell  0.2511680
1   0.384327:0.316892:0.446443:...          stromal cell  0.2356921
2   0.430774:0.406465:0.442655:... lung endothelial cell  0.2157956
3   0.593204:0.456331:0.232646:...    classical monocyte  0.0818991
4   0.378086:0.356728:0.370577:... lung endothelial cell  0.2010205
...                            ...                   ...        ...
15  0.382142:0.323154:0.513362:...    type II pneumocyte   0.216025
16  0.425155:0.349549:0.484323:...          stromal cell   0.145813
17  0.411376:0.330065:0.444796:...          stromal cell   0.190931
18  0.434307:0.382551:0.426494:... lung endothelial cell   0.114251
19  0.449521:0.422656:0.288217:...             mast cell   0.118768
            pruned.labels
              <character>
0   lung endothelial cell
1            stromal cell
2   lung endothelial cell
3      classical monocyte
4   lung endothelial cell
...                   ...
15     type II pneumocyte
16           stromal cell
17           stromal cell
18  lung endothelial cell
19              mast cell

#Match the cluster annotions to the clusters defined in the Seurat object and append to metadata
seurat_obj$lung_cluster_annotation <- cluster_annotations$pruned.labels[match(seurat_obj$seurat_clusters, rownames(cluster_annotations))]

#Visualize the cluster-based annotation 
clustPlot <- DimPlot(seurat_obj, group.by = "seurat_clusters", label= TRUE, repel = TRUE) + 
  ggtitle("Seurat Clusters")
clustAnnotPlot <- DimPlot(seurat_obj, group.by = "lung_cluster_annotation", label= TRUE, repel = TRUE) + 
  ggtitle("SingleR Cluster-Based Annotation with Tabula Muris Lung Reference")
ggarrange(clustPlot, clustAnnotPlot, ncol = 1, nrow = 2)

datatable(as.data.frame.matrix(table(seurat_obj$lung_ref_annotation,seurat_obj$lung_cluster_annotation)))

Note that the cluster-based annotations are more concentrated, as a result of collapsing the expression profiles. Depending on the dataset, this may be preferable to having a more granular cell-based annotation, but it is important to evaluate if this also masks important variation that may be otherwise visible.

2.4 Manual Annotation of Cell Types

Manual annotation of cell types can be performed by identifying marker genes for clusters and then using the expression of those marker genes to assign cell type labels to clusters. This can be done with the FindAllMarkers function in Seurat, which identifies marker genes for each cluster based on differential expression analysis. The resulting marker genes can then be used to assign cell type labels to clusters based on known markers for different cell types. In a mini example here, we will use known markers for B cells (Cd19), T cells (Cd3e), monocytes (Itgam [CD11b]), leukocytes (Ptprc [CD45]), endothelial cells (Pecam1), and stromal cells (Pdgfra) to perform a preliminary annotation.

#Set identity as base clusters
Idents(seurat_obj) <- "seurat_clusters"

gene_markers <- c("Cd19","Cd3e","Itgam","Ptprc","Pecam1","Pdgfra")
#Plot genes individually on UMAP
FeaturePlot(seurat_obj,features = gene_markers, ncol = 2, label = TRUE)

#Compare expression levels with violin plots
VlnPlot(seurat_obj,features=gene_markers, ncol=2)

#Compare average expressions in tabular form
as.matrix(AverageExpression(seurat_obj,assay="SCT",features=gene_markers)$SCT)

As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
First group.by variable `ident` starts with a number, appending `g` to ensure valid variable names
This message is displayed once per session.

                 g0          g1          g2          g3          g4         g5
Cd19   0.0000000000 0.000000000 0.003289474 0.034013605 0.000000000 0.00000000
Cd3e   0.0000000000 0.000000000 0.003289474 0.000000000 0.000000000 0.00000000
Itgam  0.0008680556 0.000000000 0.003289474 1.159863946 0.008403361 0.00000000
Ptprc  0.0034722222 0.008108108 0.036184211 4.897959184 0.025210084 0.02777778
Pecam1 3.6206597222 0.016216216 2.575657895 0.170068027 2.063025210 5.24074074
Pdgfra 0.0121527778 3.762162162 0.046052632 0.003401361 0.042016807 0.01388889
               g6          g7          g8         g9     g10        g11   g12
Cd19   0.91457286 0.000000000 0.000000000 0.00000000 0.00625 0.00000000 0.000
Cd3e   0.01005025 0.000000000 0.000000000 1.39880952 0.00000 0.00000000 0.000
Itgam  0.00000000 0.000000000 1.546511628 0.22023810 0.00000 0.00000000 0.000
Ptprc  2.49748744 0.005586592 6.941860465 5.49404762 0.04375 0.07586207 0.024
Pecam1 0.44221106 0.083798883 0.069767442 0.40476190 0.24375 0.39310345 0.472
Pdgfra 0.00000000 0.011173184 0.005813953 0.01190476 0.01875 0.21379310 1.376
              g13        g14        g15        g16       g17      g18       g19
Cd19   0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.000000 0.0000000
Cd3e   0.00000000 0.00000000 0.04411765 0.00000000 0.0000000 0.000000 0.0000000
Itgam  0.00000000 0.00000000 0.00000000 0.00000000 0.0000000 0.000000 0.3333333
Ptprc  0.01162791 0.00000000 0.04411765 0.00000000 0.0000000 0.000000 0.6666667
Pecam1 4.46511628 0.04166667 0.16176471 0.01492537 0.0483871 1.647059 0.6111111
Pdgfra 0.00000000 0.01388889 0.00000000 0.22388060 1.0806452 0.000000 0.0000000

#Assign labels and append to metadata
manual_annot <- vector(length = ncol(seurat_obj))
manual_annot[] <- NA
names(manual_annot) <- colnames(seurat_obj)
manual_annot[which(seurat_obj$seurat_clusters == 6)] <- "B cells"
manual_annot[which(seurat_obj$seurat_clusters == 9)] <- "T cells"
manual_annot[which(seurat_obj$seurat_clusters == 3)] <- "Monocytes"
manual_annot[which(seurat_obj$seurat_clusters == 8)] <- "Leukocytes"
manual_annot[which(seurat_obj$seurat_clusters %in% c(0,2,4,5,13))] <- "Endothelial cells"
manual_annot[which(seurat_obj$seurat_clusters == 1)] <- "Stromal cells"

seurat_obj$manual_annotation <- manual_annot
DimPlot(seurat_obj, group.by = "manual_annotation", label= TRUE, repel = TRUE) + 
  ggtitle("Manual Annotation of Cell Types")

Note that the manual annotation is dependent on how trustworthy the marker genes are for the cell types being annotated. The following examples highlight some of the limitations of this approach: 1. Ptprc is a more general catch-all marker for immune cells, and as such, is used to define leukocytes on an exclusionary basis, i.e. if they are not monocytes or B cells, leukocytes are what remains. 2. The stromal cell annotation is not as broad as in the SingleR annotations, which is likely due to the fact that the SingleR annotation is based on a reference dataset that includes a variety of stromal cell types, while the manual annotation is based on a single marker gene that may not capture the full diversity of stromal cells in the dataset. 3. Several cell clusters remain unannotated since we did not provide a marker gene for reference.

Another approach that can be seen below is to use FindAllMarkers to identify marker genes in the clusters, which can then be referenced against known markers for different cell types.

2.5 What Happens with a Bad Reference Dataset?

To demonstrate the importance of using an appropriate reference dataset, we can run SingleR with a reference dataset that is not relevant to our samples. In this example, we will use the HumanPrimaryCellAtlasData reference from celldex, which is derived from human primary cells and is not specific to mouse cells.

#Load the HPCA reference
hpca <- celldex::HumanPrimaryCellAtlasData()

#Run SingleR with the HPCA reference
annotations_hpca <- SingleR::SingleR(test = sce_obj, ref = hpca, labels = hpca$label.main)

#Append the annotations to the metadata of the Seurat object
seurat_obj$hpca_annotation <- annotations_hpca$pruned.labels

#Visualize the results of the HPCA annotation
DimPlot(seurat_obj, group.by = "hpca_annotation", label= TRUE, repel = TRUE) + 
  ggtitle("SingleR Annotation with HPCA Reference")

Since there are very few genes that have the same name in both human and mouse genome references, including capitalization, and the annotation is based on matching expression matrices based on gene names, the resulting annotation is extremely poor. SingleR diagnostics can provide more insight into the degree of failure, but it is obvious that this result is useless: a murine lung sample should not contain approximately 80% gametocytes.

3. Combining Samples and Batch Correction

This section will explore how to combine samples together for analysis, as well as how to perform batch correction to account for technical variation between samples. This is an important step in the analysis of single cell RNA-seq data, as it allows for the comparison of different conditions and the identification of shared and unique cell types across samples. This is being highlighted in particular to provide examples of the usage in the latest version of Seurat (version 5.0), which has some changes in the way that data is merged and batch corrected compared to previous versions. The code provided here is adapted from the Seurat integration vignette, which provides a comprehensive guide to the different methods of merging and batch correction available in Seurat.

3.1 Combining Samples with `merge()`

The first step in combining samples is to merge the Seurat objects together. This can be done with the merge() function, which will combine the count matrices and metadata of the different samples into a single Seurat object. The latest version of Seurat (version 5.0) uses a similar set up for merging samples as in previous versions, but

Memory allocation in single cell RNASeq projects

Many of the steps used in both sample combination and downstream differential expression can be very memory-intensive, and on some computers, this could lead to errors or warnings that look like this:

Found 8 SCT models. Recorrecting SCT counts using minimum median counts: 8146
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 58s
Error: vector memory limit of 16.0 Gb reached, see mem.maxVSize()
Error during wrapup: vector memory limit of 16.0 Gb reached, see mem.maxVSize()
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

This particular error indicates that there isn’t enough virtual memory allocated to run a process (in this case, PrepSCTFindMarkers). This can be corrected by adjusting the memory that R is allowed to access.

mem.maxVSize()
#[1] 16384
mem.maxVSize(vsize=16384*4)
#[1] 65536

#Increase memory limits
mem.maxVSize()

[1] 18432

mem.maxVSize(vsize=16384*4)

[1] 65536

#Import all processed samples
homeostasis_d03_rep1 <- readRDS("../data/homeostasis_d03_rep1_processed.rds")
homeostasis_d03_rep2 <- readRDS("../data/homeostasis_d03_rep2_processed.rds")
postinfection_d42_rep1 <- readRDS("../data/tam14_sac42_rep1_processed.rds")
postinfection_d42_rep2 <- readRDS("../data/tam14_sac42_rep2_processed.rds")

#Organize the samples into a list for merging
obj_list = list(homeostasis_d03_rep1, homeostasis_d03_rep2, postinfection_d42_rep1, postinfection_d42_rep2)

#Select variable features across samples
selectFeatures <- SelectIntegrationFeatures(object.list = obj_list, nfeatures = 3000)

#Merge the Seurat objects and assign variable features
merged_obj <- merge(obj_list[[1]], obj_list[2:length(obj_list)])

Warning: Some cell names are duplicated across objects provided. Renaming to
enforce unique cell names.

VariableFeatures(merged_obj) <- selectFeatures

#Dimensionality reduction and clustering on the merged object
so_merged <- RunPCA(merged_obj, verbose = FALSE) #Note the change in object name
so_merged <- FindNeighbors(so_merged,dims=1:50)

Computing nearest neighbor graph

Warning: package 'future' was built under R version 4.5.2

Computing SNN

so_merged <- RunUMAP(so_merged, dims=1:50, reduction = "pca")

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session

12:57:26 UMAP embedding parameters a = 0.9922 b = 1.112

12:57:26 Read 16348 rows and found 50 numeric columns

12:57:26 Using Annoy for neighbor search, n_neighbors = 30

12:57:26 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
12:57:27 Writing NN index file to temp file /var/folders/g1/x3w540zj3mjfvcl6lt72wj55w7mghm/T//RtmpfS2bPU/file73295492392a
12:57:27 Searching Annoy index using 1 thread, search_k = 3000
12:57:29 Annoy recall = 100%
12:57:30 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
12:57:31 Initializing from normalized Laplacian + noise (using RSpectra)
12:57:31 Commencing optimization for 200 epochs, with 700796 positive edges
12:57:31 Using rng type: pcg
12:57:36 Optimization finished

#Assign treatment groups
so_merged$group <- "Homeostasis_D3"
so_merged$group[grep("tam", so_merged$sample)] <- "Treated_D14"

#Take a look at the metadata columns available for visualization
colnames(so_merged@meta.data)

 [1] "orig.ident"            "nCount_RNA"            "nFeature_RNA"         
 [4] "sample"                "percent.mt"            "nCount_SCT"           
 [7] "nFeature_SCT"          "SCT_snn_res.0.5"       "seurat_clusters"      
[10] "MouseRNASeq_main"      "lung_tm_cell_ontology" "group"

#Visualize the outputs by sample, group, and cell ontology 
DimPlot(so_merged, group.by = "sample")

DimPlot(so_merged, group.by = "group")

DimPlot(so_merged, group.by = "lung_tm_cell_ontology", label = TRUE)

When visualizing the samples and the groups, we can see that they are very clearly mixed, which, when combined with the observation that the cells are separated by cell type annotation, indicates that the data is clean, and may not require greater batch correction. There are some separations based on treatment group more than sample, which may actually be an indicator of biological variation more than batch effect, depending on the experiment.

Note

REMEMBER THE BIOLOGY! The underlying biology of the experiment is critical in determining the necessity and application of batch correction. When running the experiment, it is on the user to decide if the data “makes sense.” This can look like clear separations in cell populations can be observed, but may or may not actually exist in the biological system.

3.2 Batch Correction with the `IntegrateLayers` function

The latest version of Seurat has updated its syntax to use Layers instead of slots, which is used under the hood to streamline some of the code that accesses the data. Consequently, integration (Seurat’s in-house term for batch correction) has now been reduced to one line, when previously it was a multistep process. We will look at two different approaches for batch correction here, and compare the outputs to the uncorrected data.

The first approach is Canonical Correlation Analysis, or CCA, which has been best known as the default method in Seurat with the IntegrateData function:

so_cca <- IntegrateLayers(object = so_merged, method = "CCAIntegration",  orig.reduction = "pca", new.reduction = "cca", normalization.method = "SCT")

Finding all pairwise anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

    Found 9747 anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

    Found 7991 anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

    Found 8585 anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

    Found 8298 anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

    Found 9160 anchors

Running CCA

Merging objects

Finding neighborhoods

Finding anchors

    Found 8016 anchors

Merging dataset 3 into 2

Extracting anchors for merged samples

Finding integration vectors

Finding integration vector weights

Integrating data

Merging dataset 4 into 2 3

Extracting anchors for merged samples

Finding integration vectors

Finding integration vector weights

Integrating data

Merging dataset 1 into 2 3 4

Extracting anchors for merged samples

Finding integration vectors

Finding integration vector weights

Integrating data

The second method is Harmony, which has been steadily increasing in usage in single cell RNASeq:

so_harmony <- IntegrateLayers(object = so_merged, method = "HarmonyIntegration", orig.reduction = "pca", new.reduction = "harmony", normalization.method = "SCT")

The `features` argument is ignored by `HarmonyIntegration`.
Transposing data matrix

Using automatic lambda estimation

Initializing state using k-means centroids initialization

Harmony 1/10

Harmony 2/10

Harmony 3/10

Harmony converged after 3 iterations

This message is displayed once per session.

Parameters to consider in integration

The vignette at the top of this page for integration goes into greater detail, but it should be briefly noted that when SCTransform is used as the normalization method, it should be specifically called out in the function. Additionally, it is strongly recommended to take advantage of the new.reduction parameter to create the new dimensionality reduction slot under the provided name, which will make downstream visualization more user-friendly.

Note that a single line of code is now all that is required to run the initial batch correction. The next section will write a function to complete the dimensionality reduction which can be used for visualization.

# This function will take in the Seurat object and the name of the reduction we created above and return the object with a new UMAP and one degree of clustering via the Louvain algorithm on the combined object
dim_reduc = function(seurat_obj, reduction_name){
  seurat_obj <- FindNeighbors(seurat_obj, reduction = reduction_name, dims=1:50)
  seurat_obj <- FindClusters(seurat_obj, resolution = 0.6, algorithm = 1)
  seurat_obj <- RunUMAP(seurat_obj, reduction = reduction_name, dims = 1:50)
  return(seurat_obj)
}

so_cca <- dim_reduc(so_cca, "cca")

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 16348
Number of edges: 633278

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9269
Number of communities: 30
Elapsed time: 1 seconds

13:03:56 UMAP embedding parameters a = 0.9922 b = 1.112

13:03:56 Read 16348 rows and found 50 numeric columns

13:03:56 Using Annoy for neighbor search, n_neighbors = 30

13:03:56 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
13:03:57 Writing NN index file to temp file /var/folders/g1/x3w540zj3mjfvcl6lt72wj55w7mghm/T//RtmpfS2bPU/file732913149739
13:03:57 Searching Annoy index using 1 thread, search_k = 3000
13:03:59 Annoy recall = 100%
13:04:00 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
13:04:01 Initializing from normalized Laplacian + noise (using RSpectra)
13:04:02 Commencing optimization for 200 epochs, with 713320 positive edges
13:04:02 Using rng type: pcg
13:04:06 Optimization finished

so_harmony <- dim_reduc(so_harmony, "harmony")

Computing nearest neighbor graph
Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 16348
Number of edges: 585297

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9285
Number of communities: 29
Elapsed time: 1 seconds

13:04:10 UMAP embedding parameters a = 0.9922 b = 1.112
13:04:10 Read 16348 rows and found 50 numeric columns
13:04:10 Using Annoy for neighbor search, n_neighbors = 30
13:04:10 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
13:04:11 Writing NN index file to temp file /var/folders/g1/x3w540zj3mjfvcl6lt72wj55w7mghm/T//RtmpfS2bPU/file73292cd428f8
13:04:11 Searching Annoy index using 1 thread, search_k = 3000
13:04:13 Annoy recall = 100%
13:04:13 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
13:04:14 Initializing from normalized Laplacian + noise (using RSpectra)
13:04:15 Commencing optimization for 200 epochs, with 707828 positive edges
13:04:15 Using rng type: pcg
13:04:19 Optimization finished

Visualizing the three plots side-by-side can help us determine when and if batch correction is warranted.

g1.1 <- DimPlot(so_merged,group.by="sample") + ggtitle("Merged: Samples")
g1.2 <- DimPlot(so_cca,group.by="sample") + ggtitle("CCA: Samples")
g1.3 <- DimPlot(so_harmony, group.by="sample") + ggtitle("Harmony: Samples")
ggarrange(g1.1,g1.2,g1.3,ncol=1,nrow=3)

g2.1 <- DimPlot(so_merged, group.by="group") + ggtitle("Merged: Treatment")
g2.2 <- DimPlot(so_cca,group.by="group") + ggtitle("CCA: Treatment")
g2.3 <- DimPlot(so_harmony, group.by="group") + ggtitle("Harmony: Treatment")
ggarrange(g2.1,g2.2,g2.3,ncol=1,nrow=3)

g3.1 <- DimPlot(so_merged, group.by="lung_tm_cell_ontology",label=T)+NoLegend() + ggtitle("Merged: Annotation")
g3.2 <- DimPlot(so_cca,group.by="lung_tm_cell_ontology",label=T)+NoLegend() + ggtitle("CCA: Annotation")
g3.3 <- DimPlot(so_harmony, group.by="lung_tm_cell_ontology",label=T)+NoLegend() + ggtitle("Harmony: Annotation")
ggarrange(g3.1,g3.2,g3.3,ncol=1,nrow=3)

Ironically, the visualizations of the two batch correction runs compared to the uncorrected run indicates that there is minimal change in response to the batch correction. This is particularly visible in the cell type annotation image, which shows a consistent clustering of cell types, compared to the widespread distribution of cells across samples. So what do we do with this information? In this particular case, we would probably continue our analysis with the uncorrected data. The general rule of thumb is to minimize any data manipulation so that we, as the users, can remain confident that the data is as “pure” as possible for the downstream analyses.

Compare this to the output of the Seurat vignette in clustering:

Before batch correction Uncorrected UMAPs of different scRNASeq technologies

After batch correction CCA and SCVI corrected UMAPs

It is apparent that in this vignette, the data is initially separated by the experimental technology, and batch correction appropriately reorganizes the cells into predicted cell types.

At this point, it is strongly recommended that the combined object is saved for future usage.

saveRDS(so_merged, "combined_object_noCorrection.rds")

What is batch correction used for?

The primary purpose for batch correction is dimensionality reduction and clustering. Batch correction ensures that similar cells are grouped together in the reduced dimensional space that is emphasized by the variable gene selection. It is NOT to be used as the basis for identifying differential genes, since the manipulation in the component space will alter the gene expression to the point where interpretation gets murky. The clusters that are derived from batch correction can be used to select the groups, but the differential expression analysis must be performed on the normalized counts, as demonstrated in the following section.

4. Differential Expression Analysis

Differential analysis is the core outcome of nearly every next generation sequencing project. In the context of scRNASeq, differential gene expression between different conditions or clusters of cells. We’ll explore two different applications here.

Always use the normalized counts for differential expression!

As emphasized in the previous section, the batch corrected values should not be used in the differential analysis, for several reasons: 1. The batch corrected values have undergone several transformations in scaling and normalization that may render the interpretations of the differential analysis unreliable. 2. Often, the batch correction is restricted to the most variable genes, which is normally a subset in the range of 3000 genes. Any genes outside this subset are not available for analysis. 3. Normalization methods, especially in Seurat, have been standardized and optimized for differential analysis and should not be further tinkered with, without a certain degree of expertise in genomics, mathematics, statistics, and quite possibly witchcraft.

4.1 Differential Expression Between Cell Types

In order to set up the Seurat object for differential expression, the parent condition of the comparison groups needs to be set as the primary identity. In this example, we will only be looking at differences between the two largest populations of cell types in the data set: lung endothelial cells and stromal cells. As discussed in the previous section, the downstream analysis will be conducted on the merged (i.e. uncorrected) data set.

#Set the identity of the Seurat object being analyzed
Idents(so_merged) <- "lung_tm_cell_ontology"
#Use Seurat's internal cross-sample preparation prior to differential expression
so_merged <- PrepSCTFindMarkers(so_merged)

Found 4 SCT models. Recorrecting SCT counts using minimum median counts: 2929

celltype_markers <- FindMarkers(so_merged, ident.1 = "lung endothelial cell", ident.2 = "stromal cell", test.use="wilcox")
head(celltype_markers, n=20)

        p_val avg_log2FC pct.1 pct.2 p_val_adj
Egfl7       0   5.342231 0.873 0.063         0
Cdh5        0   5.262672 0.887 0.081         0
Pecam1      0   5.292649 0.855 0.053         0
Adgrf5      0   4.668090 0.939 0.146         0
Calcrl      0   5.025127 0.922 0.130         0
Cldn5       0   5.360639 0.912 0.120         0
Cd36        0   5.203829 0.856 0.082         0
Ramp2       0   5.011474 0.866 0.099         0
Ctla2a      0   5.354077 0.838 0.075         0
Tspan7      0   5.544453 0.793 0.048         0
Bgn         0  -4.471981 0.101 0.843         0
Hpgd        0   5.322011 0.845 0.108         0
Ehd4        0   4.301481 0.881 0.149         0
Plac9a      0  -4.910593 0.068 0.799         0
Ptprb       0   5.060558 0.872 0.141         0
Cyyr1       0   6.089350 0.751 0.026         0
Slco2a1     0   5.256971 0.765 0.041         0
Cd93        0   5.691711 0.775 0.052         0
Gsn         0  -5.590359 0.165 0.883         0
Aqp1        0   4.039563 0.787 0.073         0

The results from the FindMarkers function in Seurat will follow the following conventions: Column 1: Gene name Column 2: Raw p-value Column 3: Average Log2 Fold Change Column 4: Percentage of cells in ident.1 expressing the gene at any level Column 5: Precentage of cells in ident.2 expressing the gene at any level Column 6: Adjusted p-value

It is not uncommon for the p-values to be extremely small, since each cell in the dataset is considered a replicate. At this level of granularity, the tests are comparing thousands of data points against each other, at which point the statistical tests tend to collapse. As such, identification of significant genes requires careful consideration of the fold change and level of expression, in addition to the adjusted p-value.

Differential expression options

In this example, we are using the Wilcoxon rank-sum test to identify differentially expressed genes. This is one of many differential expression algorithms that can be used; another common differential test is MAST, which is optimized for single cell analysis, but can also be very time- and memory-intensive. Other options can be explored by pulling up the help documentation for the FindMarkers function with the ?FindMarkers command.

If only ident.1 is defined, the differential expression will compare that identity to all other cells in the dataset. Another option that is popular is to use FindAllMarkers, which does not require defining the groups to be compared, but will take each category available under the identity and compare it to the rest of the dataset, thereby providing a list of distinct markers for each category.

4.2 Differential Expression Between Conditions

Occasionally, when exploring conditions, it is necessary to isolate the cell type being examined. This can be performed using the subset function. After subsetting, re-running some of the dimensionality reduction may assist in visual identification of separate clusters within the subset, as the isolation of the subset may remove other sources of variation. Here, we will look at the differences within stromal cells between the homeostasis samples and the treated samples.

endothelial_cells <- subset(so_merged,cells=colnames(so_merged)[which(so_merged$lung_tm_cell_ontology=="lung endothelial cell")])
#Rerun the FindNeighbors and RunUMAP functions
endothelial_cells <- FindNeighbors(endothelial_cells, dims=1:30)

Computing nearest neighbor graph

Computing SNN

endothelial_cells <- RunUMAP(endothelial_cells, dims=1:30)

13:05:20 UMAP embedding parameters a = 0.9922 b = 1.112

13:05:20 Read 8957 rows and found 30 numeric columns

13:05:20 Using Annoy for neighbor search, n_neighbors = 30

13:05:20 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

**************************************************|
13:05:20 Writing NN index file to temp file /var/folders/g1/x3w540zj3mjfvcl6lt72wj55w7mghm/T//RtmpfS2bPU/file7329e7f7c2
13:05:20 Searching Annoy index using 1 thread, search_k = 3000
13:05:21 Annoy recall = 100%
13:05:22 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
13:05:23 Initializing from normalized Laplacian + noise (using RSpectra)
13:05:23 Commencing optimization for 500 epochs, with 383756 positive edges
13:05:23 Using rng type: pcg
13:05:29 Optimization finished

Idents(endothelial_cells)<-"group"
DimPlot(endothelial_cells, group.by="group")

#`recorrect_umi` is set to FALSE because the PrepSCTFindMarkers should persist after subsetting.
endothelial_group_DE <- FindMarkers(endothelial_cells,ident.1 = "Treated_D14", ident.2 = "Homeostasis_D3", recorrect_umi=FALSE)
head(endothelial_group_DE,n=20)

                  p_val  avg_log2FC pct.1 pct.2     p_val_adj
Rps3a3     0.000000e+00 -12.4579760 0.000 0.427  0.000000e+00
Rps3a1     0.000000e+00   1.5816838 0.920 0.550  0.000000e+00
Rps24     1.005729e-274   0.9460648 0.973 0.893 1.943369e-270
mt-Nd3    1.143800e-238   1.3564315 0.802 0.475 2.210165e-234
Ubb       2.819430e-221   1.2624217 0.746 0.480 5.447984e-217
Rpl35     6.852486e-211   0.9719192 0.840 0.610 1.324106e-206
Uba52     1.423512e-210   1.7997187 0.494 0.201 2.750652e-206
Tpm3-rs7  1.290398e-185  -3.7848420 0.027 0.231 2.493436e-181
Rpl27a    3.530934e-180   0.8818543 0.826 0.602 6.822824e-176
mt-Rnr1   1.867320e-178   0.8951152 0.994 0.989 3.608222e-174
Tpt1      9.777175e-178   0.6671958 0.990 0.981 1.889243e-173
Rpl13     1.190797e-177   0.7316094 0.961 0.895 2.300977e-173
Hspa1a    3.063884e-171   1.4217246 0.550 0.264 5.920344e-167
Rps24-ps3 1.200293e-159  -6.9692478 0.001 0.153 2.319326e-155
Xist      4.042901e-156  -0.6969081 0.472 0.806 7.812097e-152
Cyp1a1    1.005678e-154  -4.8425950 0.012 0.173 1.943271e-150
mt-Rnr2   9.785608e-146   0.8285536 0.996 0.997 1.890873e-141
Rps23     9.169957e-145   0.6589460 0.954 0.897 1.771911e-140
Cd74      4.541927e-140   1.0964570 0.699 0.449 8.776366e-136
Rps21     5.813278e-134   0.6303147 0.951 0.903 1.123300e-129

4.3 Pseudobulk Differential Analysis

One option to avoid the p-value “inflation” described above is to perform pseudobulk differential analysis. As the name implies, pseudobulk attempts to simulate bulk RNASeq analysis by collapsing the reads in each replicate. The added advantage to this is that the scRNA results can be used as a filter of sorts, which for standard bulk RNASeq, can only be conducted through purification steps, such as if an analysis was aimed strictly at T cells. One potential drawback is that this will require replicates, much like bulk RNASeq, in order to generate any sort of statistical relevance. Here, we have two replicates per condition, which is not ideal, but it is enough for this demonstration. The process starts with AggregateExpression to total the read counts from each cell in the cohorts, and then we can run a standard bulk RNASeq DE tool, such as DESeq2 or limma-voom.

#As many identities needed for separation can be included in the `group.by` parameter
bulk_endothelial <- AggregateExpression(endothelial_cells,assays="RNA",return.seurat=TRUE,group.by=c("sample","group"))

Names of identity class contain underscores ('_'), replacing with dashes ('-')
This message is displayed once every 8 hours.

Idents(bulk_endothelial)<-"group"
bulk_endothelial_DE <- FindMarkers(bulk_endothelial, ident.1 = "Treated-D14", ident.2 = "Homeostasis-D3", test.use = "DESeq2", min.cells.group = 2) #The min.cells.group parameter has to be invoked to avoid an error for fewer than 3 replicates

converting counts to integer mode
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates

datatable(head(bulk_endothelial_DE, n=50))

datatable(head(endothelial_group_DE, n=50))

One of the more obvious results in comparing the pseudobulk DE and the standard scRNASeq DE is that the standard run produced a lot of ribosomal genes (beginning with “Rp”). Even so, we can see that many of the non-ribosomal genes are repeated, including elevation of Cd74, Hspa1a (HSP70), and Gbp2b in the treated group, all of which are indicators of an immune response.

We can delve further into the scRNASeq DE results by excluding the ribosomal genes with a regular expression:

head(endothelial_group_DE[-grep("^Rp",rownames(endothelial_group_DE)),], n=50)

                 p_val avg_log2FC pct.1 pct.2     p_val_adj
mt-Nd3   1.143800e-238  1.3564315 0.802 0.475 2.210165e-234
Ubb      2.819430e-221  1.2624217 0.746 0.480 5.447984e-217
Uba52    1.423512e-210  1.7997187 0.494 0.201 2.750652e-206
Tpm3-rs7 1.290398e-185 -3.7848420 0.027 0.231 2.493436e-181
mt-Rnr1  1.867320e-178  0.8951152 0.994 0.989 3.608222e-174
Tpt1     9.777175e-178  0.6671958 0.990 0.981 1.889243e-173
Hspa1a   3.063884e-171  1.4217246 0.550 0.264 5.920344e-167
Xist     4.042901e-156 -0.6969081 0.472 0.806 7.812097e-152
Cyp1a1   1.005678e-154 -4.8425950 0.012 0.173 1.943271e-150
mt-Rnr2  9.785608e-146  0.8285536 0.996 0.997 1.890873e-141
Cd74     4.541927e-140  1.0964570 0.699 0.449 8.776366e-136
Hspb1    8.708689e-131  1.4015048 0.523 0.285 1.682780e-126
Gbp2b    4.465077e-120 -6.6588416 0.001 0.118 8.627868e-116
Scgb1a1  3.406822e-117 -1.9133741 0.184 0.386 6.583002e-113
SiteB    2.016929e-105  2.2789523 0.328 0.151 3.897312e-101
Vim      2.495240e-105  0.6515400 0.926 0.844 4.821552e-101
Klf2     7.147482e-103  0.5530098 0.936 0.817  1.381108e-98
Gm10076  7.796334e-103  1.5444308 0.298 0.116  1.506486e-98
Ddx3y    6.434694e-101  9.1173564 0.097 0.000  1.243376e-96
Dynll1   5.417891e-100  0.5808176 0.839 0.648  1.046899e-95
Fau       6.982363e-99  0.5024963 0.957 0.927  1.349202e-94
Ddx3x     1.764944e-91 -0.6104509 0.613 0.786  3.410402e-87
H3f3a     1.185058e-90  0.4744317 0.902 0.778  2.289887e-86
Hspa1b    1.487259e-90  1.0772155 0.465 0.266  2.873830e-86
Fus       4.370940e-89 -0.6613164 0.593 0.763  8.445967e-85
Hsp90ab1  1.388752e-88  0.4396815 0.976 0.949  2.683485e-84
S1pr1     1.723727e-87  0.6229701 0.763 0.593  3.330758e-83
Eef1a1    1.781687e-87  0.4140337 0.984 0.973  3.442753e-83
Cd9       5.205205e-86  0.5263754 0.806 0.613  1.005802e-81
Iigp1     6.534946e-78 -1.6047647 0.225 0.386  1.262748e-73
Naca      1.120758e-72  0.5636949 0.725 0.556  2.165640e-68
Cav1      1.127768e-72  0.4981342 0.891 0.738  2.179186e-68
Eif1      6.571687e-71  0.3575360 0.970 0.930  1.269847e-66
Sema3c    9.215150e-71  0.5811113 0.825 0.652  1.780644e-66
S100a10   2.045689e-68  0.8197572 0.485 0.316  3.952885e-64
Myl6      1.615978e-67  0.3685493 0.950 0.887  3.122554e-63
Cox7c     2.235938e-65  0.5915378 0.639 0.459  4.320503e-61
Retnla    4.877968e-65 -4.4930798 0.004 0.073  9.425697e-61
Hspa8     5.845627e-65  0.3964812 0.911 0.808  1.129551e-60
Ifitm2    4.595612e-64  0.4178729 0.891 0.772  8.880102e-60
Tmem252   2.580587e-63  1.3840250 0.232 0.101  4.986468e-59
Epas1     2.362254e-61 -0.3732225 0.968 0.976  4.564584e-57
Tcim      3.381336e-61  0.9054496 0.394 0.239  6.533755e-57
Ctla2a    8.005311e-61  0.4702063 0.903 0.773  1.546866e-56
Ppia      5.509536e-60  0.3534653 0.956 0.898  1.064608e-55
Tmsb4x    3.897785e-57  0.2150742 0.988 0.981  7.531691e-53
Cxcl12    1.021704e-56  0.6569573 0.618 0.457  1.974238e-52
Rack1     2.188240e-56  0.4776253 0.824 0.733  4.228336e-52
Clic1     4.040476e-56  0.5289241 0.658 0.506  7.807412e-52
Gstp1     4.365991e-56  1.5489528 0.162 0.058  8.436404e-52

5. Additional Analysis Tools

After differential expression, analysis can go in a near-infinite number of directions in order to answer the scientific question at heart. We will highlight a few common requests, but these are in no way the only available options as far as tools nor scope.

5.1 Pathway Analysis

A frequent question after differential expression is “How do these differentially expressed genes interact to produce our biological outcome?” This can be explored through pathway and gene set enrichment analysis. Several methods are commonly used and will be explored in this section.

5.1.1 Module scoring

A de novo approach is to utilize module scoring, which has an accompanying function in Seurat. In this technique, the investigator has gathered a set of genes that defines a particular subset, and is curious about how this set is represented in the data. The AddModuleScore function in Seurat takes the average expression of the genes of interest and compares their expression to a random background selection to provide a score for the gene module’s overexpression

# A selection of genes that defines the treatment group endothelial cells
gene_module = c("Ubb","Uba52","Cd74","Vim","Hspa1a","Xist","Hspb1")

#applying the module score to the entire data set
so_merged = AddModuleScore(so_merged,features=list(gene_module), name = "TestModule")
#Multiple gene modules can be added inside the list, and each module will be 
#saved under the name chosen, with a numeric follower, e.g. TestModule1

#visualizing the module score in this set
Idents(so_merged) <- so_merged$lung_tm_cell_ontology
FeaturePlot(so_merged,features="TestModule1",order= T, label=T)

#This can be cleaned up by restricting the score limits to between the 5th and
#95th quantiles
FeaturePlot(so_merged,features="TestModule1",order= T, label =T, repel =T,
            min.cutoff ="q05", max.cutoff = "q95")

5.1.2 Over-representation analysis

Over-representation analysis (ORA) is method of determining if a gene set is particularly “over-represented” in a selection of genes. One way of visualizing ORA is to imagine a Venn diagram, where one circle is the list of differentially expressed genes, and the second circle is the list of genes in the gene set of interest. ORA can then be used to calculate a statistic for the intersection where the two circles meet. Here, we will look at ORA using the hypergeometric test; another test that is often used is the Fisher’s exact test. We will take the top 50 over-represented genes from our differential expression, and see if they are likely members of the gene set associated with the complement (immune response) gene set.

#Select the top 50 overrepresented genes 
DE_genes = grep("Rp",rownames(bulk_endothelial_DE[which(bulk_endothelial_DE$avg_log2FC>0),]),invert=T,value=T)[1:50]

#Obtain the Hallmark GSEA signatures (included here) from 
#https://www.gsea-msigdb.org/gsea/downloads.jsp and import the Hallmark dataset
#with the GSEABase package
hallmark = GSEABase::getGmt("../data/mh.all.v2026.1.Mm.symbols.gmt")
#Select the genes in the Complement gene set
complement_genes = hallmark[["HALLMARK_COMPLEMENT"]]@geneIds

#Use the phyper function to calculate the likelihood of the overlap between the top 50 genes and the complement pathway
overlap = length(intersect(DE_genes,complement_genes)) #4 genes
complement_gene_size = length(complement_genes) #185 genes
DE_gene_size = length(DE_genes) #50 selected genes
genome_size = nrow(bulk_endothelial_DE) #23,436 genes

#We set lower.tail to FALSE because we want to calculate the likelihood that we would obtain more genes (or enrichment)
hypergeometric_stat = phyper(q = overlap, m = complement_gene_size, 
     n = (genome_size-complement_gene_size), k = DE_gene_size, lower.tail = FALSE)

print(hypergeometric_stat)

[1] 4.612604e-05

Essentially, what we calculate:

If we randomly choose 50 genes from >23,000 genes in the data;
And of these 50 genes, 4 or more genes exist in the Complement gene set;
Which is comprised of 185 genes
The likelihood of this occurring is 4.61E-05

Although a Venn diagram of 46/4/181 does not seem like much, the odds of this happening given the >23,000 genes that could be randomly selected drive the significance of the result.

One drawback of overrepresentation analysis is the somewhat arbitrary selection of significant genes, in that the criteria for identifying genes is determined by the investigator. If the number of selected genes is increased to 500 genes, the likelihood of the overlap becomes more susceptible to chance.

5.1.3 Gene set enrichment analysis

As hinted at in the ORA section, several tools are available for interacting with gene set enrichment analysis (GSEA). Its best-known version was published in 2012 and is still maintained at the Broad Institute (https://www.msigdb.org). The classic version was designed for microarray and subsequently bulk RNASeq, with calculations based on the average expression of replicates. This application becomes unwieldy at the single cell level, since each cell is considered an individual replicate, and the gene expression is not as comprehensive for technical reasons.

For this example, we will again use the GSEABase package to retrieve the Hallmark gene sets from MSigDB, and we will implement fGSEA to intersect the differential genes with gene set enrichment. One particular flavor of GSEA that is more readily applicable to single cell data is pre-ranked GSEA, in which the genes are sorted by a user-defined metric, such as log2 fold change, and the order of the genes is used to determine the likelihood of a gene set being enriched.

hallmark_gene_list <- geneIds(hallmark)
rankings <- endothelial_group_DE$avg_log2FC
names(rankings)<-rownames(endothelial_group_DE)
fgsea_hallmark <- fgsea(
  pathways = hallmark_gene_list,
  stats = rankings,
  nperm = 10000 #Number of permutations. More permutations lead to more stable p-values
)

Warning in fgsea(pathways = hallmark_gene_list, stats = rankings, nperm =
10000): You are trying to run fgseaSimple. It is recommended to use
fgseaMultilevel. To run fgseaMultilevel, you need to remove the nperm argument
in the fgsea function call.

Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (12.67% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.

datatable(fgsea_hallmark[order(fgsea_hallmark$padj),])

#Generating the classic GSEA image for one of the pathways
plotEnrichment(hallmark_gene_list[["HALLMARK_OXIDATIVE_PHOSPHORYLATION"]],rankings) +
  labs(title="HALLMARK OXIDATIVE PHOSPHORYLATION")

Another popular ranking metric is -log10(pval) * sign(avg_log2FC), which will use the p-value as the primary driver of the ranking and use the sign of the fold change to ensure that the gene is assigned to the appropriate direction.

The results can be interpreted in a combination of the p-value for significance, and the enrichment score for direction. Positive enrichment scores favor the “positive” identity, which in this case is the treated cells, ident.1 in the original DE analysis. From these results, we can interpret that based on the log2 fold change rankings, MYC targets and oxidative phosphorylation are significantly enriched in the treated cohort, while interferon alpha and interferon gamma responses are moderately subdued, with adjusted p-values beyond the typical 0.05 threshold. Interestingly, TNFA-NFKB signaling is moderately enriched in the treated cohort; a search of the opposing immune responses suggests that this behavior is often observed in TNF-mediated autoimmune diseases, as well as recovery from influenza infection (https://doi.org/10.1186/ar3150).

Additional tools that can analyze gene sets at the per-cell or per-cluster level include ssGSEA, GSVA, and AUCell.

5.2 Trajectory Analysis

Trajectory analysis is commonly used for evaluating pseudo-time course experiments, in which we might be interested in seeing the developmental growth of cell populations. Unless the data are collected in a time course experiment, the cells represent a freeze-frame, but some transitions states can be inferred from gradual differences in the transcriptional profiles.

Here we will demonstrate the usage of Monocle3, developed by the Trapnell lab, which has a convenient wrapper to interact with Seurat data, following the SeuratWrapper vignette. Other tools for trajectory and pseudotime analysis include PAGA, slingshot, and TSCAN. We will isolate the immune cells, and in theory, a path can be established from the least developed to the most mature cells.

#Subset the immune cells by excluding all other types
immune_cells = subset(so_merged,
  cells=colnames(so_merged)[which(so_merged$lung_tm_cell_ontology %in% c("ciliated columnar cell of tracheobronchial tree",
    "lung endothelial cell","stromal cell","type II pneumocyte",NA))],
  invert=TRUE
)

#Reshape and visualize the immune cells
immune_cells <- FindNeighbors(immune_cells,dims=1:30)
immune_cells <- RunUMAP(immune_cells,dims=1:30)
DimPlot(immune_cells,group.by="lung_tm_cell_ontology", label = TRUE)

DimPlot(immune_cells,group.by="group")

#Prepare the immune cells for usage in Monocle3 (CellDataSet structure)
immune_cds <- as.cell_data_set(immune_cells)
immune_cds <- cluster_cells(immune_cds,k=8) #Monocle3 requires its own clustering and partitioning to run independently of any other tools
plot_cells(immune_cds, color_cells_by="cluster", label_cell_groups=TRUE, show_trajectory_graph = FALSE)

plot_cells(immune_cds, color_cells_by="partition", show_trajectory_graph = FALSE)

As we can see, the cell types are strictly partitioned, which makes sense; not all the cell types should be connected. We’ll still try to “force” a connection in the trajectory, with the ideal outcome being the identification of the cell type of origin and the terminal branches in immune cell differentiation. The starting point is required in order to run a pseudotime analysis.

immune_cds <- learn_graph(immune_cds, use_partition = FALSE)


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%

plot_cells(immune_cds, show_trajectory_graph = TRUE, color_cells_by="lung_tm_cell_ontology", label_cell_groups = TRUE)

#Select a starting point at a semi-random cell in the myeloid population
set.seed(42)
start_cell = sample(colnames(so_merged)[which(so_merged$lung_tm_cell_ontology == "myeloid cell")], 1)
immune_cds <- order_cells(immune_cds,root_cells = start_cell)
plot_cells(immune_cds, color_cells_by="pseudotime", label_cell_groups=TRUE)

One of the more interesting results shown here are 2 distinct populations identified as B cells that have differing points of origin; the lower and larger population in the bottom right of the UMAP (nodes 5 and 6) branches directly from the myeloid population, while the upper and smaller population (node 3) is connected more logically to the T and NK cell populations.

Clearly, there is a distinction between the T and NK cells (adaptive immune system) and the innate immune system. In order to attain more precision in the pseudotime, one option is to tweak the parameters used in the clustering, partitioning, and points of origin. Another option would be to further subset the data to isolate the cell types of interest, and re-run the trajectory and pseudotime.

5.3 CNV Inference

In cancer biology, copy number variation can be a strong indicator of which cells are tumor cells. We will not explore the methods here today since they can be very computationally heavy, but rather provide a list of tools that can be used with scRNASeq data. Copy number variation is canonically determined using a deeper sequencing method than RNASeq, such as whole genome DNA sequencing, but it can be inferred or implied based on the raw RNA counts of genes transcribed from shared chromosomal regions.

inferCNV: Long considered one of the standard CNV inference tools, inferCNV is a comprehensive tool that asks for the user to identify a normal cell population that can be used as a background expression reference. After normalizing the regional counts against the background, the CNV profile can be determined in the potential tumor cells, and then clustered to identify subpopulations within the tumor cells. NOTE: As of about 3 months ago (Nov 2025), a notice has been posted to the GitHub page that inferCNV is no longer supported
CopyKAT: CopyKAT uses Bayesian approaches to identify genome-wide aneuploidy in single cell data to distinguish tumor cells from normal cells. The underlying principle assumes that a basic level of RNA counts can be calculated from the data set as a whole, and then tumor cells can be identified based on how much they vary from the basal level.
SCEVAN: SCEVAN starts with the raw RNA counts and identifies likely tumor cells based on a variational algorithm that uses “an optimization-based joint segmentation algorithm.” In layman’s terms, it separates cells based on similarity in likely copy number profile, and then classifies them as tumor or normal clusters, with the tumor clusters being assumed into a clonal population.