Getting Started with Seurat: Differential Expression and Classification
1. Introduction and Learning Objectives
This tutorial has been designed to demonstrate common secondary analysis steps in a scRNA-Seq workflow. We will start with a merged Seurat Object with multiple data layers representing multiple samples that have already been filtered and undergone preliminary clustering.
Throughout this tutorial we will
- Explore setting and visualizing identities in a single cell dataset\
- Perform differential expression analysis through Seurat\
- Use differentially expressed genes to classify cells\
- Run a case test of cell type annotation using SingleR
This tutorial largely follows the standard unsupervised clustering workflow by Seurat and the differential expression testing vignette, with slight deviations and a different data set.
Warning
This document in no way represents a complete analysis. Here, we are demonstrating many of the standard steps in Seurat. Results here would be considered preliminary, requiring greater care and consideration of biology.
The data
In this tutorial, we will continue to use data from Nanduri et al. 2022, Epigenetic regulation of white adipose tissue plasticity and energy metabolism by nucleosome binding HMGN proteins, published in Nature Communications.
As a reminder, this study examined "the role of HMGNs in white adipocyte browning by comparing wild-type (WT) mice and cells to genetically derived mice and cells lacking the two major members of the HMGN protein family, HMGN1 and HMGN2 (DKO mice).". The experimental design included WT vs DKO mice, 2 replicates each, at two time points, 0 and 6 days. At day 0, cells were in a preadipocyte state, while at day 6 they had differentiated into adipocytes. Each time point had two replicates of each condition (WT vs DKO), for a total of 8 samples.
An R project directory containing these data and associated files can be downloaded here.
Consider the biology in your analysis.
The biology of your experiment will inform your analysis. While scRNA-Seq can be exploratory, you should have some general idea of the cell types that you expect. Knowing some characteristics of expected cells, for example, cell size or whether cells are complex or more energetically active is helpful for data processing.
For example, in this experiment, we expect cells transitioning from preadipocytes to adipocytes and then beige adipocytes.
Load the packages
library(tidyverse) # dplyr and ggplot2; CRAN
library(Seurat) # Seurat toolkit; CRAN
library(hdf5r) # for data import; CRAN
library(patchwork) # for plotting; CRAN
library(presto) # for differential expression; Github
library(glmGamPoi) # for sctransform; Bioconductor
library(ggplot2) # for visualization; installed with tidyverse
library(dplyr) #just to make sure it was called; installed with tidyverse
library(SingleR) # for cell type annotation; Bioconductor
library(celldex) # for cell type annotation reference; Bioconductor
library(MAST) # for differential expression; Bioconductor
Load the Seurat Object
Here, we will start with the data stored in a Seurat object. For instructions on data import and creating the object, see an Introduction to scRNA-Seq with R (Seurat) and Getting Started with Seurat: QC to Clustering.
adp <- readRDS("../outputs/adp_merge_filt_sctran_clust0.1.rds")
mem.maxVSize()
[1] 16384
mem.maxVSize(vsize=16384*4)
[1] 65536
Tip
Many of the steps used in differential expression can be very memory-intensive, and on some computers, this could lead to errors or warnings that look like this:
Found 8 SCT models. Recorrecting SCT counts using minimum median counts: 8146
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 58s
Error: vector memory limit of 16.0 Gb reached, see mem.maxVSize()
Error during wrapup: vector memory limit of 16.0 Gb reached, see mem.maxVSize()
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
This particular error indicates that there isn't enough virtual memory
allocated to run a process (in this case, PrepSCTFindMarkers
). This
can be corrected by adjusting the memory that R is allowed to access.
mem.maxVSize()
#[1] 16384
mem.maxVSize(vsize=16384*4)
#[1] 65536
For more details, please see the documentation on memory allocation.
Examine the object:
glimpse(adp)
Formal class 'Seurat' [package "SeuratObject"] with 13 slots
..@ assays :List of 2
.. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
.. ..$ SCT:Formal class 'SCTAssay' [package "Seurat"] with 9 slots
..@ meta.data :'data.frame': 42114 obs. of 11 variables:
.. ..$ orig.ident : chr [1:42114] "D10" "D10" "D10" "D10" ...
.. ..$ nCount_RNA : num [1:42114] 11188 32832 28515 18954 27181 ...
.. ..$ nFeature_RNA : int [1:42114] 3383 5823 5002 4305 4586 5015 3927 4006 5489 4093 ...
.. ..$ condition : chr [1:42114] "DKO" "DKO" "DKO" "DKO" ...
.. ..$ time_point : chr [1:42114] "Day 0" "Day 0" "Day 0" "Day 0" ...
.. ..$ cond_tp : chr [1:42114] "DKO Day 0" "DKO Day 0" "DKO Day 0" "DKO Day 0" ...
.. ..$ percent.mt : num [1:42114] 3.16 1.8 1.39 1.89 1.33 ...
.. ..$ nCount_SCT : num [1:42114] 23551 25705 25491 23790 25367 ...
.. ..$ nFeature_SCT : int [1:42114] 3843 5809 5002 4306 4586 5015 3967 4038 5425 4181 ...
.. ..$ SCT_snn_res.0.1: Factor w/ 8 levels "0","1","2","3",..: 8 1 1 1 1 1 8 1 3 7 ...
.. ..$ seurat_clusters: Factor w/ 8 levels "0","1","2","3",..: 8 1 1 1 1 1 8 1 3 7 ...
..@ active.assay: chr "SCT"
..@ active.ident: Factor w/ 8 levels "0","1","2","3",..: 8 1 1 1 1 1 8 1 3 7 ...
.. ..- attr(*, "names")= chr [1:42114] "D10_AAACCCAAGATGCTTC-1" "D10_AAACCCAAGCTCTTCC-1" "D10_AAACCCAAGTCATCGT-1" "D10_AAACCCAGTAGCGTCC-1" ...
..@ graphs :List of 2
.. ..$ SCT_nn :Formal class 'Graph' [package "SeuratObject"] with 7 slots
.. ..$ SCT_snn:Formal class 'Graph' [package "SeuratObject"] with 7 slots
..@ neighbors : list()
..@ reductions :List of 2
.. ..$ pca :Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
.. ..$ umap:Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
..@ images : list()
..@ project.name: chr "Adipose"
..@ misc : list()
..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
.. ..$ : int [1:3] 5 0 1
..@ commands :List of 5
.. ..$ SCTransform.RNA :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
.. ..$ RunPCA.SCT :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
.. ..$ FindNeighbors.SCT.pca:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
.. ..$ FindClusters :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
.. ..$ RunUMAP.SCT.pca :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
..@ tools : list()
Reviewing the Data
As we can see above, the glimpse
command shows the metadata that can
be used to classify the cells. Within Seurat, the metadata is used to
define the "identity" of the dataset. This is critical, as the labels
within the categories will be used to determine which groups of cells
are being compared. One method of viewing the available labels is to use
the table
function, which lists the frequency of each label:
table(adp$cond_tp)
DKO Day 0 DKO Day 6 WT Day 0 WT Day 6
11655 10512 11807 8140
The table
function can also be used in two dimensions:
table(adp$cond_tp,adp$orig.ident)
D10 D16 D20 D26 W10 W16 W20 W26
DKO Day 0 6990 0 4665 0 0 0 0 0
DKO Day 6 0 4674 0 5838 0 0 0 0
WT Day 0 0 0 0 0 6780 0 5027 0
WT Day 6 0 0 0 0 0 3781 0 4359
Before running any sort of differential expression, the identity of the Seurat object needs to be explicitly defined. Different identities can be assigned, based on what needs to be analyzed. Recall this visualization of the original identities, clusters, time point, experimental condition, and combined time point and condition:
DimPlot(adp, reduction = "umap", group.by = c("orig.ident", "seurat_clusters","time_point","condition","cond_tp"),
alpha=0.4, ncol=2)
Here we will set the identity to the clusters, and view the frequency of the resulting cell identities:
Idents(adp) = "SCT_snn_res.0.1"
table(Idents(adp))
0 1 2 3 4 5 6 7
16954 7635 6731 6481 2326 1085 556 346
Alternate identity classification
If there are identities defined in another vector, the vector values can be explicitly assigned to the dataset, without adding the identities to the metadata itself:
clusters=adp$SCT_snn_res.0.1
Idents(adp)=clusters
However, if the identities are overwritten, you will need to call the vector again, unless the vector is added to the metadata.
2. Differential Expression
What is differential expression analysis?
Differential expression analysis is the process of identifying genes that have a significant difference in expression between two or more groups. For many sequencing experiments, regardless of methodology, differential analysis lays the foundation of the results and any biological interpretation.
Different tools can be used for differential gene expression analysis, depending on the underlying statistics. One of the best known statistical tests is the Student's t-test, which tests for how likely it is that two sets of measurements come from the same normal distribution. However, this does assume that the data is normally distributed, meaning that the data has the distinctive bell curve shape. Should this assumption be inconsistent, a non-parametric version of this test is the Wilcoxon rank-sum test, which tests how likely it is that one population is larger than the other, regardless of probability distribution size. Alternative distribution shapes can include bimodal or multimodal distributions, with multiple distinct peaks, and skewed distributions with data that favors one side or another (examples).
For many sequencing experiments, the assumption of normality is rarely guaranteed. Some of the more popular tools for bulk RNASeq experiments, such as DESeq2, limma, and edgeR, acknowledge this, and use different statistical models to identify and interpret differences in gene expression.
Single cell RNASeq is notorious for not having a normal distribution of gene expression. For technical reasons, scRNASeq has a transcript recovery rate between 60%-80%, which naturally skews the gene expression towards the non-expressed end.
plot(density(sample(JoinLayers(adp@assays$RNA)$count["Gapdh",],2500)),cex=0,lty=1, main="Density of Gapdh in 2500 random cells")
hist(sample(JoinLayers(adp@assays$RNA)$count["Gapdh",],2500),breaks=99,main="Histogram of Gapdh in 2500 random cells",ylab="Frequency",xlab="Gene counts")
Note
Viewing the code, JoinLayers
was called to link all the data sets
together. This is a new feature of Seurat5, and is required for
analyzing data after integration and batch correction
(ref).
The Seurat tool acknowledges this, and by default uses the Wilcoxon
rank-sum test to identify differentially expressed genes, via the
presto
package. Another algorithm that is often used is MAST, which
implements a hurdle model to determine the likelihood that a gene is
genuinely not expressed in a cell, or if it is more likely a technical
dropout.
Running Differential Expression in Seurat
The primary means of running differential expression in Seurat is
through the FindMarkers
function. The main usage for this function is
as follows:
FindMarkers(object,ident.1= ..., ident.2=..., test.use="wilcox", min.pct = 0.01, logfc.threshold = 0.1)
object
: The Seurat object being examinedident.1
: The first or main label in the comparison. Positive fold change values indicate upregulation inident.1
ident.2
: The second label in the comparison. Ifident.2
is not defined, it is automatically set to all remaining cells in the comparison. In contrast toident.1
, negative fold changes indicated upregulation inident.2
.test.use
: The statistical test that is used to compare genes betweenident.1
andident.2
. The default values is the Wilcoxon rank-sum testmin.pct
: The minimum fraction of cells in at least one of the two groups that must express the gene in order to be compared. If this value is not exceeded in both populations, the gene is excluded from the differential expression analysis.logfc.threshold
: The minimum difference in average fold change between the two populations before the gene is evaluated. If the gene is below the threshold, i.e. the expression between the populations is too similar, the gene will excluded from the DE analysis.
In the event that a user wants to recover DE statistics for all genes,
min.pct
and logfc.threshold
can be set to 0. Additional parameters,
including other statistical tests, can be found in the Help menu for
FindMarkers
.
Prepare the dataset for analysis
In Seurat (since version 4), differential analysis requires a preprocessing step to appropriately scale the normalized SCTransform assay across samples:
adp = PrepSCTFindMarkers(adp)
Found 8 SCT models. Recorrecting SCT counts using minimum median counts: 8146
As covered earlier, the identities of the Seurat object will also need to be defined. To start, we will use the clusters that were generated in the previous workshop:
Idents(adp) = "SCT_snn_res.0.1"
It is also strongly recommended to explicitly set the SCT
assay as the
basis for the differential expression. Earlier workflows that do not use
SCTransform for normalizaton use log-normalization and scaling on the
RNA
assay, and should be treated as such.
DefaultAssay(adp) = "SCT"
Setting up differential expression on a different assay
In the event that a different assay is being used for differential
expression (protein expression, ATACSeq data, initial RNA data), the
assay is defined with the same command. In this case, the
PrepSCTFindMarkers
function will not need to be run.
DefaultAssay(adp) = "RNA"
or
DefaultAssay(adp)="CITESeq"
Warning
Under no circumstances should the batch corrected or integrated
assay
be used for differential expression. The batch correction is used
primarily for dimensionality reduction, UMAP/T-SNE visualization, and
clustering. In order to optimize this process, only a select subset of
highly variable genes (~3000) is used, and this is reflected in the
size of the assay. If differential expression is run on this assay, the
differences will be accentuated, and only return the ~3000 genes that
were selected for batch correction.
2a. FindAllMarkers
The FindAllMarkers
function is particularly useful in identifying the
differentially expressed genes that distinguish several groups, such as
seen here in the clusters. What makes this unique is that none of the
identities are initially defined, as each identity will be compared with
the rest of the cohort.
de_allClusters = FindAllMarkers(adp, test.use="wilcox", min.pct=0.1, only.pos=TRUE)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Note
For expediency, this is run with a stricter min.pct
threshold on the
default Wilcoxon rank-sum test, and turns on the function-specific
parameter only.pos
, which is used to return only positively expressed
markers from each cluster, as opposed to markers that are both
significantly upregulated and downregulated.
We can peek at the results using the following commands:
head(de_allClusters)
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
Emp3 0 1.437535 0.866 0.290 0 0 Emp3
Tagln 0 1.647233 0.852 0.299 0 0 Tagln
Mfap5 0 1.877743 0.958 0.418 0 0 Mfap5
Igfbp6 0 2.028028 0.806 0.281 0 0 Igfbp6
Cd9 0 1.678611 0.760 0.261 0 0 Cd9
Acta2 0 1.559173 0.792 0.295 0 0 Acta2
nrow(de_allClusters)
[1] 25435
The head
function lets us look at the first 10 of so lines of the
table, where we can see the genes are in the rows, with relevant
statistics. The nrow
function then tells us how many genes have been
compared that matched the logfc.threshold
and min.pct
criteria. It
should be noted that these criteria do not necessarily mean that only
the significant genes have been returned; the table may still contain
genes that are not significantly dysregulated.
The differential expression table from FindAllMarkers
contains the
following statistics:
-
p_val
: The raw p-value as calculated by the chosen algorithm. -
avg_log2FC
: The average log2 fold change between the first group and the second group. In this case, it is the cluster in question and the remainder of the cells, respectively. -
pct.1
: The fraction of cells in the first group (cluster in question) that express the gene. -
pct.2
: The fraction of cells in the second group (all other cells) that express the gene. -
p_val_adj
: The adjusted p-value, as determined using False Discovery Rate (also known as Benjamini-Hochberg) algorithm. -
cluster
: The cluster that was being evaluated as "Population 1". -gene
: Reiterates the gene that is being evaluated. This is necessary because R requires unique row names.
Here we will take a look at the top 5 positively differentially expressed genes of each cluster:
top5PerCluster = matrix(ncol=7)
colnames(top5PerCluster) = colnames(de_allClusters)
for (i in 0:7){
top5PerCluster = rbind(top5PerCluster, head(de_allClusters[which(de_allClusters$cluster==i),], 5))
}
top5PerCluster=top5PerCluster[-1,]
top5PerCluster
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
Emp3 0 1.4375351 0.866 0.290 0 0 Emp3
Tagln 0 1.6472332 0.852 0.299 0 0 Tagln
Mfap5 0 1.8777432 0.958 0.418 0 0 Mfap5
Igfbp6 0 2.0280281 0.806 0.281 0 0 Igfbp6
Cd9 0 1.6786106 0.760 0.261 0 0 Cd9
Ifi207 0 1.8002046 0.751 0.370 0 1 Ifi207
A2m 0 1.8799582 0.613 0.238 0 1 A2m
Folh1 0 2.5556965 0.458 0.085 0 1 Folh1
Scara5 0 3.0511756 0.427 0.056 0 1 Scara5
Fzd1 0 1.5380712 0.597 0.247 0 1 Fzd1
Mki67 0 4.8278546 0.930 0.071 0 2 Mki67
Pclaf 0 4.7049029 0.935 0.078 0 2 Pclaf
Birc5 0 4.6066362 0.929 0.077 0 2 Birc5
Tpx2 0 4.6715031 0.899 0.079 0 2 Tpx2
Top2a 0 4.5592408 0.927 0.127 0 2 Top2a
Car3 0 0.4065029 0.917 0.185 0 3 Car3
A2m1 0 1.7465643 0.906 0.197 0 3 A2m
Cd361 0 1.1648713 0.957 0.252 0 3 Cd36
Cdo1 0 1.2141175 0.913 0.238 0 3 Cdo1
Pnpla2 0 1.4005127 0.934 0.290 0 3 Pnpla2
Lipe1 0 4.2099355 0.947 0.156 0 4 Lipe
Plin11 0 4.0682072 0.945 0.159 0 4 Plin1
Arxes11 0 3.9698014 0.899 0.126 0 4 Arxes1
Slc36a21 0 4.4037697 0.833 0.069 0 4 Slc36a2
Pcx1 0 3.7673292 0.890 0.148 0 4 Pcx
Tm4sf12 0 1.9998484 0.836 0.280 0 5 Tm4sf1
Pdgfb 0 6.0929653 0.180 0.005 0 5 Pdgfb
Upp1 0 6.8186440 0.161 0.003 0 5 Upp1
Cdh5 0 7.7923303 0.157 0.007 0 5 Cdh5
Pecam1 0 8.0957976 0.150 0.003 0 5 Pecam1
Tyrobp 0 11.2901172 0.996 0.006 0 6 Tyrobp
C1qc 0 11.4203660 0.996 0.007 0 6 C1qc
Fcer1g 0 11.0144789 1.000 0.012 0 6 Fcer1g
Trem2 0 10.4536558 0.993 0.006 0 6 Trem2
Ctss 0 10.9417908 0.991 0.005 0 6 Ctss
Wfdc211 0 7.6043603 0.850 0.035 0 7 Wfdc21
Lrg11 0 5.2555628 0.974 0.169 0 7 Lrg1
Plin41 0 3.7513090 0.893 0.115 0 7 Plin4
Slc5a31 0 4.3232241 0.893 0.160 0 7 Slc5a3
Cfd1 0 9.0781154 0.749 0.043 0 7 Cfd
This can be visualized with a heatmap:
DoHeatmap(adp,features = top5PerCluster$gene,slot="scale.data")
Warning in DoHeatmap(adp, features = top5PerCluster$gene, slot = "scale.data"):
The following features were omitted as they were not found in the scale.data
slot for the SCT assay: Ctss, C1qc, Tyrobp, Pdgfb
Tip
Here we are specifying that we are using the scale.data
slot in the
SCT
assay. During the SCTransform process, the SCT
assay has the
slot for scale.data
filled; however, this may not contain all genes as
it contains only the most highly variable genes. In the example above,
Ctss, C1qc, Tyrobp, and Pdgfb have been excluded for this reason. In
order to ensure that all genes are represented, the data
slot can be
directly called as the data reference, but it will not be z-scaled.
2b. FindMarkers
Many experiments look to compare to distinct populations, and scRNASeq is no exception. The two populations being compared can vary wildly from experiment to experiment; some look to draw comparisons based on the experimental condition, while others may look to compare different clusters within the experiment itself.
Here, we will be drawing a comparison between the two timepoints. First, we need to set the identities of the object:
Idents(adp) = "time_point"
table(Idents(adp))
Day 0 Day 6
23462 18652
And then we set up FindMarkers
to compare our two conditions:
day6_day0_de=FindMarkers(adp,ident.1="Day 6",ident.2="Day 0",test.use="wilcox")
head(day6_day0_de, 10)
p_val avg_log2FC pct.1 pct.2 p_val_adj
Emp3 0 -5.413609 0.060 0.889 0
Tagln 0 -5.992922 0.080 0.873 0
Acta2 0 -6.098134 0.079 0.825 0
Tpm2 0 -5.040596 0.084 0.805 0
Cd36 0 5.349745 0.762 0.042 0
A2m 0 9.465772 0.687 0.004 0
Cd9 0 -4.134289 0.082 0.764 0
Rbp4 0 3.769205 0.841 0.160 0
Mfap5 0 -3.605865 0.257 0.937 0
Igfbp6 0 -3.786182 0.115 0.792 0
As you can see, most of the column headers are the same as the results
from FindAllMarkers; in fact, there are fewer because the system does
not need to account for which cluster is being compared against the rest
and all the gene IDs should be unique. As mentioned previously,
ident.1
will have the placeholder of the positive fold change. Another
visualization of the differential expression can be performed using the
FeaturePlot
function:
fig1 = DimPlot(adp,group.by="time_point")
fig2 = FeaturePlot(adp,features="Acta2",order=T)
fig3 = FeaturePlot(adp,features="Cd36",order=T)
fig1/(fig2|fig3)
2c. Pseudobulk Differential Expression
One particular critique of differential expression in single cell RNASeq
analysis is p-value "inflation," where the p-values get so small that
there are far too many genes exist with p-values below 0.05, even after
adjustment. This is caused by the nature of scRNASeq, where each
individual cell essentially consists of a single replicate; as the
replicate count increases, the p-values tend to shrink. Generally, we
recommend using the DE values as a guide for significant differential
gene identification, as opposed to using the p-value for a hard cutoff.
The p-value, which is an expression of a probability of a real result,
should be examined in context with the remaining statistics, such as
fraction expression (pct.1
and pct.2
) and fold change. The entire
picture of the statistical comparison will aid in determining which
events are likely to be real and therefore actionable pursuits.
One method to counter the p-value inflation is to run a pseudobulk analysis. In this process, the total gene counts for each population of interest is aggregated across cells, and then these total counts can be treated in the same manner as a bulk RNASeq experiment. Through this process, the replicate inflation is overcome, as well as any cell-to-cell variation that may be excessively skewing results.
Here, the pseudobulk DE is run between timepoints, as before. It starts
with the AggregateExpression
function in Seurat, and collapsing the
counts by replicate. Be sure to include all potential metadata that will
be of interest.
pseudo_adp=AggregateExpression(adp,assays="RNA", return.seurat=T, group.by=c("orig.ident","time_point","condition","cond_tp"))
head(pseudo_adp@assays$RNA$counts)
6 x 8 sparse Matrix of class "dgCMatrix"
D10_Day 0_DKO_DKO Day 0 D16_Day 6_DKO_DKO Day 6 D20_Day 0_DKO_DKO Day 0
Xkr4 340 27 239
Gm19938 338 60 188
Sox17 128 9 61
Mrpl15 21040 5015 10944
Lypla1 4339 1779 3024
Tcea1 31828 4905 15522
D26_Day 6_DKO_DKO Day 6 W10_Day 0_WT_WT Day 0 W16_Day 6_WT_WT Day 6
Xkr4 31 411 33
Gm19938 71 302 55
Sox17 . 128 .
Mrpl15 7359 16171 4475
Lypla1 2608 4610 1709
Tcea1 7095 26408 4715
W20_Day 0_WT_WT Day 0 W26_Day 6_WT_WT Day 6
Xkr4 278 37
Gm19938 196 82
Sox17 31 5
Mrpl15 12725 5861
Lypla1 3020 1995
Tcea1 17656 6041
pseudo_adp@meta.data
orig.ident time_point condition cond_tp
D10_Day 0_DKO_DKO Day 0 D10_Day 0_DKO_DKO Day 0 Day 0 DKO DKO Day 0
D16_Day 6_DKO_DKO Day 6 D16_Day 6_DKO_DKO Day 6 Day 6 DKO DKO Day 6
D20_Day 0_DKO_DKO Day 0 D20_Day 0_DKO_DKO Day 0 Day 0 DKO DKO Day 0
D26_Day 6_DKO_DKO Day 6 D26_Day 6_DKO_DKO Day 6 Day 6 DKO DKO Day 6
W10_Day 0_WT_WT Day 0 W10_Day 0_WT_WT Day 0 Day 0 WT WT Day 0
W16_Day 6_WT_WT Day 6 W16_Day 6_WT_WT Day 6 Day 6 WT WT Day 6
W20_Day 0_WT_WT Day 0 W20_Day 0_WT_WT Day 0 Day 0 WT WT Day 0
W26_Day 6_WT_WT Day 6 W26_Day 6_WT_WT Day 6 Day 6 WT WT Day 6
#just to clean up the look a little bit
pseudo_adp = RenameCells(pseudo_adp,new.names=gsub("_.*","",pseudo_adp$orig.ident))
pseudo_adp$orig.ident=gsub("_.*","",pseudo_adp$orig.ident)
head(pseudo_adp@assays$RNA$counts)
6 x 8 sparse Matrix of class "dgCMatrix"
D10 D16 D20 D26 W10 W16 W20 W26
Xkr4 340 27 239 31 411 33 278 37
Gm19938 338 60 188 71 302 55 196 82
Sox17 128 9 61 . 128 . 31 5
Mrpl15 21040 5015 10944 7359 16171 4475 12725 5861
Lypla1 4339 1779 3024 2608 4610 1709 3020 1995
Tcea1 31828 4905 15522 7095 26408 4715 17656 6041
pseudo_adp@meta.data
orig.ident time_point condition cond_tp
D10 D10 Day 0 DKO DKO Day 0
D16 D16 Day 6 DKO DKO Day 6
D20 D20 Day 0 DKO DKO Day 0
D26 D26 Day 6 DKO DKO Day 6
W10 W10 Day 0 WT WT Day 0
W16 W16 Day 6 WT WT Day 6
W20 W20 Day 0 WT WT Day 0
W26 W26 Day 6 WT WT Day 6
The data now is a matrix of counts by sample, as opposed to counts by cell. DESeq2 may now be run on the pseudobulk data:
Idents(pseudo_adp)="time_point"
bulk_adp_de = FindMarkers(pseudo_adp, ident.1="Day 6", ident.2="Day 0", test.use="DESeq2")
converting counts to integer mode
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
head(bulk_adp_de)
p_val avg_log2FC pct.1 pct.2 p_val_adj
Nabp1 0 2.249765 1 1 0
Prelp 0 -2.519092 1 1 0
Arl8a 0 -1.180647 1 1 0
Glul 0 2.291869 1 1 0
Fcer1g 0 -1.648088 1 1 0
Atp1a2 0 1.528571 1 1 0
Comparing the results between the single cell and pseudobulk analysis shows that there are more genes identified as significantly dysregulated via scRNASeq DE; the overall concordance sits at about 70% of the pseudobulk genes.
scDE.genes = rownames(day6_day0_de)[which(day6_day0_de$p_val_adj<0.05)]
bulkDE.genes = rownames(bulk_adp_de)[which(bulk_adp_de$p_val_adj<0.05)]
length(scDE.genes)
[1] 10337
length(bulkDE.genes)
[1] 9201
length(intersect(scDE.genes,bulkDE.genes))
[1] 6779
Checking our two targeted genes above shows that the pseudobulk analysis still identifies them as significantly differentially expressed:
bulk_adp_de[c("Acta2","Cd36"),]
p_val avg_log2FC pct.1 pct.2 p_val_adj
Acta2 1.769929e-268 -5.490767 1 1 3.655611e-264
Cd36 1.986865e-126 4.560422 1 1 4.103671e-122
3. Visualizing Differentially Expressed Genes
In addition to the Heatmap
and the FeaturePlot
shown previously, two
other options easily accessible through Seurat are the Dot Plot and
Violin Plot. The dot plot can visualize relative expression level and
expression fraction. Using the same genes from the cluster-derived DE
results:
Idents(adp)="SCT_snn_res.0.1"
DotPlot(adp,features=unique(top5PerCluster$gene),dot.scale = 3)+coord_flip()
The violin plot is another way to visualize top genes across a smaller subset of genes. Taking the two genes from the timepoint analysis:
Idents(adp) = "time_point"
VlnPlot(adp,features=c("Acta2","Cd36"),alpha = 0.1)
Note
Striation is normal. The SCTransform technique normalizes the counts themselves at the per-gene and per-sample levels, while the older log-normalization/scaling approach created a more continuous count distribution.
4. Cell Annotation
4a. Annotation by Differential Expressed Gene Markers
From the paper where this data was obtained, the following (incomplete) list of gene markers was obtained:
Mmp3: preadipocytes
Mki67: proliferating cells
Fabp4: differentiating beige adipocytes and differentiated beige adipocytes
Scd1: differentiated beige adipocytes
Ucp1: differentiated beige adipocytes
Ppargc1a: differentiated beige adipocytes
Elovl3: differentiated beige adipocytes
Cidea: differentiated beige adipocytes
Looking at the average expression of these genes across the clusters can provide insight into annotating the cells
markers=c("Mmp3","Mki67","Fabp4","Scd1","Ucp1","Ppargc1a","Elovl3","Cidea")
Idents(adp)="SCT_snn_res.0.1"
avgExp = AverageExpression(adp,markers,assay="SCT")$SCT
As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
First group.by variable `ident` starts with a number, appending `g` to ensure valid variable names
This message is displayed once per session.
avgExp
8 x 8 sparse Matrix of class "dgCMatrix"
g0 g1 g2 g3 g4
Mmp3 1.007001e+01 2.666666667 2.024810578 0.63200123 0.1474635
Mki67 5.520821e-02 0.081859856 5.157777448 0.38975467 0.2622528
Fabp4 3.586233e+00 50.806941716 20.006834051 158.68677673 450.8907997
Scd1 7.791082e-01 7.064178127 2.212895558 26.46952631 76.7214101
Ucp1 1.179663e-04 0.008382449 0.002971327 0.01481253 0.1294067
Ppargc1a 4.895600e-03 0.039816634 0.011142475 0.10862521 0.7923474
Elovl3 1.828477e-03 0.064047151 0.023919180 0.17636167 1.2682717
Cidea 2.595258e-03 0.041519319 0.014410934 0.12976393 0.6977644
g5 g6 g7
Mmp3 2.429493088 1.446043165 1.07803468
Mki67 0.459907834 1.937050360 0.43930636
Fabp4 11.935483871 2.183453237 137.46531792
Scd1 1.003686636 0.208633094 70.86416185
Ucp1 . . .
Ppargc1a 0.003686636 . 0.05202312
Elovl3 0.005529954 . .
Cidea 0.007373272 0.001798561 .
DimPlot(adp,label=T)
FeaturePlot(adp,features=markers,ncol=3,order=T)
Comparing the genes to the clusters, the following correlations are apparent:
-
Cluster 0: Mmp3
-
Cluster 1: Moderate Fabp4, slight Scd1
-
Cluster 2: Mki67, mild Fabp4
-
Cluster 3: Moderate Fabp4, Moderate Scd1
-
Cluster 4: Fabp4, Scd1, Ucp1, Ppargc1a, Cidea
-
Cluster 5: Moderate Mmp3, slight Fabp4
-
Cluster 6: Moderate Mki67
-
Cluster 7: Fabp4, Scd1
Based on these gene markers, the following cell identities could be assigned:
-
Preadipocytes: Clusters 0 and 5
-
Proliferating cells: Clusters 2 and 6
-
Differentiating beige adipocytes: Cluster 1 and 3
-
Differentiated beige adipocytes: Cluster 4
-
Unclassified: Cluster 7
These can be relabeled via vector assignment:
adipocyte=vector(length=ncol(adp))
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(0,5))]="Preadipocytes"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(2,6))]="Proliferating cells"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(1,3))]="Differentiating beige adipocytes"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(4))]="Differentiated beige adipocytes"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(7))]="Unclassified"
adp$adipocyte = adipocyte
f1 = DimPlot(adp,group.by="SCT_snn_res.0.1",label=T) + NoLegend()
f2 = DimPlot(adp,group.by="time_point") + NoLegend()
f3 = DimPlot(adp,group.by="adipocyte",label=T)+NoLegend()
(f1|f2)/f3
4b. Annotation using SingleR
Some tools have been described in the previous session (see
here).
Today, we will be focusing on the SingleR tool, which also requires the
celldex
package.
In short, SingleR operates by comparing your current dataset against a
reference dataset, mostly through correlative gene expression. It then
assigns a score for the expression set (i.e. each cell or cluster being
compared) for all possible reference labels. This walkthrough will
explore using one of the default references from celldex
to try to
assign annotations to our current dataset.
SingleR requires that the data be mildly transformed into a different
data structure. It is also slightly more efficient to retrieve and store
the reference dataset from celldex
preemptively. This also gives the
opportunity to peek into the reference data itself.
adp.sce = as.SingleCellExperiment(adp,assay="SCT") #This selects *only* the SCT assay
mouseRNASeq = celldex::MouseRNAseqData()
head(mouseRNASeq)
class: SummarizedExperiment
dim: 6 358
metadata(0):
assays(1): logcounts
rownames(6): Xkr4 Rp1 ... Lypla1 Tcea1
rowData names(0):
colnames(358): ERR525589Aligned ERR525592Aligned ... SRR1044043Aligned
SRR1044044Aligned
colData names(3): label.main label.fine label.ont
table(mouseRNASeq$label.main)
Adipocytes Astrocytes B cells Cardiomyocytes
13 27 5 8
Dendritic cells Endothelial cells Epithelial cells Erythrocytes
2 12 2 3
Fibroblasts Granulocytes Hepatocytes Macrophages
45 15 4 32
Microglia Monocytes Neurons NK cells
72 6 64 18
Oligodendrocytes T cells
12 18
table(mouseRNASeq$label.fine)
Adipocytes aNSCs Astrocytes
13 8 24
Astrocytes activated B cells Cardiomyocytes
3 5 8
Dendritic cells Endothelial cells Ependymal
2 12 2
Erythrocytes Fibroblasts Fibroblasts activated
3 27 9
Fibroblasts senescent Granulocytes Hepatocytes
9 15 4
Macrophages Macrophages activated Microglia
26 6 56
Microglia activated Monocytes Neurons
16 6 39
Neurons activated NK cells NPCs
4 18 11
Oligodendrocytes OPCs qNSCs
10 2 2
T cells
18
For many of the celldex
datasets, there are two separate categories of
labels. The main
labels capture the family of cell types available
(e.g. T cells), while the fine
category captures the specific subtypes
(e.g. Cd4 effector T cells). The references in celldex
are
species-specific, which has to be accounted for in any reference-based
cell type annotation.
Your annotation is only as good as your reference!
In many cases, one of the default reference datasets from celldex will
be a sufficient starting point. However, some studies will require more
specific cell types than what can be retrieved from celldex
. In these
cases, one may consider retrieving a dataset from the Bioconductor
package
scRNAseq
or from the Chan-Zuckerberg Biohub, which has collected and annotated
cell types from multiple organs from both mouse (Tabula
muris) and
human (Tabula
sapiens)
samples.
Version 1: Cell-level cell type annotation
When using SingleR, the 3 primary parameters are the experimental
dataset, the reference dataset, and the labels being used. Continuing
with the main
labels of the MouseRNASeq dataset on the full dataset
looks like this:
annot = SingleR(test=adp.sce, ref=mouseRNASeq, labels=mouseRNASeq$label.main)
head(annot)
DataFrame with 6 rows and 4 columns
scores labels delta.next
<matrix> <character> <numeric>
D10_AAACCCAAGATGCTTC-1 0.523325:0.275539:0.210645:... Adipocytes 0.2110472
D10_AAACCCAAGCTCTTCC-1 0.499954:0.250459:0.226184:... Fibroblasts 0.1624889
D10_AAACCCAAGTCATCGT-1 0.474176:0.213157:0.189314:... Fibroblasts 0.3422910
D10_AAACCCAGTAGCGTCC-1 0.449555:0.245758:0.228256:... Fibroblasts 0.0501651
D10_AAACCCAGTGCACGCT-1 0.398923:0.179533:0.207357:... Fibroblasts 0.0527357
D10_AAACCCATCCACTGAA-1 0.500808:0.250242:0.187568:... Fibroblasts 0.0797964
pruned.labels
<character>
D10_AAACCCAAGATGCTTC-1 Adipocytes
D10_AAACCCAAGCTCTTCC-1 Fibroblasts
D10_AAACCCAAGTCATCGT-1 Fibroblasts
D10_AAACCCAGTAGCGTCC-1 Fibroblasts
D10_AAACCCAGTGCACGCT-1 Fibroblasts
D10_AAACCCATCCACTGAA-1 Fibroblasts
Tip
This particular step can take a while, since it compares every cell in the dataset against every cell type transcriptome in the reference. One suggestion from the developers is to divide the dataset and run the annotation on each subset independently (in a process called "scatter and gather"). A version of this that works particularly well is to run the cell type annotation on each sample prior to combining the dataset. This isn't as much a problem with this toy dataset, but will become rather pronounced as data sets increase beyond \~75k cells or 10 samples.
The results from SingleR are returned as a dataframe:
-
scores
: Correlation scores for each cell against each of the reference columns. Accessed asannot$scores
in this case -
labels
: Initial labels based on highest scoring cell type annotation -
delta.next
: The difference in scores between the highest and second-highest scoring annotations -
pruned.labels
: SingleR determines if a label should be discarded because thedelta.next
value is too small, i.e. the label can be ambiguous. Any discarded labels will be replaced withNA
.
Here, the pruned.labels
are added to the meta.data
table(annot$pruned.labels, useNA="ifany") #useNA can be used turned on in the `table` function
Adipocytes Endothelial cells Fibroblasts Macrophages
13408 103 27692 480
<NA>
431
adp$mouseRNASeq.main=annot$pruned.labels
This is actually a great example of the warning above. The dataset is specifically adipocytes, and should only have a smattering of alternative cell types. However, a number of the cells are being returned as a fibroblasts. Visualizing the two sets of annotations reveals that many of the cells identified as proliferating adipocytes or preadipocytes have been labeled as fibroblasts. Interestingly, the two clusters that are more separate from the rest of the main body received alternative annotations as endothelial cells and macrophages, which might be more accurate than the labels that were "forced" into the adipocyte pigeonholes.
annotFig1 = DimPlot(adp,group.by="adipocyte",label=T)+NoLegend()
annotFig2 = DimPlot(adp,group.by="mouseRNASeq.main")
annotFig1|annotFig2
Version 2: Cluster-level cell type annotation
Much like pseudobulk differential expression, the RNA expression can be
collapsed into pre-defined components, such as the clusters, if it is
believed that cell-to-cell variation is inducing too much confusion in
the labeling. This collapsing can be performed using either the
AggregateExpression
or AverageExpression
functions, as seen
previously. The resulting matrix can then be fed into SingleR to produce
the cluster-based annotations.
Idents(adp)="SCT_snn_res.0.1" #Assign clusters as the identities
avgExp = AverageExpression(adp, assays="SCT")$SCT #Run AverageExpression on the SCT assay and return only SCT
clustAnnot=SingleR(test=avgExp, ref=mouseRNASeq, labels=mouseRNASeq$label.main) #Run SingleR on the averaged expression matrix
clustAnnot
DataFrame with 8 rows and 4 columns
scores labels delta.next pruned.labels
<matrix> <character> <numeric> <character>
g0 0.769608:0.406627:0.311650:... Fibroblasts 0.0845489 Fibroblasts
g1 0.814336:0.425592:0.304490:... Fibroblasts 0.1485983 Fibroblasts
g2 0.759363:0.389728:0.317296:... Fibroblasts 0.0755571 Fibroblasts
g3 0.821708:0.418202:0.304515:... Adipocytes 0.0533554 Adipocytes
g4 0.834166:0.416754:0.310953:... Adipocytes 0.1534154 Adipocytes
g5 0.728947:0.368267:0.359848:... Fibroblasts 0.0653320 Fibroblasts
g6 0.570283:0.210928:0.542144:... Macrophages 0.1005743 Macrophages
g7 0.828192:0.385496:0.314414:... Adipocytes 0.0671988 Adipocytes
As before, most of the labels are assigned to either fibroblasts and adipocytes. The data takes a little massaging to assign to the scRNASeq metadata, but it is doable.
clustLabels = as.vector(clustAnnot$pruned.labels) #retrieve only the cluster-derived annotations
names(clustLabels) = c(0:7) #assign the cluster numbers as the annotations
clustLabels.vect = clustLabels[match(adp$SCT_snn_res.0.1, names(clustLabels))] #match the cluster identities per cell in the Seurat data to the cluster labels
names(clustLabels.vect) = colnames(adp) #ensure that the cluster identities are assigned the cell names
adp$mouseRNASeq.main.clust = clustLabels.vect #add the cluster annotations to the vector
clustAnnotFig1 = DimPlot(adp,group.by="SCT_snn_res.0.1",label=T)+NoLegend()
clustAnnotFig2 = DimPlot(adp,group.by="adipocyte",label=T)+NoLegend()
clustAnnotFig3 = DimPlot(adp,group.by="mouseRNASeq.main")
clustAnnotFig4 = DimPlot(adp,group.by="mouseRNASeq.main.clust")
(clustAnnotFig1|clustAnnotFig2)/(clustAnnotFig3|clustAnnotFig4)
Generally, the annotation between the per-cell annotation coincides with the per-cluster annotation. This image is also somewhat cleaner, as the cell-to-cell variation is superseded into the clusters. However, this also provides an example where this "collapsing" process can lead to smaller populations being overlooked: The cluster that was labeled as endothelial cells (at \~ [-10,-8]) was relabeled as fibroblasts, due to its initial inclusion as part of cluster 5.
5. Conclusions
Single cell RNASeq is a remarkably powerful tool for analyzing populations of cells that can be recovered from various experiments. Clustering and cell type annotation can be used to distinguish different populations with a level of specificity not otherwise available in bulk sequencing methods. Differential expression can then be applied to identify critical genes that define and classify the cell types. Inversely, differential expression can then be used to identify and classify the subpopulations, especially where cell type annotation falls short of the known biology. In either case, allow the biology to influence the bioinformatics, and use the bioinformatics to expand the biology.
Happy (gene) hunting!
Acknowledgements:
The following sources inspired this content:
This is only a small subset of tools available to single cell RNASeq. For further information and tools, please feel free to reach out to CCBR or BTEP.