Getting Started with Seurat: Differential Expression and Classification

1. Introduction and Learning Objectives

This tutorial has been designed to demonstrate common secondary analysis steps in a scRNA-Seq workflow. We will start with a merged Seurat Object with multiple data layers representing multiple samples that have already been filtered and undergone preliminary clustering.

Throughout this tutorial we will

Explore setting and visualizing identities in a single cell dataset\
Perform differential expression analysis through Seurat\
Use differentially expressed genes to classify cells\
Run a case test of cell type annotation using SingleR

This tutorial largely follows the standard unsupervised clustering workflow by Seurat and the differential expression testing vignette, with slight deviations and a different data set.

Warning

This document in no way represents a complete analysis. Here, we are demonstrating many of the standard steps in Seurat. Results here would be considered preliminary, requiring greater care and consideration of biology.

The data

In this tutorial, we will continue to use data from Nanduri et al. 2022, Epigenetic regulation of white adipose tissue plasticity and energy metabolism by nucleosome binding HMGN proteins, published in Nature Communications.

As a reminder, this study examined "the role of HMGNs in white adipocyte browning by comparing wild-type (WT) mice and cells to genetically derived mice and cells lacking the two major members of the HMGN protein family, HMGN1 and HMGN2 (DKO mice).". The experimental design included WT vs DKO mice, 2 replicates each, at two time points, 0 and 6 days. At day 0, cells were in a preadipocyte state, while at day 6 they had differentiated into adipocytes. Each time point had two replicates of each condition (WT vs DKO), for a total of 8 samples.

An R project directory containing these data and associated files can be downloaded here.

Consider the biology in your analysis.

The biology of your experiment will inform your analysis. While scRNA-Seq can be exploratory, you should have some general idea of the cell types that you expect. Knowing some characteristics of expected cells, for example, cell size or whether cells are complex or more energetically active is helpful for data processing.

For example, in this experiment, we expect cells transitioning from preadipocytes to adipocytes and then beige adipocytes.

Load the packages

library(tidyverse) # dplyr and ggplot2; CRAN
library(Seurat) # Seurat toolkit; CRAN
library(hdf5r) # for data import; CRAN
library(patchwork) # for plotting; CRAN 
library(presto) # for differential expression; Github
library(glmGamPoi) # for sctransform; Bioconductor
library(ggplot2) # for visualization; installed with tidyverse
library(dplyr) #just to make sure it was called; installed with tidyverse


library(SingleR) # for cell type annotation; Bioconductor
library(celldex) # for cell type annotation reference; Bioconductor
library(MAST) # for differential expression; Bioconductor

Load the Seurat Object

Here, we will start with the data stored in a Seurat object. For instructions on data import and creating the object, see an Introduction to scRNA-Seq with R (Seurat) and Getting Started with Seurat: QC to Clustering.

adp <- readRDS("../outputs/adp_merge_filt_sctran_clust0.1.rds")
mem.maxVSize()

[1] 16384

mem.maxVSize(vsize=16384*4)

[1] 65536

Tip

Many of the steps used in differential expression can be very memory-intensive, and on some computers, this could lead to errors or warnings that look like this:

Found 8 SCT models. Recorrecting SCT counts using minimum median counts: 8146
  |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=01m 58s
Error: vector memory limit of 16.0 Gb reached, see mem.maxVSize()
Error during wrapup: vector memory limit of 16.0 Gb reached, see mem.maxVSize()
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

This particular error indicates that there isn't enough virtual memory allocated to run a process (in this case, PrepSCTFindMarkers). This can be corrected by adjusting the memory that R is allowed to access.

mem.maxVSize()
#[1] 16384
mem.maxVSize(vsize=16384*4)
#[1] 65536

For more details, please see the documentation on memory allocation.

Examine the object:

glimpse(adp)

Formal class 'Seurat' [package "SeuratObject"] with 13 slots
  ..@ assays      :List of 2
  .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
  .. ..$ SCT:Formal class 'SCTAssay' [package "Seurat"] with 9 slots
  ..@ meta.data   :'data.frame':    42114 obs. of  11 variables:
  .. ..$ orig.ident     : chr [1:42114] "D10" "D10" "D10" "D10" ...
  .. ..$ nCount_RNA     : num [1:42114] 11188 32832 28515 18954 27181 ...
  .. ..$ nFeature_RNA   : int [1:42114] 3383 5823 5002 4305 4586 5015 3927 4006 5489 4093 ...
  .. ..$ condition      : chr [1:42114] "DKO" "DKO" "DKO" "DKO" ...
  .. ..$ time_point     : chr [1:42114] "Day 0" "Day 0" "Day 0" "Day 0" ...
  .. ..$ cond_tp        : chr [1:42114] "DKO Day 0" "DKO Day 0" "DKO Day 0" "DKO Day 0" ...
  .. ..$ percent.mt     : num [1:42114] 3.16 1.8 1.39 1.89 1.33 ...
  .. ..$ nCount_SCT     : num [1:42114] 23551 25705 25491 23790 25367 ...
  .. ..$ nFeature_SCT   : int [1:42114] 3843 5809 5002 4306 4586 5015 3967 4038 5425 4181 ...
  .. ..$ SCT_snn_res.0.1: Factor w/ 8 levels "0","1","2","3",..: 8 1 1 1 1 1 8 1 3 7 ...
  .. ..$ seurat_clusters: Factor w/ 8 levels "0","1","2","3",..: 8 1 1 1 1 1 8 1 3 7 ...
  ..@ active.assay: chr "SCT"
  ..@ active.ident: Factor w/ 8 levels "0","1","2","3",..: 8 1 1 1 1 1 8 1 3 7 ...
  .. ..- attr(*, "names")= chr [1:42114] "D10_AAACCCAAGATGCTTC-1" "D10_AAACCCAAGCTCTTCC-1" "D10_AAACCCAAGTCATCGT-1" "D10_AAACCCAGTAGCGTCC-1" ...
  ..@ graphs      :List of 2
  .. ..$ SCT_nn :Formal class 'Graph' [package "SeuratObject"] with 7 slots
  .. ..$ SCT_snn:Formal class 'Graph' [package "SeuratObject"] with 7 slots
  ..@ neighbors   : list()
  ..@ reductions  :List of 2
  .. ..$ pca :Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
  .. ..$ umap:Formal class 'DimReduc' [package "SeuratObject"] with 9 slots
  ..@ images      : list()
  ..@ project.name: chr "Adipose"
  ..@ misc        : list()
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 5 0 1
  ..@ commands    :List of 5
  .. ..$ SCTransform.RNA      :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
  .. ..$ RunPCA.SCT           :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
  .. ..$ FindNeighbors.SCT.pca:Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
  .. ..$ FindClusters         :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
  .. ..$ RunUMAP.SCT.pca      :Formal class 'SeuratCommand' [package "SeuratObject"] with 5 slots
  ..@ tools       : list()

Reviewing the Data

As we can see above, the glimpse command shows the metadata that can be used to classify the cells. Within Seurat, the metadata is used to define the "identity" of the dataset. This is critical, as the labels within the categories will be used to determine which groups of cells are being compared. One method of viewing the available labels is to use the table function, which lists the frequency of each label:

table(adp$cond_tp)

DKO Day 0 DKO Day 6  WT Day 0  WT Day 6 
    11655     10512     11807      8140

The table function can also be used in two dimensions:

table(adp$cond_tp,adp$orig.ident)

             D10  D16  D20  D26  W10  W16  W20  W26
  DKO Day 0 6990    0 4665    0    0    0    0    0
  DKO Day 6    0 4674    0 5838    0    0    0    0
  WT Day 0     0    0    0    0 6780    0 5027    0
  WT Day 6     0    0    0    0    0 3781    0 4359

Before running any sort of differential expression, the identity of the Seurat object needs to be explicitly defined. Different identities can be assigned, based on what needs to be analyzed. Recall this visualization of the original identities, clusters, time point, experimental condition, and combined time point and condition:

DimPlot(adp, reduction = "umap", group.by = c("orig.ident", "seurat_clusters","time_point","condition","cond_tp"),
        alpha=0.4, ncol=2)

Here we will set the identity to the clusters, and view the frequency of the resulting cell identities:

Idents(adp) = "SCT_snn_res.0.1"
table(Idents(adp))

    0     1     2     3     4     5     6     7 
16954  7635  6731  6481  2326  1085   556   346

Alternate identity classification

If there are identities defined in another vector, the vector values can be explicitly assigned to the dataset, without adding the identities to the metadata itself:

clusters=adp$SCT_snn_res.0.1
Idents(adp)=clusters

However, if the identities are overwritten, you will need to call the vector again, unless the vector is added to the metadata.

2. Differential Expression

What is differential expression analysis?

Differential expression analysis is the process of identifying genes that have a significant difference in expression between two or more groups. For many sequencing experiments, regardless of methodology, differential analysis lays the foundation of the results and any biological interpretation.

Different tools can be used for differential gene expression analysis, depending on the underlying statistics. One of the best known statistical tests is the Student's t-test, which tests for how likely it is that two sets of measurements come from the same normal distribution. However, this does assume that the data is normally distributed, meaning that the data has the distinctive bell curve shape. Should this assumption be inconsistent, a non-parametric version of this test is the Wilcoxon rank-sum test, which tests how likely it is that one population is larger than the other, regardless of probability distribution size. Alternative distribution shapes can include bimodal or multimodal distributions, with multiple distinct peaks, and skewed distributions with data that favors one side or another (examples).

For many sequencing experiments, the assumption of normality is rarely guaranteed. Some of the more popular tools for bulk RNASeq experiments, such as DESeq2, limma, and edgeR, acknowledge this, and use different statistical models to identify and interpret differences in gene expression.

Single cell RNASeq is notorious for not having a normal distribution of gene expression. For technical reasons, scRNASeq has a transcript recovery rate between 60%-80%, which naturally skews the gene expression towards the non-expressed end.

plot(density(sample(JoinLayers(adp@assays$RNA)$count["Gapdh",],2500)),cex=0,lty=1, main="Density of Gapdh in 2500 random cells")

hist(sample(JoinLayers(adp@assays$RNA)$count["Gapdh",],2500),breaks=99,main="Histogram of Gapdh in 2500 random cells",ylab="Frequency",xlab="Gene counts")

Note

Viewing the code, JoinLayers was called to link all the data sets together. This is a new feature of Seurat5, and is required for analyzing data after integration and batch correction (ref).

The Seurat tool acknowledges this, and by default uses the Wilcoxon rank-sum test to identify differentially expressed genes, via the presto package. Another algorithm that is often used is MAST, which implements a hurdle model to determine the likelihood that a gene is genuinely not expressed in a cell, or if it is more likely a technical dropout.

Running Differential Expression in Seurat

The primary means of running differential expression in Seurat is through the FindMarkers function. The main usage for this function is as follows:

FindMarkers(object,ident.1= ..., ident.2=..., test.use="wilcox", min.pct = 0.01, logfc.threshold = 0.1)

object: The Seurat object being examined
ident.1: The first or main label in the comparison. Positive fold change values indicate upregulation in ident.1
ident.2: The second label in the comparison. If ident.2 is not defined, it is automatically set to all remaining cells in the comparison. In contrast to ident.1, negative fold changes indicated upregulation in ident.2.
test.use: The statistical test that is used to compare genes between ident.1 and ident.2. The default values is the Wilcoxon rank-sum test
min.pct: The minimum fraction of cells in at least one of the two groups that must express the gene in order to be compared. If this value is not exceeded in both populations, the gene is excluded from the differential expression analysis.
logfc.threshold: The minimum difference in average fold change between the two populations before the gene is evaluated. If the gene is below the threshold, i.e. the expression between the populations is too similar, the gene will excluded from the DE analysis.

In the event that a user wants to recover DE statistics for all genes, min.pct and logfc.threshold can be set to 0. Additional parameters, including other statistical tests, can be found in the Help menu for FindMarkers.

Prepare the dataset for analysis

In Seurat (since version 4), differential analysis requires a preprocessing step to appropriately scale the normalized SCTransform assay across samples:

adp = PrepSCTFindMarkers(adp)

Found 8 SCT models. Recorrecting SCT counts using minimum median counts: 8146

As covered earlier, the identities of the Seurat object will also need to be defined. To start, we will use the clusters that were generated in the previous workshop:

Idents(adp) = "SCT_snn_res.0.1"

It is also strongly recommended to explicitly set the SCT assay as the basis for the differential expression. Earlier workflows that do not use SCTransform for normalizaton use log-normalization and scaling on the RNA assay, and should be treated as such.

DefaultAssay(adp) = "SCT"

Setting up differential expression on a different assay

In the event that a different assay is being used for differential expression (protein expression, ATACSeq data, initial RNA data), the assay is defined with the same command. In this case, the PrepSCTFindMarkers function will not need to be run.

DefaultAssay(adp) = "RNA"

or

DefaultAssay(adp)="CITESeq"

Warning

Under no circumstances should the batch corrected or integrated assay be used for differential expression. The batch correction is used primarily for dimensionality reduction, UMAP/T-SNE visualization, and clustering. In order to optimize this process, only a select subset of highly variable genes (~3000) is used, and this is reflected in the size of the assay. If differential expression is run on this assay, the differences will be accentuated, and only return the ~3000 genes that were selected for batch correction.

2a. FindAllMarkers

The FindAllMarkers function is particularly useful in identifying the differentially expressed genes that distinguish several groups, such as seen here in the clusters. What makes this unique is that none of the identities are initially defined, as each identity will be compared with the rest of the cohort.

de_allClusters = FindAllMarkers(adp, test.use="wilcox", min.pct=0.1, only.pos=TRUE)

Calculating cluster 0

Calculating cluster 1

Calculating cluster 2

Calculating cluster 3

Calculating cluster 4

Calculating cluster 5

Calculating cluster 6

Calculating cluster 7

Note

For expediency, this is run with a stricter min.pct threshold on the default Wilcoxon rank-sum test, and turns on the function-specific parameter only.pos, which is used to return only positively expressed markers from each cluster, as opposed to markers that are both significantly upregulated and downregulated.

We can peek at the results using the following commands:

head(de_allClusters)

       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster   gene
Emp3       0   1.437535 0.866 0.290         0       0   Emp3
Tagln      0   1.647233 0.852 0.299         0       0  Tagln
Mfap5      0   1.877743 0.958 0.418         0       0  Mfap5
Igfbp6     0   2.028028 0.806 0.281         0       0 Igfbp6
Cd9        0   1.678611 0.760 0.261         0       0    Cd9
Acta2      0   1.559173 0.792 0.295         0       0  Acta2

nrow(de_allClusters)

[1] 25435

The head function lets us look at the first 10 of so lines of the table, where we can see the genes are in the rows, with relevant statistics. The nrow function then tells us how many genes have been compared that matched the logfc.threshold and min.pct criteria. It should be noted that these criteria do not necessarily mean that only the significant genes have been returned; the table may still contain genes that are not significantly dysregulated.

The differential expression table from FindAllMarkers contains the following statistics:

p_val: The raw p-value as calculated by the chosen algorithm.
avg_log2FC: The average log2 fold change between the first group and the second group. In this case, it is the cluster in question and the remainder of the cells, respectively.
pct.1: The fraction of cells in the first group (cluster in question) that express the gene.
pct.2: The fraction of cells in the second group (all other cells) that express the gene.
p_val_adj: The adjusted p-value, as determined using False Discovery Rate (also known as Benjamini-Hochberg) algorithm.
cluster: The cluster that was being evaluated as "Population 1". - gene: Reiterates the gene that is being evaluated. This is necessary because R requires unique row names.

Here we will take a look at the top 5 positively differentially expressed genes of each cluster:

top5PerCluster = matrix(ncol=7)
colnames(top5PerCluster) = colnames(de_allClusters)
for (i in 0:7){
  top5PerCluster = rbind(top5PerCluster, head(de_allClusters[which(de_allClusters$cluster==i),], 5))
}
top5PerCluster=top5PerCluster[-1,]
top5PerCluster

         p_val avg_log2FC pct.1 pct.2 p_val_adj cluster    gene
Emp3         0  1.4375351 0.866 0.290         0       0    Emp3
Tagln        0  1.6472332 0.852 0.299         0       0   Tagln
Mfap5        0  1.8777432 0.958 0.418         0       0   Mfap5
Igfbp6       0  2.0280281 0.806 0.281         0       0  Igfbp6
Cd9          0  1.6786106 0.760 0.261         0       0     Cd9
Ifi207       0  1.8002046 0.751 0.370         0       1  Ifi207
A2m          0  1.8799582 0.613 0.238         0       1     A2m
Folh1        0  2.5556965 0.458 0.085         0       1   Folh1
Scara5       0  3.0511756 0.427 0.056         0       1  Scara5
Fzd1         0  1.5380712 0.597 0.247         0       1    Fzd1
Mki67        0  4.8278546 0.930 0.071         0       2   Mki67
Pclaf        0  4.7049029 0.935 0.078         0       2   Pclaf
Birc5        0  4.6066362 0.929 0.077         0       2   Birc5
Tpx2         0  4.6715031 0.899 0.079         0       2    Tpx2
Top2a        0  4.5592408 0.927 0.127         0       2   Top2a
Car3         0  0.4065029 0.917 0.185         0       3    Car3
A2m1         0  1.7465643 0.906 0.197         0       3     A2m
Cd361        0  1.1648713 0.957 0.252         0       3    Cd36
Cdo1         0  1.2141175 0.913 0.238         0       3    Cdo1
Pnpla2       0  1.4005127 0.934 0.290         0       3  Pnpla2
Lipe1        0  4.2099355 0.947 0.156         0       4    Lipe
Plin11       0  4.0682072 0.945 0.159         0       4   Plin1
Arxes11      0  3.9698014 0.899 0.126         0       4  Arxes1
Slc36a21     0  4.4037697 0.833 0.069         0       4 Slc36a2
Pcx1         0  3.7673292 0.890 0.148         0       4     Pcx
Tm4sf12      0  1.9998484 0.836 0.280         0       5  Tm4sf1
Pdgfb        0  6.0929653 0.180 0.005         0       5   Pdgfb
Upp1         0  6.8186440 0.161 0.003         0       5    Upp1
Cdh5         0  7.7923303 0.157 0.007         0       5    Cdh5
Pecam1       0  8.0957976 0.150 0.003         0       5  Pecam1
Tyrobp       0 11.2901172 0.996 0.006         0       6  Tyrobp
C1qc         0 11.4203660 0.996 0.007         0       6    C1qc
Fcer1g       0 11.0144789 1.000 0.012         0       6  Fcer1g
Trem2        0 10.4536558 0.993 0.006         0       6   Trem2
Ctss         0 10.9417908 0.991 0.005         0       6    Ctss
Wfdc211      0  7.6043603 0.850 0.035         0       7  Wfdc21
Lrg11        0  5.2555628 0.974 0.169         0       7    Lrg1
Plin41       0  3.7513090 0.893 0.115         0       7   Plin4
Slc5a31      0  4.3232241 0.893 0.160         0       7  Slc5a3
Cfd1         0  9.0781154 0.749 0.043         0       7     Cfd

This can be visualized with a heatmap:

DoHeatmap(adp,features = top5PerCluster$gene,slot="scale.data")

Warning in DoHeatmap(adp, features = top5PerCluster$gene, slot = "scale.data"):
The following features were omitted as they were not found in the scale.data
slot for the SCT assay: Ctss, C1qc, Tyrobp, Pdgfb

Tip

Here we are specifying that we are using the scale.data slot in the SCT assay. During the SCTransform process, the SCT assay has the slot for scale.data filled; however, this may not contain all genes as it contains only the most highly variable genes. In the example above, Ctss, C1qc, Tyrobp, and Pdgfb have been excluded for this reason. In order to ensure that all genes are represented, the data slot can be directly called as the data reference, but it will not be z-scaled.

2b. FindMarkers

Many experiments look to compare to distinct populations, and scRNASeq is no exception. The two populations being compared can vary wildly from experiment to experiment; some look to draw comparisons based on the experimental condition, while others may look to compare different clusters within the experiment itself.

Here, we will be drawing a comparison between the two timepoints. First, we need to set the identities of the object:

Idents(adp) = "time_point"
table(Idents(adp))

Day 0 Day 6 
23462 18652

And then we set up FindMarkers to compare our two conditions:

day6_day0_de=FindMarkers(adp,ident.1="Day 6",ident.2="Day 0",test.use="wilcox")
head(day6_day0_de, 10)

       p_val avg_log2FC pct.1 pct.2 p_val_adj
Emp3       0  -5.413609 0.060 0.889         0
Tagln      0  -5.992922 0.080 0.873         0
Acta2      0  -6.098134 0.079 0.825         0
Tpm2       0  -5.040596 0.084 0.805         0
Cd36       0   5.349745 0.762 0.042         0
A2m        0   9.465772 0.687 0.004         0
Cd9        0  -4.134289 0.082 0.764         0
Rbp4       0   3.769205 0.841 0.160         0
Mfap5      0  -3.605865 0.257 0.937         0
Igfbp6     0  -3.786182 0.115 0.792         0

As you can see, most of the column headers are the same as the results from FindAllMarkers; in fact, there are fewer because the system does not need to account for which cluster is being compared against the rest and all the gene IDs should be unique. As mentioned previously, ident.1 will have the placeholder of the positive fold change. Another visualization of the differential expression can be performed using the FeaturePlot function:

fig1 = DimPlot(adp,group.by="time_point")
fig2 = FeaturePlot(adp,features="Acta2",order=T)
fig3 = FeaturePlot(adp,features="Cd36",order=T)

fig1/(fig2|fig3)

2c. Pseudobulk Differential Expression

One particular critique of differential expression in single cell RNASeq analysis is p-value "inflation," where the p-values get so small that there are far too many genes exist with p-values below 0.05, even after adjustment. This is caused by the nature of scRNASeq, where each individual cell essentially consists of a single replicate; as the replicate count increases, the p-values tend to shrink. Generally, we recommend using the DE values as a guide for significant differential gene identification, as opposed to using the p-value for a hard cutoff. The p-value, which is an expression of a probability of a real result, should be examined in context with the remaining statistics, such as fraction expression (pct.1 and pct.2) and fold change. The entire picture of the statistical comparison will aid in determining which events are likely to be real and therefore actionable pursuits.

One method to counter the p-value inflation is to run a pseudobulk analysis. In this process, the total gene counts for each population of interest is aggregated across cells, and then these total counts can be treated in the same manner as a bulk RNASeq experiment. Through this process, the replicate inflation is overcome, as well as any cell-to-cell variation that may be excessively skewing results.

Here, the pseudobulk DE is run between timepoints, as before. It starts with the AggregateExpression function in Seurat, and collapsing the counts by replicate. Be sure to include all potential metadata that will be of interest.

pseudo_adp=AggregateExpression(adp,assays="RNA", return.seurat=T, group.by=c("orig.ident","time_point","condition","cond_tp"))

head(pseudo_adp@assays$RNA$counts)

6 x 8 sparse Matrix of class "dgCMatrix"
        D10_Day 0_DKO_DKO Day 0 D16_Day 6_DKO_DKO Day 6 D20_Day 0_DKO_DKO Day 0
Xkr4                        340                      27                     239
Gm19938                     338                      60                     188
Sox17                       128                       9                      61
Mrpl15                    21040                    5015                   10944
Lypla1                     4339                    1779                    3024
Tcea1                     31828                    4905                   15522
        D26_Day 6_DKO_DKO Day 6 W10_Day 0_WT_WT Day 0 W16_Day 6_WT_WT Day 6
Xkr4                         31                   411                    33
Gm19938                      71                   302                    55
Sox17                         .                   128                     .
Mrpl15                     7359                 16171                  4475
Lypla1                     2608                  4610                  1709
Tcea1                      7095                 26408                  4715
        W20_Day 0_WT_WT Day 0 W26_Day 6_WT_WT Day 6
Xkr4                      278                    37
Gm19938                   196                    82
Sox17                      31                     5
Mrpl15                  12725                  5861
Lypla1                   3020                  1995
Tcea1                   17656                  6041

pseudo_adp@meta.data

                                     orig.ident time_point condition   cond_tp
D10_Day 0_DKO_DKO Day 0 D10_Day 0_DKO_DKO Day 0      Day 0       DKO DKO Day 0
D16_Day 6_DKO_DKO Day 6 D16_Day 6_DKO_DKO Day 6      Day 6       DKO DKO Day 6
D20_Day 0_DKO_DKO Day 0 D20_Day 0_DKO_DKO Day 0      Day 0       DKO DKO Day 0
D26_Day 6_DKO_DKO Day 6 D26_Day 6_DKO_DKO Day 6      Day 6       DKO DKO Day 6
W10_Day 0_WT_WT Day 0     W10_Day 0_WT_WT Day 0      Day 0        WT  WT Day 0
W16_Day 6_WT_WT Day 6     W16_Day 6_WT_WT Day 6      Day 6        WT  WT Day 6
W20_Day 0_WT_WT Day 0     W20_Day 0_WT_WT Day 0      Day 0        WT  WT Day 0
W26_Day 6_WT_WT Day 6     W26_Day 6_WT_WT Day 6      Day 6        WT  WT Day 6

#just to clean up the look a little bit
pseudo_adp = RenameCells(pseudo_adp,new.names=gsub("_.*","",pseudo_adp$orig.ident))
pseudo_adp$orig.ident=gsub("_.*","",pseudo_adp$orig.ident)
head(pseudo_adp@assays$RNA$counts)

6 x 8 sparse Matrix of class "dgCMatrix"
          D10  D16   D20  D26   W10  W16   W20  W26
Xkr4      340   27   239   31   411   33   278   37
Gm19938   338   60   188   71   302   55   196   82
Sox17     128    9    61    .   128    .    31    5
Mrpl15  21040 5015 10944 7359 16171 4475 12725 5861
Lypla1   4339 1779  3024 2608  4610 1709  3020 1995
Tcea1   31828 4905 15522 7095 26408 4715 17656 6041

pseudo_adp@meta.data

    orig.ident time_point condition   cond_tp
D10        D10      Day 0       DKO DKO Day 0
D16        D16      Day 6       DKO DKO Day 6
D20        D20      Day 0       DKO DKO Day 0
D26        D26      Day 6       DKO DKO Day 6
W10        W10      Day 0        WT  WT Day 0
W16        W16      Day 6        WT  WT Day 6
W20        W20      Day 0        WT  WT Day 0
W26        W26      Day 6        WT  WT Day 6

The data now is a matrix of counts by sample, as opposed to counts by cell. DESeq2 may now be run on the pseudobulk data:

Idents(pseudo_adp)="time_point"
bulk_adp_de = FindMarkers(pseudo_adp, ident.1="Day 6", ident.2="Day 0", test.use="DESeq2")

converting counts to integer mode

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

head(bulk_adp_de)

       p_val avg_log2FC pct.1 pct.2 p_val_adj
Nabp1      0   2.249765     1     1         0
Prelp      0  -2.519092     1     1         0
Arl8a      0  -1.180647     1     1         0
Glul       0   2.291869     1     1         0
Fcer1g     0  -1.648088     1     1         0
Atp1a2     0   1.528571     1     1         0

Comparing the results between the single cell and pseudobulk analysis shows that there are more genes identified as significantly dysregulated via scRNASeq DE; the overall concordance sits at about 70% of the pseudobulk genes.

scDE.genes = rownames(day6_day0_de)[which(day6_day0_de$p_val_adj<0.05)]
bulkDE.genes = rownames(bulk_adp_de)[which(bulk_adp_de$p_val_adj<0.05)]
length(scDE.genes)

[1] 10337

length(bulkDE.genes)

[1] 9201

length(intersect(scDE.genes,bulkDE.genes))

[1] 6779

Checking our two targeted genes above shows that the pseudobulk analysis still identifies them as significantly differentially expressed:

bulk_adp_de[c("Acta2","Cd36"),]

              p_val avg_log2FC pct.1 pct.2     p_val_adj
Acta2 1.769929e-268  -5.490767     1     1 3.655611e-264
Cd36  1.986865e-126   4.560422     1     1 4.103671e-122

3. Visualizing Differentially Expressed Genes

In addition to the Heatmap and the FeaturePlot shown previously, two other options easily accessible through Seurat are the Dot Plot and Violin Plot. The dot plot can visualize relative expression level and expression fraction. Using the same genes from the cluster-derived DE results:

Idents(adp)="SCT_snn_res.0.1"
DotPlot(adp,features=unique(top5PerCluster$gene),dot.scale = 3)+coord_flip()

The violin plot is another way to visualize top genes across a smaller subset of genes. Taking the two genes from the timepoint analysis:

Idents(adp) = "time_point"
VlnPlot(adp,features=c("Acta2","Cd36"),alpha = 0.1)

Note

Striation is normal. The SCTransform technique normalizes the counts themselves at the per-gene and per-sample levels, while the older log-normalization/scaling approach created a more continuous count distribution.

4. Cell Annotation

4a. Annotation by Differential Expressed Gene Markers

From the paper where this data was obtained, the following (incomplete) list of gene markers was obtained:

Mmp3: preadipocytes
Mki67: proliferating cells
Fabp4: differentiating beige adipocytes and differentiated beige adipocytes
Scd1: differentiated beige adipocytes
Ucp1: differentiated beige adipocytes
Ppargc1a: differentiated beige adipocytes
Elovl3: differentiated beige adipocytes
Cidea: differentiated beige adipocytes

Looking at the average expression of these genes across the clusters can provide insight into annotating the cells

markers=c("Mmp3","Mki67","Fabp4","Scd1","Ucp1","Ppargc1a","Elovl3","Cidea")
Idents(adp)="SCT_snn_res.0.1"
avgExp = AverageExpression(adp,markers,assay="SCT")$SCT

As of Seurat v5, we recommend using AggregateExpression to perform pseudo-bulk analysis.
First group.by variable `ident` starts with a number, appending `g` to ensure valid variable names
This message is displayed once per session.

avgExp

8 x 8 sparse Matrix of class "dgCMatrix"
                   g0           g1           g2           g3          g4
Mmp3     1.007001e+01  2.666666667  2.024810578   0.63200123   0.1474635
Mki67    5.520821e-02  0.081859856  5.157777448   0.38975467   0.2622528
Fabp4    3.586233e+00 50.806941716 20.006834051 158.68677673 450.8907997
Scd1     7.791082e-01  7.064178127  2.212895558  26.46952631  76.7214101
Ucp1     1.179663e-04  0.008382449  0.002971327   0.01481253   0.1294067
Ppargc1a 4.895600e-03  0.039816634  0.011142475   0.10862521   0.7923474
Elovl3   1.828477e-03  0.064047151  0.023919180   0.17636167   1.2682717
Cidea    2.595258e-03  0.041519319  0.014410934   0.12976393   0.6977644
                   g5          g6           g7
Mmp3      2.429493088 1.446043165   1.07803468
Mki67     0.459907834 1.937050360   0.43930636
Fabp4    11.935483871 2.183453237 137.46531792
Scd1      1.003686636 0.208633094  70.86416185
Ucp1      .           .             .         
Ppargc1a  0.003686636 .             0.05202312
Elovl3    0.005529954 .             .         
Cidea     0.007373272 0.001798561   .

DimPlot(adp,label=T)

FeaturePlot(adp,features=markers,ncol=3,order=T)

Comparing the genes to the clusters, the following correlations are apparent:

Cluster 0: Mmp3
Cluster 1: Moderate Fabp4, slight Scd1
Cluster 2: Mki67, mild Fabp4
Cluster 3: Moderate Fabp4, Moderate Scd1
Cluster 4: Fabp4, Scd1, Ucp1, Ppargc1a, Cidea
Cluster 5: Moderate Mmp3, slight Fabp4
Cluster 6: Moderate Mki67
Cluster 7: Fabp4, Scd1

Based on these gene markers, the following cell identities could be assigned:

Preadipocytes: Clusters 0 and 5
Proliferating cells: Clusters 2 and 6
Differentiating beige adipocytes: Cluster 1 and 3
Differentiated beige adipocytes: Cluster 4
Unclassified: Cluster 7

These can be relabeled via vector assignment:

adipocyte=vector(length=ncol(adp))
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(0,5))]="Preadipocytes"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(2,6))]="Proliferating cells"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(1,3))]="Differentiating beige adipocytes"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(4))]="Differentiated beige adipocytes"
adipocyte[which(adp$SCT_snn_res.0.1 %in% c(7))]="Unclassified"
adp$adipocyte = adipocyte

f1 = DimPlot(adp,group.by="SCT_snn_res.0.1",label=T) + NoLegend()
f2 = DimPlot(adp,group.by="time_point") + NoLegend()
f3 = DimPlot(adp,group.by="adipocyte",label=T)+NoLegend()

(f1|f2)/f3

4b. Annotation using SingleR

Some tools have been described in the previous session (see here). Today, we will be focusing on the SingleR tool, which also requires the celldex package. In short, SingleR operates by comparing your current dataset against a reference dataset, mostly through correlative gene expression. It then assigns a score for the expression set (i.e. each cell or cluster being compared) for all possible reference labels. This walkthrough will explore using one of the default references from celldex to try to assign annotations to our current dataset.

SingleR requires that the data be mildly transformed into a different data structure. It is also slightly more efficient to retrieve and store the reference dataset from celldex preemptively. This also gives the opportunity to peek into the reference data itself.

adp.sce = as.SingleCellExperiment(adp,assay="SCT") #This selects *only* the SCT assay
mouseRNASeq = celldex::MouseRNAseqData()
head(mouseRNASeq)

class: SummarizedExperiment 
dim: 6 358 
metadata(0):
assays(1): logcounts
rownames(6): Xkr4 Rp1 ... Lypla1 Tcea1
rowData names(0):
colnames(358): ERR525589Aligned ERR525592Aligned ... SRR1044043Aligned
  SRR1044044Aligned
colData names(3): label.main label.fine label.ont

table(mouseRNASeq$label.main)

       Adipocytes        Astrocytes           B cells    Cardiomyocytes 
               13                27                 5                 8 
  Dendritic cells Endothelial cells  Epithelial cells      Erythrocytes 
                2                12                 2                 3 
      Fibroblasts      Granulocytes       Hepatocytes       Macrophages 
               45                15                 4                32 
        Microglia         Monocytes           Neurons          NK cells 
               72                 6                64                18 
 Oligodendrocytes           T cells 
               12                18

table(mouseRNASeq$label.fine)

           Adipocytes                 aNSCs            Astrocytes 
                   13                     8                    24 
 Astrocytes activated               B cells        Cardiomyocytes 
                    3                     5                     8 
      Dendritic cells     Endothelial cells             Ependymal 
                    2                    12                     2 
         Erythrocytes           Fibroblasts Fibroblasts activated 
                    3                    27                     9 
Fibroblasts senescent          Granulocytes           Hepatocytes 
                    9                    15                     4 
          Macrophages Macrophages activated             Microglia 
                   26                     6                    56 
  Microglia activated             Monocytes               Neurons 
                   16                     6                    39 
    Neurons activated              NK cells                  NPCs 
                    4                    18                    11 
     Oligodendrocytes                  OPCs                 qNSCs 
                   10                     2                     2 
              T cells 
                   18

For many of the celldex datasets, there are two separate categories of labels. The main labels capture the family of cell types available (e.g. T cells), while the fine category captures the specific subtypes (e.g. Cd4 effector T cells). The references in celldex are species-specific, which has to be accounted for in any reference-based cell type annotation.

Your annotation is only as good as your reference!

In many cases, one of the default reference datasets from celldex will be a sufficient starting point. However, some studies will require more specific cell types than what can be retrieved from celldex. In these cases, one may consider retrieving a dataset from the Bioconductor package scRNAseq or from the Chan-Zuckerberg Biohub, which has collected and annotated cell types from multiple organs from both mouse (Tabula muris) and human (Tabula sapiens) samples.

Version 1: Cell-level cell type annotation

When using SingleR, the 3 primary parameters are the experimental dataset, the reference dataset, and the labels being used. Continuing with the main labels of the MouseRNASeq dataset on the full dataset looks like this:

annot = SingleR(test=adp.sce, ref=mouseRNASeq, labels=mouseRNASeq$label.main)
head(annot)

DataFrame with 6 rows and 4 columns
                                               scores      labels delta.next
                                             <matrix> <character>  <numeric>
D10_AAACCCAAGATGCTTC-1 0.523325:0.275539:0.210645:...  Adipocytes  0.2110472
D10_AAACCCAAGCTCTTCC-1 0.499954:0.250459:0.226184:... Fibroblasts  0.1624889
D10_AAACCCAAGTCATCGT-1 0.474176:0.213157:0.189314:... Fibroblasts  0.3422910
D10_AAACCCAGTAGCGTCC-1 0.449555:0.245758:0.228256:... Fibroblasts  0.0501651
D10_AAACCCAGTGCACGCT-1 0.398923:0.179533:0.207357:... Fibroblasts  0.0527357
D10_AAACCCATCCACTGAA-1 0.500808:0.250242:0.187568:... Fibroblasts  0.0797964
                       pruned.labels
                         <character>
D10_AAACCCAAGATGCTTC-1    Adipocytes
D10_AAACCCAAGCTCTTCC-1   Fibroblasts
D10_AAACCCAAGTCATCGT-1   Fibroblasts
D10_AAACCCAGTAGCGTCC-1   Fibroblasts
D10_AAACCCAGTGCACGCT-1   Fibroblasts
D10_AAACCCATCCACTGAA-1   Fibroblasts

Tip

This particular step can take a while, since it compares every cell in the dataset against every cell type transcriptome in the reference. One suggestion from the developers is to divide the dataset and run the annotation on each subset independently (in a process called "scatter and gather"). A version of this that works particularly well is to run the cell type annotation on each sample prior to combining the dataset. This isn't as much a problem with this toy dataset, but will become rather pronounced as data sets increase beyond \~75k cells or 10 samples.

The results from SingleR are returned as a dataframe:

scores: Correlation scores for each cell against each of the reference columns. Accessed as annot$scores in this case
labels: Initial labels based on highest scoring cell type annotation
delta.next: The difference in scores between the highest and second-highest scoring annotations
pruned.labels: SingleR determines if a label should be discarded because the delta.next value is too small, i.e. the label can be ambiguous. Any discarded labels will be replaced with NA.

Here, the pruned.labels are added to the meta.data

table(annot$pruned.labels, useNA="ifany") #useNA can be used turned on in the `table` function

       Adipocytes Endothelial cells       Fibroblasts       Macrophages 
            13408               103             27692               480 
             <NA> 
              431

adp$mouseRNASeq.main=annot$pruned.labels

This is actually a great example of the warning above. The dataset is specifically adipocytes, and should only have a smattering of alternative cell types. However, a number of the cells are being returned as a fibroblasts. Visualizing the two sets of annotations reveals that many of the cells identified as proliferating adipocytes or preadipocytes have been labeled as fibroblasts. Interestingly, the two clusters that are more separate from the rest of the main body received alternative annotations as endothelial cells and macrophages, which might be more accurate than the labels that were "forced" into the adipocyte pigeonholes.

annotFig1 = DimPlot(adp,group.by="adipocyte",label=T)+NoLegend()
annotFig2 = DimPlot(adp,group.by="mouseRNASeq.main")

annotFig1|annotFig2

Version 2: Cluster-level cell type annotation

Much like pseudobulk differential expression, the RNA expression can be collapsed into pre-defined components, such as the clusters, if it is believed that cell-to-cell variation is inducing too much confusion in the labeling. This collapsing can be performed using either the AggregateExpression or AverageExpression functions, as seen previously. The resulting matrix can then be fed into SingleR to produce the cluster-based annotations.

Idents(adp)="SCT_snn_res.0.1" #Assign clusters as the identities
avgExp = AverageExpression(adp, assays="SCT")$SCT #Run AverageExpression on the SCT assay and return only SCT
clustAnnot=SingleR(test=avgExp, ref=mouseRNASeq, labels=mouseRNASeq$label.main) #Run SingleR on the averaged expression matrix
clustAnnot

DataFrame with 8 rows and 4 columns
                           scores      labels delta.next pruned.labels
                         <matrix> <character>  <numeric>   <character>
g0 0.769608:0.406627:0.311650:... Fibroblasts  0.0845489   Fibroblasts
g1 0.814336:0.425592:0.304490:... Fibroblasts  0.1485983   Fibroblasts
g2 0.759363:0.389728:0.317296:... Fibroblasts  0.0755571   Fibroblasts
g3 0.821708:0.418202:0.304515:...  Adipocytes  0.0533554    Adipocytes
g4 0.834166:0.416754:0.310953:...  Adipocytes  0.1534154    Adipocytes
g5 0.728947:0.368267:0.359848:... Fibroblasts  0.0653320   Fibroblasts
g6 0.570283:0.210928:0.542144:... Macrophages  0.1005743   Macrophages
g7 0.828192:0.385496:0.314414:...  Adipocytes  0.0671988    Adipocytes

As before, most of the labels are assigned to either fibroblasts and adipocytes. The data takes a little massaging to assign to the scRNASeq metadata, but it is doable.

clustLabels = as.vector(clustAnnot$pruned.labels) #retrieve only the cluster-derived annotations
names(clustLabels) = c(0:7) #assign the cluster numbers as the annotations
clustLabels.vect = clustLabels[match(adp$SCT_snn_res.0.1, names(clustLabels))] #match the cluster identities per cell in the Seurat data to the cluster labels
names(clustLabels.vect) = colnames(adp) #ensure that the cluster identities are assigned the cell names
adp$mouseRNASeq.main.clust = clustLabels.vect #add the cluster annotations to the vector

clustAnnotFig1 = DimPlot(adp,group.by="SCT_snn_res.0.1",label=T)+NoLegend()
clustAnnotFig2 = DimPlot(adp,group.by="adipocyte",label=T)+NoLegend()
clustAnnotFig3 = DimPlot(adp,group.by="mouseRNASeq.main")
clustAnnotFig4 = DimPlot(adp,group.by="mouseRNASeq.main.clust")

(clustAnnotFig1|clustAnnotFig2)/(clustAnnotFig3|clustAnnotFig4)

Generally, the annotation between the per-cell annotation coincides with the per-cluster annotation. This image is also somewhat cleaner, as the cell-to-cell variation is superseded into the clusters. However, this also provides an example where this "collapsing" process can lead to smaller populations being overlooked: The cluster that was labeled as endothelial cells (at \~ [-10,-8]) was relabeled as fibroblasts, due to its initial inclusion as part of cluster 5.

5. Conclusions

Single cell RNASeq is a remarkably powerful tool for analyzing populations of cells that can be recovered from various experiments. Clustering and cell type annotation can be used to distinguish different populations with a level of specificity not otherwise available in bulk sequencing methods. Differential expression can then be applied to identify critical genes that define and classify the cell types. Inversely, differential expression can then be used to identify and classify the subpopulations, especially where cell type annotation falls short of the known biology. In either case, allow the biology to influence the bioinformatics, and use the bioinformatics to expand the biology.

Happy (gene) hunting!

Acknowledgements:

The following sources inspired this content:

This is only a small subset of tools available to single cell RNASeq. For further information and tools, please feel free to reach out to CCBR or BTEP.