Getting Started with Seurat: QC to Clustering

Learning Objectives

This tutorial was designed to demonstrate common secondary analysis steps in a scRNA-Seq workflow. We will start with a merged Seurat Object with multiple data layers representing multiple samples.

Throughout this tutorial we will

  1. Apply quality control parameters to retain only high quality cells
  2. Normalize and scale the data
  3. Apply dimensional reduction
  4. Perform Clustering

This tutorial largely follows the standard unsupervised clustering workflow by Seurat with slight deviations and a different data set.

Warning

This document in no way represents a complete analysis. Here, we are demonstrating many of the standard steps in Seurat. Results here would be considered preliminary, requiring greater care and consideration of biology. We are not reproducing published results.

The data

In this tutorial, we will continue to use data from Nanduri et al. 2022, Epigenetic regulation of white adipose tissue plasticity and energy metabolism by nucleosome binding HMGN proteins, published in Nature Communications.

As a reminder, this study examined "the role of HMGNs in white adipocyte browning by comparing wild-type (WT) mice and cells to genetically derived mice and cells lacking the two major members of the HMGN protein family, HMGN1 and HMGN2 (DKO mice)." The experimental design included WT and DKO mice sampled at two time points, day 0 and day 6. At day 0, cells were in a preadipocyte state, while at day 6 they had differentiated into adipocytes. Each time point included two replicates of each condition (WT vs DKO), for a total of 8 samples.

An R project directory containing these data and associated files can be downloaded here.

Consider the biology in your analysis

The biology of your experiment will inform your analysis. While scRNA-Seq can be exploratory, you should have some general idea of the cell types that you expect. Knowing some characteristics of the expected cells (for example, cell size, transcriptional complexity, or how energetically active they are) is helpful when processing the data.

For example, in this experiment, we expect cells transitioning from preadipocytes to adipocytes and then to beige adipocytes.

Load the packages

library(tidyverse) # dplyr and ggplot2
library(Seurat) # Seurat toolkit
library(hdf5r) # for data import
library(patchwork) # for plotting
library(presto) # for differential expression
library(glmGamPoi) # for sctransform
Warning: package 'glmGamPoi' was built under R version 4.3.3

Load the Seurat Object

Here, we will start with the data stored in a Seurat object. For instructions on data import and creating the object, see an Introduction to scRNA-Seq with R (Seurat).

adp <- readRDS("../outputs/merged_Seurat_adp.rds")

Examine the object

glimpse(adp)
Formal class 'Seurat' [package "SeuratObject"] with 13 slots
  ..@ assays      :List of 1
  .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
  ..@ meta.data   :'data.frame':    50429 obs. of  6 variables:
  .. ..$ orig.ident  : chr [1:50429] "D10" "D10" "D10" "D10" ...
  .. ..$ nCount_RNA  : num [1:50429] 11188 32832 28515 18954 27181 ...
  .. ..$ nFeature_RNA: int [1:50429] 3383 5823 5002 4305 4586 1030 5015 3927 4006 5489 ...
  .. ..$ condition   : chr [1:50429] "DKO" "DKO" "DKO" "DKO" ...
  .. ..$ time_point  : chr [1:50429] "Day 0" "Day 0" "Day 0" "Day 0" ...
  .. ..$ cond_tp     : chr [1:50429] "DKO Day 0" "DKO Day 0" "DKO Day 0" "DKO Day 0" ...
  ..@ active.assay: chr "RNA"
  ..@ active.ident: Factor w/ 8 levels "D10","D16","D20",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:50429] "D10_AAACCCAAGATGCTTC-1" "D10_AAACCCAAGCTCTTCC-1" "D10_AAACCCAAGTCATCGT-1" "D10_AAACCCAGTAGCGTCC-1" ...
  ..@ graphs      : list()
  ..@ neighbors   : list()
  ..@ reductions  : list()
  ..@ images      : list()
  ..@ project.name: chr "Adipose"
  ..@ misc        : list()
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 5 0 1
  ..@ commands    : list()
  ..@ tools       : list()

Quality Control

The goal of quality control is to keep only high quality cells (i.e., remove low quality cells (dead or dying cells), cell-free RNA, or doublets). Low quality cells will impact downstream analyses.

Take care when filtering. Heavy filtering may result in the loss of rare cell populations, while minimal filtering may impact our ability to annotate cell populations downstream. Best practices suggest being more permissive, i.e., filtering fewer cells.

Should quality control be assessed per sample?

Quality control should be assessed on a per-sample basis. Here, we will look at quality across all samples and then decide whether thresholds need to be applied independently. If samples display the same distributions of quality metrics, it is likely fine to apply the same thresholds across all samples. If they do not, you may lose valuable information by processing the samples together with the same thresholds.

QC metrics

There are several metrics that can be used to assess overall quality. The base workflow from Seurat suggests the following:

  • nCount_RNA - the absolute number of RNA molecules (UMIs) per cell (i.e., count depth). Each unique RNA molecule (non-PCR duplicates) will have its own Unique Molecular Identifier (UMI).

    • high total count - potential doublets or multiplets
    • low total count - potential ambient mRNA (not real cells)
    • Cell Ranger threshold set at 500 UMIs
  • nFeature_RNA - number of genes expressed (detected) per cell.

    • high number of detected genes - potential doublets or multiplets
    • low number of detected genes - potential ambient mRNA (not real cells)
  • Percent mitochondrial (percent.mt) - the percentage of counts from mitochondrial genes per cell

    • high mtDNA - cellular degradation

In general, low quality cells and empty droplets will have few genes (low nFeature_RNA, low nCount_RNA), whereas doublets and multiplets will exhibit high nFeature_RNA and high nCount_RNA. Low quality or dying cells will also have high percent.mt.

BUT

There are also biological explanations for some observed patterns.

Cells with a relatively high fraction of mitochondrial counts might for example be involved in respiratory processes and should not be filtered out. Whereas, cells with low or high counts might correspond to quiescent cell populations or cells larger in size. --- Single Cell Best Practices

Doublets and multiplets are also not as easy to predict. There are dedicated tools for this purpose (e.g., DoubletFinder, scDblFinder, scrublet (Python package)), which you may want to integrate into your workflow. Eliminating ambient RNA is also not as straightforward as it may seem. Ambient RNA can be included in GEMs along with intact cells. There are tools that can be used to predict and remove ambient RNA (e.g., SoupX and DecontX).
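
As a rough illustration only, a doublet-calling step with scDblFinder might be wired in along these lines (a minimal sketch, not run here; it assumes scDblFinder and SingleCellExperiment are installed and that the per-sample count layers are joined before conversion):

# Sketch (not run): flag likely doublets with scDblFinder
library(scDblFinder)

# join the per-sample count layers before converting (Seurat v5 objects)
sce <- as.SingleCellExperiment(JoinLayers(adp))
sce <- scDblFinder(sce, samples = "orig.ident") # run per sample

# copy the calls back into the Seurat metadata for inspection
adp$scDblFinder.class <- sce$scDblFinder.class
table(adp$scDblFinder.class)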

Other QC Indicators

There are additional quality metrics that can be used for initial cell quality filtering (e.g., complexity/novelty score, percent ribosomal), as sketched below.
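
For illustration, these metrics could be added to the metadata along these lines (a minimal sketch; neither metric is used for filtering in this tutorial, and log10GenesPerUMI is simply a convenient name for the novelty score):

# Sketch: novelty/complexity score (genes detected per UMI, log10 scale)
adp[["log10GenesPerUMI"]] <- log10(adp$nFeature_RNA) / log10(adp$nCount_RNA)

# Sketch: percent ribosomal, using the mouse ribosomal protein prefixes Rps/Rpl
adp[["percent.ribo"]] <- PercentageFeatureSet(adp, pattern = "^Rp[sl]")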

QC metrics are stored as metadata

The quality metric information is stored in the metadata (adp@meta.data), so we really only need to work with the metadata to get an idea of the thresholds we want to set for quality control. We will extract this information after we calculate the percentage of counts from mitochondrial genes.

nCount_RNA and nFeature_RNA are already available to view, so we need to add the percentage of counts that map to the mitochondrial genome.

QC can be an iterative process

This entire workflow is exploratory in nature and can be highly iterative. You may need to adjust thresholds after performing downstream steps.

Add percent mitochondrial

To calculate percent.mt, we use PercentageFeatureSet(), which calculates the percentage of all the counts belonging to a subset of the possible features (e.g., mitochondrial genes, ribosomal genes) for each cell.

adp[["percent.mt"]] <- PercentageFeatureSet(adp, pattern = "^mt-")
Mitochondrial pattern changes based on species

The pattern used here is species dependent. pattern = "^MT-", works for human genomes, whereas pattern = "^mt-" works for mouse.

mt_genes <- grep("^mt-", rownames(adp[["RNA"]]$counts.W10), value = TRUE)
mt_genes
[1] "mt-Nd1"  "mt-Nd2"  "mt-Co1"  "mt-Co2"  "mt-Atp8" "mt-Atp6" "mt-Co3" 
[8] "mt-Nd3"  "mt-Nd4l" "mt-Nd4"  "mt-Nd5"  "mt-Nd6"  "mt-Cytb"

Extract the metadata

metadata <- adp@meta.data

QC assessment

The quality of the cells should be assessed considering the above metrics jointly and not simply in isolation.

QC can be applied manually by subjectively choosing and applying thresholds, which can be quite arbitrary. Alternatively, QC can be applied automatically using adaptive thresholds such as the median absolute deviation (MAD), where outliers are generally flagged at 3 or 5 MADs from the median. You can find more information on adaptive thresholds here.
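
To make this concrete, a per-sample adaptive lower bound on count depth might be computed along these lines (a minimal sketch, not applied in this tutorial; the scuttle/scater function isOutlier() provides a ready-made implementation):

# Sketch: per-sample adaptive floor for count depth, 3 MADs below the
# median on the log10 scale; cells below this value would be flagged
mad_floor <- metadata |>
  group_by(orig.ident) |>
  summarise(count_floor = 10^(median(log10(nCount_RNA)) - 3 * mad(log10(nCount_RNA))))
mad_floor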

Here, we are going to start with applying thresholds based on a subjective visual assessment of quality metrics.

Step 1: Visualize the data

There are several built in functions for visualizing data with Seurat. We can use violin plots and scatter plots to check out the individual distributions and correlations between metrics.

nCount_RNA

Look at the distribution of nCount_RNA with a Violin plot:

# set colors
cnames<-setNames(rep(c("cyan3","darkgoldenrod1"),each=4),levels(factor(metadata$orig.ident))) 

# plot
VlnPlot(adp, features = "nCount_RNA", layer="counts", group.by="orig.ident",raster=FALSE,alpha=0.2) +
  scale_fill_manual(values=cnames) 

A violin plot shows the distribution of a numeric variable across categorical variables of interest; it is a hybrid between a box plot and a density plot. The wider sections on the plot are associated with increased probabilities of data points being found at that location in the distribution. VlnPlot, a Seurat function that uses the Seurat object directly, overlays the individual data points to ease the interpretation of the figure.

Here, we can see two regions of increased density within the distribution, one close to zero and another from roughly 10,000 to 30,000 UMIs.

Notice that we can modify Seurat visualizations using ggplot2 layers, or we can explore the data with greater flexibility by using ggplot2 directly.

Use ggplot2 directly and check the distributions:

metadata %>% 
    ggplot(aes(color=orig.ident, x=nCount_RNA, fill= orig.ident)) + 
    geom_density(alpha = 0.2) + 
    theme_classic() +
    scale_x_log10() + 
    geom_vline(xintercept = 650,color="red",linetype="dotted")

Using a density plot directly, we can see that the samples at time point 0 are different from those at time point 6. We should always be thinking about whether these differences are biological or technical in nature.

In this case, it is a bit of both. Samples from time point 6 exhibited diminished quality but also represent different biology.

Number of Genes

Now, let's take a look at the number of detected features.

VlnPlot(adp, features = "nFeature_RNA", group.by="orig.ident") +
  scale_fill_manual(values=cnames) 
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

ggplot(metadata, aes(x=nFeature_RNA,fill=orig.ident)) +
   geom_density(alpha = 0.2) + 
    theme_classic() +
    scale_x_log10() + 
    geom_vline(xintercept = 350,color="red",linetype="dotted")

Percent mitochondrial
VlnPlot(adp, features = "percent.mt", group.by="orig.ident") +
  scale_fill_manual(values=cnames) +
  geom_hline(yintercept=10,color="red")
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

ggplot(metadata, aes(x=percent.mt,fill=orig.ident)) +
   geom_density(alpha = 0.2) + 
  scale_x_log10()+
    theme_classic()  
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning: Removed 13 rows containing non-finite outside the scale range
(`stat_density()`).

We can see that the mitochondrial content is higher at the second time point. This could be due to technical challenges and cellular degradation, or to biology, in this case white adipocyte browning. Beige adipocytes have a higher potential for mitochondrial respiration, so this is not unexpected.

Examine these metrics together

To look at how these metrics correlate, we can use FeatureScatter(), which can be used to visualize feature-feature relationships and also be applied to other data stored in our Seurat object (e.g., metadata columns, PC scores).

FeatureScatter(adp, feature1 = "percent.mt", feature2 ="nFeature_RNA" , group.by="orig.ident",split.by="time_point") 

FeatureScatter(adp, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",group.by="orig.ident",split.by="time_point",log=TRUE)

Again, we can also use ggplot2 directly.

ggplot(metadata) +
  geom_point(aes(x=nCount_RNA,y=nFeature_RNA,fill=percent.mt > 10),shape=21,alpha=0.4) + 
  theme_classic() +
  scale_x_log10()+
  scale_y_log10()+
  facet_grid(.~cond_tp) +
  geom_vline(xintercept = 650,color="red",linetype="dotted")+
  geom_hline(yintercept=350,color="red", linetype="dotted")

ggplot(metadata) +
  geom_point(aes(x=nCount_RNA,y=nFeature_RNA,fill=percent.mt > 25),shape=21,alpha=0.4) + 
  theme_classic() +
  scale_x_log10()+
  scale_y_log10()+
  facet_grid(.~cond_tp) +
  geom_vline(xintercept = 650,color="red",linetype="dotted")+
  geom_hline(yintercept=350,color="red", linetype="dotted")

Warning

Again, quality filtering should be done at a per sample level. Here, there are different distributions by time point, which we may want to consider when filtering.

Is it possible to introduce bias when applying quality filtering on a per sample basis?

Yes, it's possible. There can be variation in quality between samples, as seen here. However, by not adjusting thresholds to each sample, you can just as easily introduce bias by keeping or removing different populations of cells in each sample. The best-case scenario is that all samples look similar and you can apply a standard filtering threshold across all of them.

Step 2: Decide on thresholds and apply filtering

Because the samples are stored in separate layers, you will need to work around the object structure to apply different thresholds by group or sample. The easiest way to do this is to work with the metadata and use the cell barcodes for filtering with Seurat::subset().

# Set one set of parameters for Day 0 samples; 
# keep the rownames (Cell barcodes)
t0<- metadata |> filter(time_point=="Day 0", nFeature_RNA > 350, 
                        nCount_RNA >650, percent.mt <10 ) |> 
  rownames_to_column("Cell") |> pull(Cell) 

# Set an alternative set of thresholds for Day 6 samples; 
# keep the rownames (Cell barcodes)
t6<- metadata |> filter(time_point=="Day 6", nFeature_RNA > 350, 
                        nCount_RNA >650, percent.mt <25 ) |> 
  rownames_to_column("Cell") |> pull(Cell)

keep<-c(t0,t6)
What is |>?

The |> operator is the native R pipe (available since R 4.1). It passes the result on its left as the first argument of the function on its right. See here.

Step 3: Filter

Now, we can either use our filtering parameters directly with subset() or provide a cells argument.

# use different parameters; established above
adp_filt<-subset(adp, cells=keep)

You do not need to run this:

# OR use the same parameters across all samples
adp_filt<-subset(adp, nFeature_RNA > 350 & 
                        nCount_RNA >650 & percent.mt <10 )

Note

This is not the only way to filter a Seurat object. R usually offers several ways to accomplish the same task. Feel free to explore your options.

Once you have subset your object, it is recommended to run through these plots again. We are skipping this here.
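
If you do want to re-check, the same plots can simply be pointed at the filtered object, for example:

VlnPlot(adp_filt, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"),
        group.by = "orig.ident", alpha = 0.2)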

Want more information about quality filtering?

Looking for more information on quality control filters? Here is a useful guide from 10X genomics.

Save the quality filtered object

I recommend saving the filtered object to work with downstream. This way you can return to this point should you need to.

saveRDS(adp_filt,"../outputs/adp_merge_filt.rds")

Normalization, find variable features, and scale

Standard Workflow

Normalize

The total number of detected transcripts expressed in a cell is dependent on the amount of mRNA in a cell. Cells naturally vary in the total amount of mRNA expressed. However, the chemistry of the assay introduces technical noise that results in variation in the counts of mRNA even between identical cells. Sources of technical variation include differences in cell lysis, reverse transcription efficiency, and stochastic molecular sampling during sequencing, among others. Because differences in observed expression result from a combination of biological and technical variation, we need to minimize or remove the impact of technical effects in order to compare expression levels across cells.

The standard normalization method in Seurat is a global-scaling normalization method (?NormalizeData(), default LogNormalize), which transforms the feature counts for each cell by dividing by the total counts of that cell, multiplying by a scale factor (default = 10,000), and then applying a natural log transformation with a pseudocount of 1. This should help reduce variation from sequencing depth and somewhat prevent high abundance genes from dominating downstream analyses.
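
For intuition, here is what that formula looks like when applied by hand to a single cell (a sketch only; NormalizeData() below does this for every cell and stores the result in the data layers):

# Sketch: the LogNormalize formula applied to one cell's raw counts
x <- adp_filt[["RNA"]]$counts.W10[, 1] # raw counts for the first W10 cell
lognorm_x <- log1p(x / sum(x) * 10000) # should match the data layer after NormalizeData()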

Find Variable features (FindVariableFeatures)

Following normalization, the next step is to find variable features. In most scRNA-seq experiments only a small proportion of the genes will be informative and biologically variable. A subset of features with high cell-to-cell variation can be selected using the variance-stabilizing transformation ("vst") method of FindVariableFeatures(). This uses a mean-variance relationship to return 2,000 variable features (by default) to prioritize for downstream analyses. You can read more about the method here.

ScaleData

It is next standard to scale and center the features in the data set prior to dimension reduction or visualization via heatmap. Scaling the data will keep highly expressed genes from dominating our analysis. This step is done using ScaleData(), and by default, this is only applied to the variable features found above.

Warning

Scaling only the top variable features could result in warnings later when attempting to visualize differentially expressed genes; this is also true of SCTransform. Note: it is also possible to apply ScaleData to the entire data set.

Scaling the data

Shifts the expression of each gene, so that the mean expression across cells is 0.
Scales the expression of each gene, so that the variance across cells is 1.--- Seurat Vignette.

ScaleData() can also be used to remove unwanted sources of variation (e.g., cell cycle or percent mitochondrial) using the argument vars.to.regress. There is a great tutorial from the Harvard Chan Bioinformatics Core demonstrating how you can explore these unwanted sources prior to deciding whether they should be mitigated.
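
For example, cell cycle effects are typically scored with CellCycleScoring() and the scores are then regressed out during scaling. A minimal sketch, not run here: Seurat's built-in cc.genes list holds human gene symbols, so for this mouse data set the lists would first need to be converted to mouse symbols (conversion not shown), and scoring should be run after normalization.

# Sketch (not run): score cell cycle phases, then regress the scores out
adp_filt <- CellCycleScoring(adp_filt,
                             s.features   = cc.genes$s.genes,   # human symbols; convert for mouse
                             g2m.features = cc.genes$g2m.genes)
adp_filt <- ScaleData(adp_filt, vars.to.regress = c("S.Score", "G2M.Score"))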

Standard protocol: NormalizeData, FindVariableFeatures, and ScaleData
#normalize
adp_filt <- NormalizeData(adp_filt, 
                          normalization.method = "LogNormalize", 
                          scale.factor = 10000)

# find variable features
adp_filt <- FindVariableFeatures(adp_filt, selection.method = "vst",
                                nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(adp_filt), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(adp_filt)
LabelPoints(plot = plot1, points = top10, repel = TRUE)


adp_filt <- ScaleData(adp_filt, vars.to.regress = "percent.mt")

Normalized counts are stored in data layers; for example, we can access the log-normalized counts for W10 using adp_filt[["RNA"]]$data.W10. Scaled values from ScaleData() are stored in adp_filt[["RNA"]]$scale.data.

SCTransform

Rather than relying on the above steps (NormalizeData(), FindVariableFeatures(), and ScaleData()), we are going to proceed with a newer method (SCTransform) instead. This method uses Pearson residuals for transformation, which better accounts for the overall distribution of expression. It assigns greater weight to lowly expressed genes that may exhibit cell type specific expression, rather than to highly expressed genes with broad expression. Read more about the method here.

Things to know

  • Transformed data will be available in the SCT assay, which is set as the default after running sctransform.

  • During normalization, we can also remove confounding sources of variation, for example, mitochondrial mapping percentage

  • The glmGamPoi package substantially improves speed and is used by default if installed, with instructions here --- SCTransform vignette

Also, by default, the scaled data layer (adp_filt[["SCT"]]$scale.data) only contains variable features (3,000). This saves substantial memory.

Let's run the function. Notice that we are regressing out the effect of percent mitochondrial. Of note, if we think differences in mitochondrial gene expression are biological, we may not want to regress them out, as that signal could help with clustering. We can always repeat steps as needed. This analysis can be highly iterative.

adp_filt <- SCTransform(adp_filt, vars.to.regress = "percent.mt", verbose = FALSE)

At this point, the active.assay for adp_filt is now "SCT".

Default Assays

It is important to know the arguments for each function used. You can switch between assays, for example, setting the default assay to "RNA", using:

#Check default assay
DefaultAssay(object = adp_filt) 

#Set default assay
DefaultAssay(object = adp_filt) <- "RNA"
Confused about normalization?

I recommend this excellent high level overview available by Florian Wagner on YouTube.

There is also a normalization guide by 10X genomics.

Linear dimension reduction

The role of cell clustering is to identify cells with similar transcriptomic profiles by computing Euclidean distances across genes. However, scRNA-Seq experiments produce high-dimensional data, with each cell associated with the expression of thousands of genes, which are impacted by both biological and technical factors.

Principal component analysis (PCA) is a linear dimension reduction method applied to high-dimensional data. The goal of PCA is to reduce the dimensionality of the data by transforming the data in a way that maximizes the variance explained. The first PC explains the most variance, followed by the second PC, and so on.

To overcome the extensive technical noise in any single feature (gene) for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset. --- Seurat vignette.

Linear dimension reduction via PCA thus reduces noise and speeds up clustering. Learn more about PCA here and here.

We run PCA using RunPCA() on our SCTransform-normalized data.

adp_filt <- RunPCA(adp_filt, verbose = FALSE,assay="SCT")

Important

Use scaled data here.

After we run PCA, we need to decide which PCs to include in downstream analyses. We want to include enough PCs to retain the biological signal, but few enough PCs to avoid interference by noise in the data.

Exploring the PCA results

There are a number of ways to explore the PCA results. Two of the more useful visualizations include the DimHeatmap() and ElbowPlot().

DimHeatmap() allows us to visualize the top genes contributing to each PC. Both cells and genes are sorted by their principal component scores. By default, 30 genes are plotted per PC; with balanced = TRUE (as used below), these are split between the genes with the highest and lowest loadings.

Here you can see how well a PC separates populations of cells.

# dimensions 1 to 9
DimHeatmap(adp_filt, dims = 1:9, cells = 500, balanced = TRUE,ncol=3)

#dimensions 20 to 30
DimHeatmap(adp_filt, dims = 20:30, cells = 500, balanced = TRUE,ncol=3)

Complementing the above heatmap, we can also create an elbow plot, which ranks the principal components by the percentage of variance explained by each one.

ElbowPlot(adp_filt, ndims = 40)

We can see that the amount of variance explained tapers off around PC 30. Perhaps this is a good starting point.

Of note, the Seurat developers indicate that it is safer to use a greater number of PCs when using SCTransform, which is more efficient at eliminating technical effects.

Check out the Seurat - Guided Clustering Tutorial for more examples of visualizing PCA results.

Clustering

Clustering is used to group cells by similar transcriptomic profiles. Seurat uses a graph based clustering method. You can read more about it here.

The first step is to compute the nearest neighbors of each cell based on similarity. This is done using FindNeighbors(), which will construct a KNN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). Therefore, an important argument is dims, the PCA dimensions we would like to use.

This then informs FindClusters(), which uses an unsupervised learning technique (modularity optimization with the Louvain algorithm) to group multiple data points into clusters based on their similarities. Importantly, this function includes a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters.

Choosing the "right" resolution

It is difficult to know whether the clustering is "correct". This will largely depend on your goals and your experience. You may want more or less resolution based on your scientific question. Are you interested in major types of cells, sub-types, cell states? Novel sub-types and cell states will require greater resolution.

Here, we keep the resolution fairly low to get a preliminary idea of the broad cell types. Feel free to play with the resolution to see how this impacts clustering (a sketch for comparing several resolutions follows the clustering output below). The Seurat developers suggest a resolution of 0.4-1.2 for data sets of ~3,000 cells.

adp_filt <- FindNeighbors(adp_filt, dims = 1:30)
Computing nearest neighbor graph
Computing SNN
adp_filt <- FindClusters(adp_filt, resolution = 0.1)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 42114
Number of edges: 1480064

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9546
Number of communities: 8
Elapsed time: 9 seconds

At this point, the active.ident is now the cluster assignments (stored in the seurat_clusters metadata column).
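
To compare granularities, FindClusters() accepts a vector of resolutions; each run adds a metadata column named after the graph and resolution. A minimal sketch, run on a copy (adp_res) so the clustering above is left untouched; adjust the column names to whatever your graph is called:

# Sketch: compute several resolutions on a copy and compare the resulting columns
adp_res <- FindClusters(adp_filt, resolution = c(0.1, 0.4, 0.8))
grep("_res\\.", colnames(adp_res@meta.data), value = TRUE)
# cross-tabulate two resolutions (column names follow the graph name)
table(adp_res$SCT_snn_res.0.1, adp_res$SCT_snn_res.0.4)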

Now we can visualize our clusters using a non-linear dimension reduction method. UMAP (uniform manifold approximation and projection) and t-SNE (t-distributed stochastic neighbor embedding) are common reduction / visualization techniques for single cell data sets. UMAP has recently become the gold standard for this type of analysis due to its computational efficiency and its ability to better maintain global structure; though, like t-SNE, it is most accurate at preserving local distances rather than global ones. Read more about non-linear methods for visualization here.

Here, we run UMAP to embed the complexity of the data within a two-dimensional plot.

adp_filt <- RunUMAP(adp_filt,dims = 1:30)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
21:10:28 UMAP embedding parameters a = 0.9922 b = 1.112
21:10:28 Read 42114 rows and found 30 numeric columns
21:10:28 Using Annoy for neighbor search, n_neighbors = 30
21:10:28 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
21:10:33 Writing NN index file to temp file /var/folders/hv/m8yfpqmn4sgfw9f0j7xfzjhc3fj3st/T//RtmpyM7lsj/file376eba89b0
21:10:33 Searching Annoy index using 1 thread, search_k = 3000
21:10:47 Annoy recall = 100%
21:10:49 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
21:10:51 Initializing from normalized Laplacian + noise (using RSpectra)
21:10:52 Commencing optimization for 200 epochs, with 1854092 positive edges
21:11:15 Optimization finished
DimPlot(adp_filt, reduction = "umap", group.by = c("orig.ident", "seurat_clusters","time_point","condition","cond_tp"),
        alpha=0.4, ncol=2)

These data do not exhibit batch effects by sample or condition, but show major differences by time point. This is expected given the biology of the samples.

Take a moment to save your object at this point.

saveRDS(adp_filt, "../outputs/adp_merge_filt_sctran_clust0.1.rds")

Note

This saved R object (adp_merge_filt_sctran_clust0.1.rds) will be used as the starting point in the next seminar.

A word about integration

Integration is the process of aligning the same cell types across samples, treatments, data sets, batches, etc. Clustering should represent biological differences and not technical artifacts. Integration is not always necessary. You should run through the standard workflow and decide whether integration is required post-clustering. If cells are clustering by cell type rather than other factors (e.g., sample), you can likely skip integration. See the integration vignette.

Here, we aren't integrating.
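
For reference, if integration did prove necessary, the Seurat v5 entry point is IntegrateLayers(). A minimal sketch for an SCT-normalized object like this one (not run; the method and reduction names below are illustrative choices, not the only options):

# Sketch (not run): CCA-based integration of an SCT-normalized object
adp_int <- IntegrateLayers(adp_filt,
                           method = CCAIntegration,
                           normalization.method = "SCT",
                           orig.reduction = "pca",
                           new.reduction = "integrated.cca")
# downstream FindNeighbors()/RunUMAP() would then use reduction = "integrated.cca"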

Finding Cluster Biomarkers

Now, that we have clusters, we can use differential expression analysis to uncover markers that define our clusters. These markers can be used to assign cell types to our clusters.

First, because we are working with a merged data set (with multiple SCT models), we need to run PrepSCTFindMarkers() prior to differential expression testing.

This allows us to use the ‘corrected counts’ that are stored in the data slot of the SCT assay. Corrected counts are obtained by setting the sequencing depth for all the cells to a fixed value and reversing the learned regularized negative-binomial regression model. PrepSCTFindMarkers() ensures that the fixed value is set properly. --- Archived Seurat Vignette

adp_filt<-PrepSCTFindMarkers(adp_filt,verbose=T)
Found 8 SCT models. Recorrecting SCT counts using minimum median counts: 8146

Once this is done, we can perform differential expression testing for each cluster compared to all other clusters using FindAllMarkers(). Setting only.pos = TRUE returns only marker genes with a positive avg_log2FC.

Note

There are many available options for the method used for differential expression testing. See the test.use parameter. By default, a Wilcoxon Rank Sum test is applied.

#requires presto installation to speed up the next step
# devtools::install_github('immunogenomics/presto')
library(presto)
# find all markers
adp_filt_markers <- FindAllMarkers(adp_filt, only.pos = TRUE)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
#ordering the results
adp_filt_markers <- adp_filt_markers %>% 
  arrange(cluster,desc(avg_log2FC), desc(p_val_adj))

#examine a small subset
adp_filt_markers %>%
    group_by(cluster) %>%
    slice_max(n = 5, order_by = avg_log2FC)
# A tibble: 41 × 7
# Groups:   cluster [8]
       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
 1 3.96e- 46       4.02 0.01  0.001 7.57e- 42 0       Erv3     
 2 2.29e-144       3.96 0.032 0.002 4.37e-140 0       Pdlim3   
 3 0               3.70 0.159 0.014 0         0       Mlana    
 4 1.47e-140       3.55 0.033 0.003 2.80e-136 0       Nat8f3   
 5 0               3.47 0.158 0.02  0         0       Scg2     
 6 0               3.50 0.257 0.044 0         1       Cxcl13   
 7 1.28e- 66       3.15 0.016 0.002 2.45e- 62 1       Dlx3     
 8 0               3.15 0.38  0.054 0         1       Serpina3m
 9 1.64e- 32       3.13 0.013 0.003 3.13e- 28 1       Pyy      
10 0               3.05 0.427 0.056 0         1       Scara5   
# ℹ 31 more rows

In the output of adp_filt_markers,

pct.1 is the percentage of cells in the cluster where the gene is detected, while pct.2 is the percentage of cells on average in all the other clusters where the gene is detected. A gene to be considered as an IDEAL cluster marker is expected to be expressed exclusively in that cluster and silenced in all others and thus pct.1 will be more towards 1 and pct.2 towards 0.---Biostars Post
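
Following that logic, one way to shortlist more specific markers is to filter on both columns; the cutoffs below are arbitrary and shown only as a sketch:

# Sketch: keep markers detected in most cells of their cluster but few others
adp_filt_markers %>%
  filter(pct.1 > 0.5, pct.2 < 0.2, p_val_adj < 0.05) %>%
  group_by(cluster) %>%
  slice_max(n = 5, order_by = avg_log2FC)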

We can use a heatmap to visualize potential marker genes.

top20<-adp_filt_markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 20) %>%
    ungroup() 
DoHeatmap(adp_filt, features = top20$gene) + NoLegend()
Warning in DoHeatmap(adp_filt, features = top20$gene): The following features
were omitted as they were not found in the scale.data slot for the SCT assay:
Gm13709, Vnn1, Igfals, Pla2g5, Sctr, C6, Spink13, Igsf21, Myl1, Atp1a4, Ptgds,
Gm37800, Mup3, 2010106E10Rik, Vnn3, Saa4, Rnase2b, Smim5, Spi1, C1qc, C1qa,
Mgl2, C1qb, Dock2, Ly9, Slamf7, Clec4e, Bcl2a1a, Bcl2a1d, Ly86, Fcrls, Bcl2a1b,
Tifab, Slc28a2, Cst7, Gm5150, Myo1g, Gm26740, Tie1, Icam2, Gimap8, Dapk2,
Zfp366, Mall, Gimap1, Kcne3, Gm14207, Gimap4, Gpihbp1, Gimap5, Emcn, Cldn5,
C1qtnf9, Nts, Gm32688, Myct1, Sp5, Sox17, Sox6os, Apoc3, Paqr9, Ucp1, Klb,
Thrsp, Gm45607, Acvr1c, Adora1, Acot5, Ffar4, Acot3, A530016L24Rik, Coro6,
A830018L16Rik, Olfr810, Amt, Glt1d1, Masp2, Gm17638, Gm43585, Gm43150,
4930565N06Rik, Gm17300, Cbln2, Snap25, Galnt15, Agtr2, Sptssb, Gm50114, Aox4,
Hcn4, Gm45924, Hmgcs2, Pyy, Dlx3, Slc22a18, Nat8f5, Cyb5r2, Cfi, Ppef1, Pdzd7,
Gm15866, Tnfsf8, Kcnma1, Sync, 2810433D01Rik, Mamdc2, Apba2, 8030423F21Rik,
Scg2, Nat8f3, Mlana, Pdlim3, Erv3

We can also use VlnPlot() and FeaturePlot() to compare marker expression. VlnPlot() shows expression probability distributions across clusters, and FeaturePlot() visualizes feature expression on a dimensional reduction plot (here, the UMAP).

In the case of this example data set, we can identify our clusters using published markers associated with the data (from the original publication).

#contaminants
contam=c("Ptprc", "Plp1", "Pecam1")
#beige adipocytes
beige=c("Ucp1","Ppargc1a","Elovl3","Cidea")
#preadipocytes
preadip<-c('Mmp3', 'Cd142', 'Itgb1')
#proliferating
prolif<-'Mki67'
#differentiating adipocytes
diffadip<-c('Col5a3', 'Serpina3n')

VlnPlot(adp_filt, features=contam)

FeaturePlot(adp_filt,features=contam)

For example, clusters 5 and 6 are likely contaminant cell populations, which could be removed.

  • CD45+ (gene = Ptprc)
  • Pecam+ (gene = Pecam1) = endothelial cells
  • Plp1+ = neuronal / glial
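
A quick way to review all of these marker sets across clusters at once is a dot plot built from the vectors defined above (a sketch; if a listed alias, e.g., Cd142, is not present in the object, substitute the corresponding gene symbol):

# Sketch: dot plot of the published marker sets, grouped by set
DotPlot(adp_filt,
        features = list(Contaminant     = contam,
                        Beige           = beige,
                        Preadipocyte    = preadip,
                        Proliferating   = prolif,
                        Differentiating = diffadip)) +
  RotatedAxis()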

The next seminar will discuss differential expression analysis in greater detail.

Cell Annotation

There are many tools and strategies available for cell annotation. I've included a non-comprehensive list for your convenience:

Here is a list of resources that also may be helpful from 10X genomics:

As well as this chapter from OSCA.

Acknowledgements:

The following sources inspired this content: