Core Steps in scRNA-seq Analysis: QC to Clustering

Modified

May 19, 2026

Learning Objectives

This tutorial continues from “scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat”, and uses the same lung scRNA-seq dataset and Seurat object structure.

Throughout this tutorial we will:

  1. Apply quality control filters to retain high-quality cells
  2. Normalize the data with SCTransform()
  3. Perform dimensionality reduction
  4. Cluster

This lesson follows the standard Seurat workflow for Seurat v5 layered assays but with a different dataset.

Warning

This tutorial demonstrates common pre-processing steps, not a complete analysis. All thresholds and clustering decisions shown here should be treated as starting points that need to be revisited in the context of the biology.

The Data

As in Part 1, we are working with a subset of the lung regeneration dataset from Niethamer et al. 2025, Longitudinal single-cell profiles of lung regeneration after viral infection reveal persistent injury-associated cell states.

The four samples used here are:

Sample Species Genotype Treatment Time Point Status
GSM8181589 (S89Inf42) Mouse Ki67-CreERT2; ROSA26-LSL-tdTomato Tamoxifen + Influenza Day 42 post-infection Lung (post-infection)
GSM8181591 (S91Hom42) Mouse Ki67-CreERT2; ROSA26-LSL-tdTomato Tamoxifen Homeostasis Day 42 Lung (homeostasis)
GSM8181593 (S93Hom3) Mouse Ki67-CreERT2; ROSA26-LSL-tdTomato Tamoxifen Homeostasis Day 3 Lung (homeostasis)
GSM8181599 (S99Inf42) Mouse Ki67-CreERT2; ROSA26-LSL-tdTomato Tamoxifen + Influenza Day 42 post-infection Lung (post-infection)

This subset includes lung tissue collected at two homeostasis time points (day 3 and day 42) and two samples from the post-infection condition at day 42. The two post-infection samples (S89Inf42 and S99Inf42) are biological replicates of the infected condition at day 42, enabling comparison of injury-associated lung cell states against homeostatic controls.

We should expect both:

  • technical differences between samples
  • genuine biological differences across treatment

This is why QC should be assessed per sample even though we are working with a merged object.

An R project directory containing these data and associated files can be downloaded here.

Load R Packages.

library(Seurat) # Seurat toolkit
library(tidyverse) # Data wrangling and plotting
library(hdf5r) # Data import
library(patchwork) # Plotting
library(scuttle) #Adding to our single cell toolkit with Bioconductor
library(glmGamPoi) # For SCTransform 
library(presto) # For FindAllMarkers  

Load the Seurat Object

In “scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat” we created and saved a merged Seurat object to ./outputs/merged_Seurat_lung.rds. Here, we will load and continue to work from this object.

lung <- readRDS("./outputs/merged_Seurat_lung.rds")

Examine the object:

lung
An object of class Seurat 
20329 features across 28494 samples within 1 assay 
Active assay: RNA (20329 features, 0 variable features)
 4 layers present: counts.S89Inf42, counts.S91Hom42, counts.S93Hom3, counts.S99Inf42
glimpse(lung)
Formal class 'Seurat' [package "SeuratObject"] with 13 slots
  ..@ assays      :List of 1
  .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
  ..@ meta.data   :'data.frame':    28494 obs. of  6 variables:
  .. ..$ orig.ident  : chr [1:28494] "S89Inf42" "S89Inf42" "S89Inf42" "S89Inf42" ...
  .. ..$ nCount_RNA  : num [1:28494] 5341 3896 9012 6752 1339 ...
  .. ..$ nFeature_RNA: int [1:28494] 2214 1513 2617 2595 201 3162 2097 1152 3149 3705 ...
  .. ..$ treatment   : chr [1:28494] "Tamoxifen_Influenza" "Tamoxifen_Influenza" "Tamoxifen_Influenza" "Tamoxifen_Influenza" ...
  .. ..$ time_point  : chr [1:28494] "Infection_Day42" "Infection_Day42" "Infection_Day42" "Infection_Day42" ...
  .. ..$ treatment_tp: chr [1:28494] "Tamoxifen_Influenza_Infection_Day42" "Tamoxifen_Influenza_Infection_Day42" "Tamoxifen_Influenza_Infection_Day42" "Tamoxifen_Influenza_Infection_Day42" ...
  ..@ active.assay: chr "RNA"
  ..@ active.ident: Factor w/ 4 levels "S89Inf42","S91Hom42",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:28494] "S89Inf42_AAACCCAAGCAACAGC" "S89Inf42_AAACCCACAACTAGAA" "S89Inf42_AAACCCACATCGCTGG" "S89Inf42_AAACCCACATGGAAGC" ...
  ..@ graphs      : list()
  ..@ neighbors   : list()
  ..@ reductions  : list()
  ..@ images      : list()
  ..@ project.name: chr "Lung"
  ..@ misc        : list()
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 5 4 0
  ..@ commands    : list()
  ..@ tools       : list()
Layers(lung)
[1] "counts.S89Inf42" "counts.S91Hom42" "counts.S93Hom3"  "counts.S99Inf42"
Note

In Seurat v5, merged RNA assays retain per-sample matrices as separate layers such as counts.S89Inf42 and counts.S91Hom42. That is why the previous seminar emphasized layers so heavily, and it is why this merged-object workflow still supports per-sample QC.

Quality Control

The goal of quality control is to keep only high quality cells (i.e., remove low quality cells (dead or dying cells), cell-free RNA, or doublets (droplets containing two cells)). Low quality cells will impact downstream analyses. The most common starting metrics are:

  • nCount_RNA - total UMI count per cell, representing the number of captured/countable transcript molecules detected in that cell.

    • high total count - potential doublets or multiplets
    • low total count - potential ambient mRNA (not real cells)
    • Cell Ranger threshold set at 500 UMIs
    NoteWhat is a UMI?

    Unique Molecular Identifiers (UMIs) are short barcode sequences added to each captured RNA molecule before amplification. They are used to identify PCR duplicates so sequencing reads from the same original molecule can be collapsed into one count.

  • nFeature_RNA - number of genes expressed (detected) per cell.

    • high number of detected genes - potential doublets or multiplets
    • low number of detected genes - potential ambient mRNA (not real cells)
  • Percent mitochondrial(percent.mt) - the fraction of reads from mitochondrial genes per cell

    • high mtDNA - cellular degradation

Broadly:

Low quality cells and empty droplets will have few genes (low n_feature_RNA, low nCount_RNA), whereas doublets and multiplets will exhibit high n_feature_RNA and high nCount_RNA. Low quality or dying cells will also have high percent_mt.

BUT

There are also biological explanations for some observed patterns.

Cells with a relatively high fraction of mitochondrial counts might for example be involved in respiratory processes and should not be filtered out. Whereas, cells with low or high counts might correspond to quiescent cell populations or cells larger in size. — Single Cell Best Practices

Therefore, it is important to know your biology, and have some idea of expected cell types.

TipOther QC Tools

Doublets and multiplets are not easy to predict. There are dedicated tools for this purpose (e.g., DoubletFinder, scDblFinder,scrublet (python package)), which you may want to integrate into your workflow. Eliminating ambient RNA is also not as straight forward as it may seem. Ambient RNA can be included in GEMs along with intact cells. There are tools that can be used to predict and remove ambient RNA (e.g., SoupX and DecontX).

Even in a merged object, QC should be evaluated by sample. Different samples can have different count depths or mitochondrial distributions, and some of those differences may reflect biology rather than technical failure.

Yes, it is possible to introduce bias when working with samples independently. There can be variation between the quality of different samples. By not adjusting to the specific sample, you can just as easily introduce bias by keeping or removing different populations of cells for each of the samples. Best case scenario is all samples look similar and you can employ a standard filtering threshold across all samples.

  • fraction of ribosomal protein genes - higher due to degradation or contamination
  • Complexity (Novelty Score) - ratio of genes detected to UMI - greater gene to UMI count equals higher complexity.

QC metrics are stored as metadata

The quality metric information is stored in the metadata (lung@meta.data or lung[[]]), so we only need to work with the metadata file to get an idea of the thresholds we want to set for quality control. We will extract this information after we calculate the percentage of mitochondrial genes.

nCount_RNA and nFeature_RNA are already available to view, so we need to add the percentage of reads that mapped to the mitochondrial genome.

WarningQC can be an iterative process.

Take care when filtering cells during QC. Heavy filtering may result in the loss of rare cell populations, while minimal filtering, may impact our ability to annotate cell populations downstream. It’s best to start with conservative QC filtering, which can then be reassessed following downstream steps.

Add percent mitochondrial

To calculate percent.mt, we use PercentageFeatureSet(), which calculates the percentage of all the counts belonging to a subset of the possible features (e.g., mitochondrial genes, ribosomal genes) for each cell.

For this mouse dataset, mitochondrial genes begin with mt-.

lung[["percent.mt"]] <- PercentageFeatureSet(lung, pattern = "^mt-")

For this data set we will not find this pattern present.

mt_genes<-grep("^MT-", rownames(lung[["RNA"]]$counts.S89Inf42),value = T)
mt_genes
character(0)

Extract metadata

Using lung[[]] is the most convenient v5-friendly way to work with cell metadata.

metadata <- lung[[]]
head(metadata)
                          orig.ident nCount_RNA nFeature_RNA
S89Inf42_AAACCCAAGCAACAGC   S89Inf42       5341         2214
S89Inf42_AAACCCACAACTAGAA   S89Inf42       3896         1513
S89Inf42_AAACCCACATCGCTGG   S89Inf42       9012         2617
S89Inf42_AAACCCACATGGAAGC   S89Inf42       6752         2595
S89Inf42_AAACCCAGTATGGAGC   S89Inf42       1339          201
S89Inf42_AAACCCAGTCATCACA   S89Inf42      14257         3162
                                    treatment      time_point
S89Inf42_AAACCCAAGCAACAGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCACAACTAGAA Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCACATCGCTGG Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCACATGGAAGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCAGTATGGAGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCAGTCATCACA Tamoxifen_Influenza Infection_Day42
                                                 treatment_tp percent.mt
S89Inf42_AAACCCAAGCAACAGC Tamoxifen_Influenza_Infection_Day42   4.755664
S89Inf42_AAACCCACAACTAGAA Tamoxifen_Influenza_Infection_Day42   1.899384
S89Inf42_AAACCCACATCGCTGG Tamoxifen_Influenza_Infection_Day42   4.627164
S89Inf42_AAACCCACATGGAAGC Tamoxifen_Influenza_Infection_Day42   5.687204
S89Inf42_AAACCCAGTATGGAGC Tamoxifen_Influenza_Infection_Day42  65.123226
S89Inf42_AAACCCAGTCATCACA Tamoxifen_Influenza_Infection_Day42   5.337729

QC assessment

The quality of the cells should be assessed considering the above metrics jointly and not simply in isolation.

QC can be applied manually by subjectively choosing and applying thresholds, which can be arbitrary. Alternatively, QC can be applied automatically using adaptive thresholds like the mean absolute deviation (MAD) (outliers generally 3 or 5 MAD from the median) (See scuttle::isOutlier()). You can find more information on adaptive thresholds here.

Here, we are going to start with applying thresholds based on a subjective visual assessment of quality metrics.

Step 1: Visualize the data

There are several built in functions for visualizing data with Seurat. We can use violin plots and scatter plots to check out the individual distributions and correlations between metrics. We can also use other plot types beyond Seurat (for example, density plots).

Check out scCustomize for more options for visualizing single-cell data.

Set sample colors once so our plots stay consistent:

sample_colors <- c(
  S89Inf42 = "#0072B2",  # blue
  S91Hom42 = "#E69F00",  # orange
  S93Hom3 = "#009E73",  # bluish green
  S99Inf42 = "#CC79A7"   # reddish purple
)

Create factors and set colors by condition:

metadata <- metadata |> mutate(across(where(is.character),as.factor))
glimpse(metadata)
Rows: 28,494
Columns: 7
$ orig.ident   <fct> S89Inf42, S89Inf42, S89Inf42, S89Inf42, S89Inf42, S89Inf4…
$ nCount_RNA   <dbl> 5341, 3896, 9012, 6752, 1339, 14257, 8647, 3112, 12509, 1…
$ nFeature_RNA <int> 2214, 1513, 2617, 2595, 201, 3162, 2097, 1152, 3149, 3705…
$ treatment    <fct> Tamoxifen_Influenza, Tamoxifen_Influenza, Tamoxifen_Influ…
$ time_point   <fct> Infection_Day42, Infection_Day42, Infection_Day42, Infect…
$ treatment_tp <fct> Tamoxifen_Influenza_Infection_Day42, Tamoxifen_Influenza_…
$ percent.mt   <dbl> 4.755664, 1.899384, 4.627164, 5.687204, 65.123226, 5.3377…
#Set colors(# set colors
condition_colors <- setNames(ifelse(levels(metadata$orig.ident) %in% c("S89Inf42","S99Inf42"), "cyan3","darkgoldenrod1"), levels(metadata$orig.ident))

nCount_RNA

Look at the distribution of nCount_RNA with a Violin plot:

# plot
VlnPlot(lung, features = "nCount_RNA", layer="counts", 
        group.by="orig.ident", alpha=0.05) +
  scale_fill_manual(values=condition_colors) +
  scale_y_continuous(breaks=seq(0,150000,by=10000))
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

A violin plot shows the distribution of a numeric variable across categorical variables of interest; it is a hybrid between a box plot and a density plot. The wider sections on the plot are associated with increased probabilities of data points being found at that location in the distribution. VlnPlot, a Seurat function that uses the Seurat object directly, overlays the individual data points to ease the interpretation of the figure.

Here we can see that the greatest density of cells have between 0 and 5000 UMIs, with most of the cells having less than 30,000 total UMIs.

We also see a handful of cells with extremely high nCount_RNA. Potential high nCount_RNA outliers can be evaluated after initial clustering and annotation, since they may represent either genuine high-RNA cell populations or doublets/technical artifacts; this can guide whether stricter thresholds or doublet-detection tools are needed.

Let’s use a density plot to identify a threshold to filter low quality cells.

metadata %>%
  ggplot(aes(x = nCount_RNA, fill = orig.ident, color = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  scale_x_log10() +
  scale_color_manual(values = sample_colors) +
  scale_fill_manual(values = sample_colors) +
  geom_vline(xintercept = 600,color="red",linetype="dotted")

The multi-modal distribution could indicate techincal or biological differences between cell types. We can stay pretty conservative for now, and then reevaluate post-clustering.

nFeature_RNA (Number of Genes)

Now, let’s take a look at the number of detected features.

VlnPlot(
  lung,
  features = "nFeature_RNA",
  layer = "counts",
  group.by = "orig.ident",
  pt.size = 0.05,
  alpha=0.2
) +
  scale_fill_manual(values = sample_colors) 

metadata %>%
  ggplot(aes(x = nFeature_RNA, fill = orig.ident, color = orig.ident)) +
  geom_density(alpha = 0.2) +
  theme_classic() +
  scale_x_log10() +
  scale_fill_manual(values = sample_colors) +
  scale_color_manual(values = sample_colors) +
  geom_vline(xintercept=c(375,5500), color="red",linetype="dotted")

There is a small left-hand shoulder visible, which could indicate poor quality cells, quiescent cells, or cells of low complexity.

percent.mt

VlnPlot(
  lung,
  features = "percent.mt",
  group.by = "orig.ident",
  pt.size = 0.05,
  alpha= 0.05
) +
  scale_fill_manual(values = condition_colors)
Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.

metadata %>%
  ggplot(aes(x = percent.mt, fill = orig.ident, color = orig.ident)) +
  geom_density(alpha = 0.2) +
  scale_x_log10() + 
  theme_classic() +
  scale_fill_manual(values = sample_colors) +
  scale_color_manual(values = sample_colors) +
  #facet_wrap(~c(treatment)) +
  geom_vline(xintercept=15)
Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning: Removed 516 rows containing non-finite outside the scale range
(`stat_density()`).

Examine the metrics together

To look at how these metrics correlate, we can use FeatureScatter(), which can be used to visualize feature-feature relationships and also be applied to other data stored in our Seurat object (e.g., metadata columns, PC scores).

FeatureScatter(
  lung,
  feature1 = "nCount_RNA",
  feature2 = "nFeature_RNA",
  group.by = "orig.ident",
  split.by = "treatment",
  log = TRUE
) +
  scale_color_manual(values = sample_colors)

FeatureScatter(
  lung,
  feature1 = "percent.mt",
  feature2 = "nFeature_RNA",
  group.by = "orig.ident",
  split.by = "treatment"
) +
  scale_color_manual(values = sample_colors)

metadata %>%
  ggplot(aes(x = nCount_RNA, y = nFeature_RNA, fill = percent.mt > 15)) +
  geom_point(shape = 21, alpha = 0.4, size = 1) +
  theme_classic() +
  scale_x_log10() +
  scale_y_log10() +
  facet_wrap(~orig.ident, ncol = 2) +
  geom_vline(xintercept = 600,color="red",linetype="dotted")+
  geom_hline(yintercept=375,color="red", linetype="dotted")

Decide on Thresholds

There is no single correct threshold set for every experiment. For this dataset, a reasonable and fairly permissive starting point is:

  • nFeature_RNA > 375
  • nCount_RNA > 600
  • percent.mt < 15 for most samples; this could potentially be increased for S89Inf42.

These are only starting values. In a real analysis you would revisit them after clustering and also consider dedicated tools for doublet and ambient RNA detection.

keep_cells <- metadata %>%
  rownames_to_column("cell") %>%
  filter(
    nFeature_RNA > 375,
    nCount_RNA > 600,
    percent.mt < 15) %>%
  pull(cell)

Filter the Object

lung_filt <- subset(lung, cells = keep_cells)
lung_filt
An object of class Seurat 
20329 features across 25823 samples within 1 assay 
Active assay: RNA (20329 features, 0 variable features)
 4 layers present: counts.S89Inf42, counts.S91Hom42, counts.S93Hom3, counts.S99Inf42

Once you have subset your object, it is recommended to run through these plots again. We are skipping this here.

You can quickly compare cell counts before and after filtering:

bind_rows(
  before = metadata,
  after = lung_filt[[]],
  .id = "stage"
) %>%
  dplyr::count(stage, orig.ident) %>%
  ggplot(aes(x = orig.ident, y = n, fill = stage)) +
  geom_col(position = "dodge") +
  theme_classic() +
  geom_text(aes(label = n, group=stage),position = position_dodge(width = 0.9))

Tip

Save the filtered object before moving on. It is often useful to return to this checkpoint later.

saveRDS(lung_filt, "./outputs/merged_Seurat_lung_qcfiltered.rds")

Normalization, Find Variable Features, and Scale

After quality filtering, the next step is to normalize the data and prepare it for dimensionality reduction. In the older Seurat workflow, this often proceeded as:

Why normalization?

The total number of detected transcripts expressed in a cell is dependent on the amount of mRNA in a cell. Cells naturally vary in the total amount of mRNA expressed. However, the chemistry of the assay inserts technical noise that results in variation in the counts of mRNA even between identical cells. Sources of technical variation include differences in cell lysis, reverse transcription efficiency, and stochastic molecular sampling during sequencing, among others. Because differences in observed expression result from a combination of biological and technical variation, we need to minimize or remove the impact of technical effects in order to compare expression levels across cells.

Standard Workflow

The standard normalization method in Seurat is a global-scaling normalization method (?NormalizeData(), default LogNormalize). It divides feature counts for each cell by that cell’s total counts and multiplies by a scale factor, usually 10,000. This is then natural log transformed with a pseudocount of 1.

Following normalization, FindVariableFeatures() identifies genes with high cell-to-cell variation. This focuses downstream steps on features that are likely to carry biological signal.

ScaleData() then centers and scales expression values before PCA. This is standard practice prior to dimensionality reduction. By default, this is applied only to the variable features selected above.

Warning

Scaling only the top variable features can lead to warnings later when visualizing genes that were not scaled. It is also possible to apply ScaleData() to the full dataset, but that is memory intensive.

This is included for comparison. We will use SCTransform() for the lung dataset below.

# Normalize
lung_filt <- NormalizeData(
  lung_filt,
  normalization.method = "LogNormalize",
  scale.factor = 10000
)

# Find variable features
lung_filt <- FindVariableFeatures(
  lung_filt,
  selection.method = "vst",
  nfeatures = 2000
)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(lung_filt), 10)

# Plot variable features with and without labels
plot1 <- VariableFeaturePlot(lung_filt)
LabelPoints(plot = plot1, points = top10, repel = TRUE)

# Scale the data
lung_filt <- ScaleData(lung_filt, vars.to.regress = "percent.mt")

SCTransform (the new standard)

Rather than relying on the three-step workflow above, we will use SCTransform(), which performs normalization, variance stabilization, and feature selection in one step. This is a common Seurat v5 workflow for RNA data.

SCTransform() uses Pearson residuals for transformation, which better accounts for the distribution of single-cell expression values. It also gives more weight to lower-expressed genes that may show cell-type specific expression, instead of allowing highly expressed genes to dominate the analysis. Read more about the method here.

Things to know:

  • Transformed data will be stored in the SCT assay.
  • The SCT assay is set as the default assay after running SCTransform().
  • During normalization, we can regress out potential confounding variables such as mitochondrial mapping percentage.
  • If glmGamPoi is installed, Seurat can use it to make SCTransform() substantially faster. SCTransform Vignette

SCTransform returns 3 layers:

  • counts - the regularized, corrected counts for all genes.
  • data - log-normalized version of the corrected counts.
  • scale.data - Pearson residuals for the variable features (3000 by default) (used for PCA)

Let’s run the function. Notice that we are regressing out the effect of percent mitochondrial reads. If mitochondrial expression is biologically meaningful for your system, compare results with and without this regression.

When you specify vars.to.regress in SCTransform(), those variables are included as additional covariates in the regularized negative binomial regression that SCTransform fits for each gene. The residuals it returns are then computed after accounting for those covariates, so their effect is “regressed out” of the resulting scale.data.

lung_filt <- SCTransform(
  lung_filt,
  vars.to.regress = "percent.mt",
  verbose = FALSE
)
Warning: package 'future' was built under R version 4.5.2

After running SCTransform(), the default assay becomes SCT.

DefaultAssay(lung_filt)
[1] "SCT"
Note

It is important to know which assay is active. You can check or change the default assay as needed:

# Check default assay
DefaultAssay(object = lung_filt)

# Set default assay
DefaultAssay(object = lung_filt) <- "RNA"

The 10x Genomics normalization guide is a useful overview, and the SCTransform vignette provides more details about the method.

PCA

The role of cell clustering is to identify cells with similar transcriptomic profiles. Single-cell RNA-seq data are highly dimensional, with each cell associated with thousands of genes. PCA reduces this complexity by summarizing correlated gene-expression patterns into principal components (PCs).

Seurat uses these PCs as the basis for nearest-neighbor graph construction and clustering. The top PCs act like metafeatures that combine information across correlated genes, which helps reduce noise and speed up downstream analysis. Learn more about PCA here and here.

lung_filt <- RunPCA(lung_filt, assay = "SCT", verbose = FALSE)

Exploring the PCA Results

After running PCA, we need to decide which PCs to include downstream. We want enough PCs to retain biological signal, but not so many that we mostly include noise.

Two useful visualizations are DimHeatmap() and ElbowPlot().

The elbow plot produces a ranking of principle components based on the percentage of variance explained by each one.

ElbowPlot(lung_filt, ndims = 40)

DimHeatmap() allows us to visualize the top genes contributing to each PC. Both cells and genes are sorted by their principal component scores. By default it includes the top 30 genes with the highest and lowest PC loadings.

DimHeatmap(lung_filt, dims = 1:9, cells = 500, balanced = TRUE, ncol = 3)

DimHeatmap(lung_filt, dims = 20:30, cells = 500, balanced = TRUE, ncol = 3)

For this tutorial, we will start with the first 30 PCs. That is a reasonable default for an SCT workflow, but the exact choice should be checked against the elbow plot, PC heatmaps, and downstream stability.

Clustering and UMAP

Clustering is used to group cells by similar transcriptomic profiles. Seurat performs graph-based clustering:

  1. FindNeighbors() builds a KNN graph using euclidean distances in PCA space, refining edge weights between cells using Jaccard similarity.
  2. FindClusters() partitions that graph into communities using modularity optimization.
  3. RunUMAP() provides a 2D visualization of the resulting structure.

There are multiple algorithm options for FindClusters(). The Louvain algorithm is used by default, which performs well in benchmarks. If you want to explore further, its successor the Leiden algorithm yields better-connected communities and is more computationally efficient, making it worth considering for more rigorous workflows. Both methods use a resolution parameter to control the granularity. Higher resolution values produce more clusters.

NoteChoosing the “right” resolution

It is difficult to know whether clustering is “correct” at first glance. The right resolution depends on your scientific question. Are you interested in broad cell classes, subtypes, rare states, or injury- associated programs? Try several resolutions and compare marker genes, metadata, and biological interpretability.

Here, we keep resolution fairly low to get a preliminary view of broad cell types and states.

lung_filt <- FindNeighbors(lung_filt, dims = 1:30)
Computing nearest neighbor graph
Computing SNN
lung_filt <- FindClusters(lung_filt, resolution = 0.2)
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 25823
Number of edges: 870221

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9685
Number of communities: 23
Elapsed time: 2 seconds
lung_filt <- RunUMAP(lung_filt, dims = 1:30)
Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:05:20 UMAP embedding parameters a = 0.9922 b = 1.112
15:05:20 Read 25823 rows and found 30 numeric columns
15:05:20 Using Annoy for neighbor search, n_neighbors = 30
15:05:20 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:05:21 Writing NN index file to temp file /var/folders/hv/m8yfpqmn4sgfw9f0j7xfzjhc3fj3st/T//RtmpiAuEyn/fileca4a74c7efb
15:05:21 Searching Annoy index using 1 thread, search_k = 3000
15:05:26 Annoy recall = 100%
15:05:26 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:05:27 Initializing from normalized Laplacian + noise (using RSpectra)
15:05:28 Commencing optimization for 200 epochs, with 1116462 positive edges
15:05:28 Using rng type: pcg
15:05:35 Optimization finished

Visualize the result:

DimPlot(
  lung_filt,
  reduction = "umap",
  group.by = c("orig.ident", "seurat_clusters", "time_point", "treatment"),
  ncol = 2,
  label = FALSE,
  alpha = 0.2
)

DimPlot(
  lung_filt,
  reduction = "umap",
  group.by = "orig.ident",
  split.by = "treatment",
  ncol = 2,
  label = FALSE,
  alpha = 0.2
)

Are the high ncount_RNA cells forming distinct cell populations? Do they sit between clusters?

# View cells with the highest nCount_RNA
head(lung_filt@meta.data[order(lung_filt$nCount_RNA, decreasing = TRUE), ], 50)
                          orig.ident nCount_RNA nFeature_RNA
S99Inf42_GGAGGATGTCGCCACA   S99Inf42     149774         7983
S93Hom3_TGTCCTGGTAAGCTCT     S93Hom3     102501         6695
S89Inf42_GCCGATGAGGCTAGCA   S89Inf42     101166         8330
S93Hom3_GGTGTTAAGGGCGAGA     S93Hom3      94399         7925
S89Inf42_CTGCAGGCAGGGTCTC   S89Inf42      90959         7914
S99Inf42_GACCCAGTCCCTCTCC   S99Inf42      84601         9010
S89Inf42_GCCCAGATCGAAGCCC   S89Inf42      79785         7787
S99Inf42_TGCTCGTCACCCTGAG   S99Inf42      77016         5122
S93Hom3_TTCCTTCCACGTAACT     S93Hom3      75400         7102
S91Hom42_CCAAGCGTCCTAGCTC   S91Hom42      75025         4609
S99Inf42_AATTTCCCATGTGACT   S99Inf42      72140         9500
S93Hom3_ATTCGTTTCCACTGGG     S93Hom3      71425         5010
S99Inf42_CTTCCTTCATCGAACT   S99Inf42      67695         7355
S93Hom3_GATGAGGAGATGAATC     S93Hom3      65741         6315
S93Hom3_TATTTCGTCGCTCATC     S93Hom3      65088         5124
S89Inf42_TCACTATGTTACACAC   S89Inf42      64915         7060
S93Hom3_CTCAACCGTGGCTGAA     S93Hom3      64624         5347
S93Hom3_CTGTCGTCAGGACTAG     S93Hom3      64358         5714
S99Inf42_CTGAGCGCACATTCGA   S99Inf42      62867         8540
S89Inf42_TAGACCACATCTGGGC   S89Inf42      62550         7613
S99Inf42_ACTGTGATCCTTCTGG   S99Inf42      62070         7293
S89Inf42_TCGGGACAGCGATGGT   S89Inf42      61224         4491
S91Hom42_CGAAGGACAGGATTCT   S91Hom42      60948         5866
S89Inf42_CGATGGCTCTGGCCGA   S89Inf42      59699         8169
S89Inf42_AAAGGATGTATGGTTC   S89Inf42      59611         5522
S99Inf42_GGAATGGGTCAAGCCC   S99Inf42      59446         4917
S93Hom3_AGGAAATAGGGAACAA     S93Hom3      58894         4056
S93Hom3_CTAAGTGGTGAGACCA     S93Hom3      57782         7181
S93Hom3_TCGGGTGTCACAATGC     S93Hom3      56289         4616
S89Inf42_GATGACTTCGGATACT   S89Inf42      56262         7721
S89Inf42_CTAGGTACACTCGATA   S89Inf42      55003         6356
S89Inf42_CTTGAGACACTGTCCT   S89Inf42      53988         7637
S99Inf42_CTTTCGGCATTGGCAT   S99Inf42      53850         7842
S91Hom42_TTGACCCCATCTTCGC   S91Hom42      53626         3531
S99Inf42_TGACCCTTCAGTCTTT   S99Inf42      53527         7076
S89Inf42_TGGAGAGGTCTGTAGT   S89Inf42      53235         7231
S99Inf42_TGAGTCAAGCAAGTGC   S99Inf42      53143         5309
S89Inf42_AGGTTGTAGGGCCCTT   S89Inf42      53078         6341
S93Hom3_GTCAAACAGGTGATAT     S93Hom3      52932         4344
S89Inf42_GCCTGTTAGGTCTTTG   S89Inf42      52766         7807
S99Inf42_ACAGCCGGTACCGTGC   S99Inf42      52560         7829
S93Hom3_AGGTAGGAGGGCAAGG     S93Hom3      52511         3970
S89Inf42_GACTTCCTCTACACTT   S89Inf42      52423         7652
S91Hom42_AAGACTCTCTTTGCGC   S91Hom42      52198         5142
S89Inf42_CCTCATGTCAACTGAC   S89Inf42      51731         7527
S91Hom42_GAAACCTAGCGGGTAT   S91Hom42      51704         3813
S89Inf42_ACATCCCTCCATAGAC   S89Inf42      51671         5030
S93Hom3_ATTTACCTCATTTGGG     S93Hom3      51639         4329
S99Inf42_GGTTCTCTCAACGCTA   S99Inf42      51596         8059
S99Inf42_GTTTGGAGTGTCTAAC   S99Inf42      51168         5279
                                    treatment        time_point
S99Inf42_GGAGGATGTCGCCACA Tamoxifen_Influenza   Infection_Day42
S93Hom3_TGTCCTGGTAAGCTCT            Tamoxifen  Homeostasis_Day3
S89Inf42_GCCGATGAGGCTAGCA Tamoxifen_Influenza   Infection_Day42
S93Hom3_GGTGTTAAGGGCGAGA            Tamoxifen  Homeostasis_Day3
S89Inf42_CTGCAGGCAGGGTCTC Tamoxifen_Influenza   Infection_Day42
S99Inf42_GACCCAGTCCCTCTCC Tamoxifen_Influenza   Infection_Day42
S89Inf42_GCCCAGATCGAAGCCC Tamoxifen_Influenza   Infection_Day42
S99Inf42_TGCTCGTCACCCTGAG Tamoxifen_Influenza   Infection_Day42
S93Hom3_TTCCTTCCACGTAACT            Tamoxifen  Homeostasis_Day3
S91Hom42_CCAAGCGTCCTAGCTC           Tamoxifen Homeostasis_Day42
S99Inf42_AATTTCCCATGTGACT Tamoxifen_Influenza   Infection_Day42
S93Hom3_ATTCGTTTCCACTGGG            Tamoxifen  Homeostasis_Day3
S99Inf42_CTTCCTTCATCGAACT Tamoxifen_Influenza   Infection_Day42
S93Hom3_GATGAGGAGATGAATC            Tamoxifen  Homeostasis_Day3
S93Hom3_TATTTCGTCGCTCATC            Tamoxifen  Homeostasis_Day3
S89Inf42_TCACTATGTTACACAC Tamoxifen_Influenza   Infection_Day42
S93Hom3_CTCAACCGTGGCTGAA            Tamoxifen  Homeostasis_Day3
S93Hom3_CTGTCGTCAGGACTAG            Tamoxifen  Homeostasis_Day3
S99Inf42_CTGAGCGCACATTCGA Tamoxifen_Influenza   Infection_Day42
S89Inf42_TAGACCACATCTGGGC Tamoxifen_Influenza   Infection_Day42
S99Inf42_ACTGTGATCCTTCTGG Tamoxifen_Influenza   Infection_Day42
S89Inf42_TCGGGACAGCGATGGT Tamoxifen_Influenza   Infection_Day42
S91Hom42_CGAAGGACAGGATTCT           Tamoxifen Homeostasis_Day42
S89Inf42_CGATGGCTCTGGCCGA Tamoxifen_Influenza   Infection_Day42
S89Inf42_AAAGGATGTATGGTTC Tamoxifen_Influenza   Infection_Day42
S99Inf42_GGAATGGGTCAAGCCC Tamoxifen_Influenza   Infection_Day42
S93Hom3_AGGAAATAGGGAACAA            Tamoxifen  Homeostasis_Day3
S93Hom3_CTAAGTGGTGAGACCA            Tamoxifen  Homeostasis_Day3
S93Hom3_TCGGGTGTCACAATGC            Tamoxifen  Homeostasis_Day3
S89Inf42_GATGACTTCGGATACT Tamoxifen_Influenza   Infection_Day42
S89Inf42_CTAGGTACACTCGATA Tamoxifen_Influenza   Infection_Day42
S89Inf42_CTTGAGACACTGTCCT Tamoxifen_Influenza   Infection_Day42
S99Inf42_CTTTCGGCATTGGCAT Tamoxifen_Influenza   Infection_Day42
S91Hom42_TTGACCCCATCTTCGC           Tamoxifen Homeostasis_Day42
S99Inf42_TGACCCTTCAGTCTTT Tamoxifen_Influenza   Infection_Day42
S89Inf42_TGGAGAGGTCTGTAGT Tamoxifen_Influenza   Infection_Day42
S99Inf42_TGAGTCAAGCAAGTGC Tamoxifen_Influenza   Infection_Day42
S89Inf42_AGGTTGTAGGGCCCTT Tamoxifen_Influenza   Infection_Day42
S93Hom3_GTCAAACAGGTGATAT            Tamoxifen  Homeostasis_Day3
S89Inf42_GCCTGTTAGGTCTTTG Tamoxifen_Influenza   Infection_Day42
S99Inf42_ACAGCCGGTACCGTGC Tamoxifen_Influenza   Infection_Day42
S93Hom3_AGGTAGGAGGGCAAGG            Tamoxifen  Homeostasis_Day3
S89Inf42_GACTTCCTCTACACTT Tamoxifen_Influenza   Infection_Day42
S91Hom42_AAGACTCTCTTTGCGC           Tamoxifen Homeostasis_Day42
S89Inf42_CCTCATGTCAACTGAC Tamoxifen_Influenza   Infection_Day42
S91Hom42_GAAACCTAGCGGGTAT           Tamoxifen Homeostasis_Day42
S89Inf42_ACATCCCTCCATAGAC Tamoxifen_Influenza   Infection_Day42
S93Hom3_ATTTACCTCATTTGGG            Tamoxifen  Homeostasis_Day3
S99Inf42_GGTTCTCTCAACGCTA Tamoxifen_Influenza   Infection_Day42
S99Inf42_GTTTGGAGTGTCTAAC Tamoxifen_Influenza   Infection_Day42
                                                 treatment_tp percent.mt
S99Inf42_GGAGGATGTCGCCACA Tamoxifen_Influenza_Infection_Day42   2.206658
S93Hom3_TGTCCTGGTAAGCTCT           Tamoxifen_Homeostasis_Day3   2.756071
S89Inf42_GCCGATGAGGCTAGCA Tamoxifen_Influenza_Infection_Day42   1.200996
S93Hom3_GGTGTTAAGGGCGAGA           Tamoxifen_Homeostasis_Day3   2.725664
S89Inf42_CTGCAGGCAGGGTCTC Tamoxifen_Influenza_Infection_Day42   2.951879
S99Inf42_GACCCAGTCCCTCTCC Tamoxifen_Influenza_Infection_Day42   1.320315
S89Inf42_GCCCAGATCGAAGCCC Tamoxifen_Influenza_Infection_Day42   3.154728
S99Inf42_TGCTCGTCACCCTGAG Tamoxifen_Influenza_Infection_Day42   1.890516
S93Hom3_TTCCTTCCACGTAACT           Tamoxifen_Homeostasis_Day3   1.133952
S91Hom42_CCAAGCGTCCTAGCTC         Tamoxifen_Homeostasis_Day42   2.088637
S99Inf42_AATTTCCCATGTGACT Tamoxifen_Influenza_Infection_Day42   2.263654
S93Hom3_ATTCGTTTCCACTGGG           Tamoxifen_Homeostasis_Day3   2.293315
S99Inf42_CTTCCTTCATCGAACT Tamoxifen_Influenza_Infection_Day42   1.236428
S93Hom3_GATGAGGAGATGAATC           Tamoxifen_Homeostasis_Day3   1.832951
S93Hom3_TATTTCGTCGCTCATC           Tamoxifen_Homeostasis_Day3   1.975787
S89Inf42_TCACTATGTTACACAC Tamoxifen_Influenza_Infection_Day42   2.871447
S93Hom3_CTCAACCGTGGCTGAA           Tamoxifen_Homeostasis_Day3   2.217442
S93Hom3_CTGTCGTCAGGACTAG           Tamoxifen_Homeostasis_Day3   1.532055
S99Inf42_CTGAGCGCACATTCGA Tamoxifen_Influenza_Infection_Day42   1.560437
S89Inf42_TAGACCACATCTGGGC Tamoxifen_Influenza_Infection_Day42   3.283773
S99Inf42_ACTGTGATCCTTCTGG Tamoxifen_Influenza_Infection_Day42   3.657161
S89Inf42_TCGGGACAGCGATGGT Tamoxifen_Influenza_Infection_Day42   1.246243
S91Hom42_CGAAGGACAGGATTCT         Tamoxifen_Homeostasis_Day42   4.210146
S89Inf42_CGATGGCTCTGGCCGA Tamoxifen_Influenza_Infection_Day42   2.299871
S89Inf42_AAAGGATGTATGGTTC Tamoxifen_Influenza_Infection_Day42   2.316687
S99Inf42_GGAATGGGTCAAGCCC Tamoxifen_Influenza_Infection_Day42   2.284426
S93Hom3_AGGAAATAGGGAACAA           Tamoxifen_Homeostasis_Day3   1.292152
S93Hom3_CTAAGTGGTGAGACCA           Tamoxifen_Homeostasis_Day3   2.390018
S93Hom3_TCGGGTGTCACAATGC           Tamoxifen_Homeostasis_Day3   1.176073
S89Inf42_GATGACTTCGGATACT Tamoxifen_Influenza_Infection_Day42   1.942697
S89Inf42_CTAGGTACACTCGATA Tamoxifen_Influenza_Infection_Day42   2.259877
S89Inf42_CTTGAGACACTGTCCT Tamoxifen_Influenza_Infection_Day42   2.141217
S99Inf42_CTTTCGGCATTGGCAT Tamoxifen_Influenza_Infection_Day42   2.378830
S91Hom42_TTGACCCCATCTTCGC         Tamoxifen_Homeostasis_Day42   1.131914
S99Inf42_TGACCCTTCAGTCTTT Tamoxifen_Influenza_Infection_Day42   4.173595
S89Inf42_TGGAGAGGTCTGTAGT Tamoxifen_Influenza_Infection_Day42   2.783883
S99Inf42_TGAGTCAAGCAAGTGC Tamoxifen_Influenza_Infection_Day42   2.022844
S89Inf42_AGGTTGTAGGGCCCTT Tamoxifen_Influenza_Infection_Day42   3.619202
S93Hom3_GTCAAACAGGTGATAT           Tamoxifen_Homeostasis_Day3   1.209098
S89Inf42_GCCTGTTAGGTCTTTG Tamoxifen_Influenza_Infection_Day42   3.684191
S99Inf42_ACAGCCGGTACCGTGC Tamoxifen_Influenza_Infection_Day42   1.953957
S93Hom3_AGGTAGGAGGGCAAGG           Tamoxifen_Homeostasis_Day3   2.237626
S89Inf42_GACTTCCTCTACACTT Tamoxifen_Influenza_Infection_Day42   1.287603
S91Hom42_AAGACTCTCTTTGCGC         Tamoxifen_Homeostasis_Day42   1.312311
S89Inf42_CCTCATGTCAACTGAC Tamoxifen_Influenza_Infection_Day42   1.650848
S91Hom42_GAAACCTAGCGGGTAT         Tamoxifen_Homeostasis_Day42   1.522126
S89Inf42_ACATCCCTCCATAGAC Tamoxifen_Influenza_Infection_Day42   2.254650
S93Hom3_ATTTACCTCATTTGGG           Tamoxifen_Homeostasis_Day3   1.413660
S99Inf42_GGTTCTCTCAACGCTA Tamoxifen_Influenza_Infection_Day42   1.876114
S99Inf42_GTTTGGAGTGTCTAAC Tamoxifen_Influenza_Infection_Day42   4.065041
                          nCount_SCT nFeature_SCT SCT_snn_res.0.2
S99Inf42_GGAGGATGTCGCCACA       7473          720              17
S93Hom3_TGTCCTGGTAAGCTCT        8595          697              13
S89Inf42_GCCGATGAGGCTAGCA       8599         2738              13
S93Hom3_GGTGTTAAGGGCGAGA        7869         1303              17
S89Inf42_CTGCAGGCAGGGTCTC      10584         1912              17
S99Inf42_GACCCAGTCCCTCTCC       5368         2663              14
S89Inf42_GCCCAGATCGAAGCCC       9425         2063              13
S99Inf42_TGCTCGTCACCCTGAG       7640          527              13
S93Hom3_TTCCTTCCACGTAACT        7852         1173              17
S91Hom42_CCAAGCGTCCTAGCTC       7653          482              13
S99Inf42_AATTTCCCATGTGACT       5643         3201              17
S93Hom3_ATTCGTTTCCACTGGG        8598          511              13
S99Inf42_CTTCCTTCATCGAACT       6574         1938              20
S93Hom3_GATGAGGAGATGAATC        7962          917              17
S93Hom3_TATTTCGTCGCTCATC        8347          609              13
S89Inf42_TCACTATGTTACACAC       8690         2441              13
S93Hom3_CTCAACCGTGGCTGAA        8207          645              13
S93Hom3_CTGTCGTCAGGACTAG        8373          695              17
S99Inf42_CTGAGCGCACATTCGA       5556         2894              14
S89Inf42_TAGACCACATCTGGGC       8178         2702              13
S99Inf42_ACTGTGATCCTTCTGG       6748         1477              17
S89Inf42_TCGGGACAGCGATGGT      11718          698              13
S91Hom42_CGAAGGACAGGATTCT       8013         1287              13
S89Inf42_CGATGGCTCTGGCCGA       7406         3427              14
S89Inf42_AAAGGATGTATGGTTC       8622         1873               8
S99Inf42_GGAATGGGTCAAGCCC       7521          455              13
S93Hom3_AGGAAATAGGGAACAA        8810          374              13
S93Hom3_CTAAGTGGTGAGACCA        7268         1728              17
S93Hom3_TCGGGTGTCACAATGC        8349          563              13
S89Inf42_GATGACTTCGGATACT       7363         3071               1
S89Inf42_CTAGGTACACTCGATA       8885         2094              13
S89Inf42_CTTGAGACACTGTCCT       7228         3188              14
S99Inf42_CTTTCGGCATTGGCAT       6112         2759               4
S91Hom42_TTGACCCCATCTTCGC       7838          433              13
S99Inf42_TGACCCTTCAGTCTTT       6292         1988              13
S89Inf42_TGGAGAGGTCTGTAGT       8375         2608              13
S99Inf42_TGAGTCAAGCAAGTGC       7202          861              17
S89Inf42_AGGTTGTAGGGCCCTT       8090         2001              13
S93Hom3_GTCAAACAGGTGATAT        8413          585              13
S89Inf42_GCCTGTTAGGTCTTTG       7763         3379               4
S99Inf42_ACAGCCGGTACCGTGC       5647         2980              14
S93Hom3_AGGTAGGAGGGCAAGG        8546          453              13
S89Inf42_GACTTCCTCTACACTT       7807         3447              12
S91Hom42_AAGACTCTCTTTGCGC       7736          978              13
S89Inf42_CCTCATGTCAACTGAC       7300         3191              14
S91Hom42_GAAACCTAGCGGGTAT       7729          545              13
S89Inf42_ACATCCCTCCATAGAC       9279         1009              13
S93Hom3_ATTTACCTCATTTGGG        8297          562              13
S99Inf42_GGTTCTCTCAACGCTA       5679         3021              12
S99Inf42_GTTTGGAGTGTCTAAC       7136          949              17
                          seurat_clusters
S99Inf42_GGAGGATGTCGCCACA              17
S93Hom3_TGTCCTGGTAAGCTCT               13
S89Inf42_GCCGATGAGGCTAGCA              13
S93Hom3_GGTGTTAAGGGCGAGA               17
S89Inf42_CTGCAGGCAGGGTCTC              17
S99Inf42_GACCCAGTCCCTCTCC              14
S89Inf42_GCCCAGATCGAAGCCC              13
S99Inf42_TGCTCGTCACCCTGAG              13
S93Hom3_TTCCTTCCACGTAACT               17
S91Hom42_CCAAGCGTCCTAGCTC              13
S99Inf42_AATTTCCCATGTGACT              17
S93Hom3_ATTCGTTTCCACTGGG               13
S99Inf42_CTTCCTTCATCGAACT              20
S93Hom3_GATGAGGAGATGAATC               17
S93Hom3_TATTTCGTCGCTCATC               13
S89Inf42_TCACTATGTTACACAC              13
S93Hom3_CTCAACCGTGGCTGAA               13
S93Hom3_CTGTCGTCAGGACTAG               17
S99Inf42_CTGAGCGCACATTCGA              14
S89Inf42_TAGACCACATCTGGGC              13
S99Inf42_ACTGTGATCCTTCTGG              17
S89Inf42_TCGGGACAGCGATGGT              13
S91Hom42_CGAAGGACAGGATTCT              13
S89Inf42_CGATGGCTCTGGCCGA              14
S89Inf42_AAAGGATGTATGGTTC               8
S99Inf42_GGAATGGGTCAAGCCC              13
S93Hom3_AGGAAATAGGGAACAA               13
S93Hom3_CTAAGTGGTGAGACCA               17
S93Hom3_TCGGGTGTCACAATGC               13
S89Inf42_GATGACTTCGGATACT               1
S89Inf42_CTAGGTACACTCGATA              13
S89Inf42_CTTGAGACACTGTCCT              14
S99Inf42_CTTTCGGCATTGGCAT               4
S91Hom42_TTGACCCCATCTTCGC              13
S99Inf42_TGACCCTTCAGTCTTT              13
S89Inf42_TGGAGAGGTCTGTAGT              13
S99Inf42_TGAGTCAAGCAAGTGC              17
S89Inf42_AGGTTGTAGGGCCCTT              13
S93Hom3_GTCAAACAGGTGATAT               13
S89Inf42_GCCTGTTAGGTCTTTG               4
S99Inf42_ACAGCCGGTACCGTGC              14
S93Hom3_AGGTAGGAGGGCAAGG               13
S89Inf42_GACTTCCTCTACACTT              12
S91Hom42_AAGACTCTCTTTGCGC              13
S89Inf42_CCTCATGTCAACTGAC              14
S91Hom42_GAAACCTAGCGGGTAT              13
S89Inf42_ACATCCCTCCATAGAC              13
S93Hom3_ATTTACCTCATTTGGG               13
S99Inf42_GGTTCTCTCAACGCTA              12
S99Inf42_GTTTGGAGTGTCTAAC              17
# Or define a high-count subset, for example top 1%
cutoff <- quantile(lung_filt$nCount_RNA, 0.99)
high_count_cells <- WhichCells(lung_filt, expression = nCount_RNA > cutoff)

length(high_count_cells)
[1] 259
head(high_count_cells)
[1] "S89Inf42_AAAGAACTCGGAGATG" "S89Inf42_AAAGGATGTATGGTTC"
[3] "S89Inf42_AATTCCTCAACAAGTA" "S89Inf42_ACATCCCTCCATAGAC"
[5] "S89Inf42_ACCTACCGTTATGGTC" "S89Inf42_ACCTGTCAGCACAAAT"
table(Idents(lung_filt)[high_count_cells])

 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 
 3  2 38  0 16  3  3  4  6  0  0  1  9 93 48  0  0 22  0  0  9  2  0 
DimPlot(lung_filt, cells.highlight = list(high_umi=high_count_cells), split.by="orig.ident", label=TRUE)

Do you have canonical markers?

Canonical markers are curated, biologically validated genes known to define specific cell types (e.g., PTPRC for immune cells, EPCAM for epithelial cells). You can use these directly to annotate clusters by visualizing expression with FeaturePlot(), VlnPlot(), or DotPlot(). Look for which clusters highly express your markers of interest. You can combine this with FindAllMarkers to statistically confirm annotations as well as other tools to annotate your clusters.

FeaturePlot(lung_filt, features = c( "Epcam","Ptprc"))

Finding Cluster Biomarkers

Once we have clusters, we can use differential expression analysis to find potential marker genes that define each cluster. These marker genes are useful for assigning broad cell identities and for checking whether the clustering resolution is biologically sensible.

Because we are working with an SCT assay, run PrepSCTFindMarkers() before differential expression testing. This corrects the counts and joins the data when the seurat object includes multiple samples and multiple SCT models.

lung_filt <- PrepSCTFindMarkers(lung_filt, verbose = TRUE)
Found 4 SCT models. Recorrecting SCT counts using minimum median counts: 5378

Now test each cluster against all other cells using FindAllMarkers. only.pos = TRUE keeps genes that are enriched in the cluster.

Note

There are many options for differential expression testing. See the test.use parameter in ?FindAllMarkers(). By default, Seurat uses a Wilcoxon Rank Sum test.

Warningpresto

The R package, presto is used to speed up the DE analysis; it is not required but is recommended. You may encounter issues when installing presto on Biowulf due to dependency requirements. You may find that downgrading RcppArmadillo to an earlier version, for example, “14.0.2-1” may help (remotes::install_version("RcppArmadillo", version = "14.0.2-1", repos = "https://cran.r-project.org")). If not, feel free to email us at .

lung_filt_markers <- FindAllMarkers(lung_filt, only.pos = TRUE)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16
Calculating cluster 17
Calculating cluster 18
Calculating cluster 19
Calculating cluster 20
Calculating cluster 21
Calculating cluster 22
lung_filt_markers <- lung_filt_markers |>
  group_by(cluster) |>
  arrange(desc(avg_log2FC), pct.1, p_val_adj, .by_group=TRUE)

lung_filt_markers |>
  dplyr::filter(avg_log2FC > 1) |>
  slice_head(n = 10) |>
  ungroup()
# A tibble: 230 × 7
       p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene    
       <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
 1 2.09e-101       3.19 0.043 0.006 4.26e- 97 0       Cd1d2   
 2 3.27e-118       3.03 0.055 0.008 6.64e-114 0       Gm47625 
 3 8.52e- 24       2.98 0.01  0.001 1.73e- 19 0       Gm13715 
 4 5.68e- 50       2.94 0.025 0.004 1.15e- 45 0       Slco1c1 
 5 4.80e- 24       2.80 0.012 0.002 9.76e- 20 0       Gm50242 
 6 1.70e- 21       2.73 0.01  0.002 3.45e- 17 0       Lalba   
 7 9.38e- 82       2.69 0.042 0.007 1.91e- 77 0       Gm14221 
 8 1.73e- 73       2.59 0.042 0.008 3.51e- 69 0       Gm43504 
 9 9.74e- 53       2.58 0.027 0.005 1.98e- 48 0       Slc26a10
10 6.76e- 40       2.57 0.023 0.005 1.38e- 35 0       Gm31834 
# ℹ 220 more rows

In the marker table, pct.1 is the percentage of cells in the cluster where the gene is detected, while pct.2 is the percentage of cells on average in all the other clusters where the gene is detected. A strong cluster marker is usually enriched in one cluster and low in others. You should also see a high avg_log2FC between your cluster of interest and all other clusters.

You can combine these results with visualizations such as heat maps (DoHeatmap()), violin plots (VlnPlot()), feature scatter plots (FeaturePlot()), and dot plots (DotPlot()).

These markers are only a starting point. Final annotation should combine marker expression, known biology, and results from reference-based annotation tools.

Cell Annotation

There are many tools and strategies available for cell annotation. A few commonly used options include:

Automated annotation should be treated as decision support, not ground truth. Always validate proposed labels against marker genes, QC metrics, metadata, and the biological context of the lung regeneration dataset.

Save the Clustered Object

saveRDS(lung_filt, "./outputs/merged_Seurat_lung_sct_clustered.rds")

Next Steps

At this point, ask:

  • Do clusters mix samples reasonably well?
  • Are cells grouping by biology or mostly by sample?
  • Do treatment and time point seem to explain expected structure?
  • Do known lung marker genes help explain major clusters?

If cells cluster mainly by sample rather than biological state, that is when it becomes time to consider an integration workflow.

At this stage, we have completed a standard preprocessing workflow: quality control, normalization, dimensional reduction, clustering, visualization, and an initial marker-gene workflow. This gives us a useful first view of the structure in the dataset, but it is only the beginning of a full single-cell analysis.

Typical next steps may include:

  • Integration - if cells cluster primarily by sample or batch rather than biology, consider an integration workflow.
  • Cluster annotation - identify the major cell populations using marker genes, reference datasets, and annotation tools.
  • Subsetting and re-clustering - once broad cell classes are identified, subset populations of interest and repeat clustering at higher resolution.
  • Differential expression - compare clusters, conditions, or time points to identify genes associated with particular cell states or experimental effects.
  • Cell composition analysis - test whether the proportions of major cell types differ across treatment groups or time points.
  • Trajectory or state-transition analysis - for dynamic systems such as injury and regeneration, model transitions between cell states.
Note

There is no single correct order for all downstream analyses. The right next step depends on your biological question, the quality of the clustering, and whether technical effects appear to be influencing the results.

References

The following resources were particularly helpful for this lesson: