Core Steps in scRNA-seq Analysis: QC to Clustering
Learning Objectives
This tutorial continues from “scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat”, and uses the same lung scRNA-seq dataset and Seurat object structure.
Throughout this tutorial we will:
- Apply quality control filters to retain high-quality cells
- Normalize the data with
SCTransform() - Perform dimensionality reduction
- Cluster
This lesson follows the standard Seurat workflow for Seurat v5 layered assays but with a different dataset.
This tutorial demonstrates common pre-processing steps, not a complete analysis. All thresholds and clustering decisions shown here should be treated as starting points that need to be revisited in the context of the biology.
The Data
As in Part 1, we are working with a subset of the lung regeneration dataset from Niethamer et al. 2025, Longitudinal single-cell profiles of lung regeneration after viral infection reveal persistent injury-associated cell states.
The four samples used here are:
| Sample | Species | Genotype | Treatment | Time Point | Status |
|---|---|---|---|---|---|
GSM8181589 (S89Inf42) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen + Influenza | Day 42 post-infection | Lung (post-infection) |
GSM8181591 (S91Hom42) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen | Homeostasis Day 42 | Lung (homeostasis) |
GSM8181593 (S93Hom3) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen | Homeostasis Day 3 | Lung (homeostasis) |
GSM8181599 (S99Inf42) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen + Influenza | Day 42 post-infection | Lung (post-infection) |
This subset includes lung tissue collected at two homeostasis time points (day 3 and day 42) and two samples from the post-infection condition at day 42. The two post-infection samples (S89Inf42 and S99Inf42) are biological replicates of the infected condition at day 42, enabling comparison of injury-associated lung cell states against homeostatic controls.
We should expect both:
- technical differences between samples
- genuine biological differences across treatment
This is why QC should be assessed per sample even though we are working with a merged object.
An R project directory containing these data and associated files can be downloaded here.
Load R Packages.
Load the Seurat Object
In “scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat” we created and saved a merged Seurat object to ./outputs/merged_Seurat_lung.rds. Here, we will load and continue to work from this object.
lung <- readRDS("./outputs/merged_Seurat_lung.rds")Examine the object:
lungAn object of class Seurat
20329 features across 28494 samples within 1 assay
Active assay: RNA (20329 features, 0 variable features)
4 layers present: counts.S89Inf42, counts.S91Hom42, counts.S93Hom3, counts.S99Inf42
glimpse(lung)Formal class 'Seurat' [package "SeuratObject"] with 13 slots
..@ assays :List of 1
.. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
..@ meta.data :'data.frame': 28494 obs. of 6 variables:
.. ..$ orig.ident : chr [1:28494] "S89Inf42" "S89Inf42" "S89Inf42" "S89Inf42" ...
.. ..$ nCount_RNA : num [1:28494] 5341 3896 9012 6752 1339 ...
.. ..$ nFeature_RNA: int [1:28494] 2214 1513 2617 2595 201 3162 2097 1152 3149 3705 ...
.. ..$ treatment : chr [1:28494] "Tamoxifen_Influenza" "Tamoxifen_Influenza" "Tamoxifen_Influenza" "Tamoxifen_Influenza" ...
.. ..$ time_point : chr [1:28494] "Infection_Day42" "Infection_Day42" "Infection_Day42" "Infection_Day42" ...
.. ..$ treatment_tp: chr [1:28494] "Tamoxifen_Influenza_Infection_Day42" "Tamoxifen_Influenza_Infection_Day42" "Tamoxifen_Influenza_Infection_Day42" "Tamoxifen_Influenza_Infection_Day42" ...
..@ active.assay: chr "RNA"
..@ active.ident: Factor w/ 4 levels "S89Inf42","S91Hom42",..: 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "names")= chr [1:28494] "S89Inf42_AAACCCAAGCAACAGC" "S89Inf42_AAACCCACAACTAGAA" "S89Inf42_AAACCCACATCGCTGG" "S89Inf42_AAACCCACATGGAAGC" ...
..@ graphs : list()
..@ neighbors : list()
..@ reductions : list()
..@ images : list()
..@ project.name: chr "Lung"
..@ misc : list()
..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
.. ..$ : int [1:3] 5 4 0
..@ commands : list()
..@ tools : list()
Layers(lung)[1] "counts.S89Inf42" "counts.S91Hom42" "counts.S93Hom3" "counts.S99Inf42"
In Seurat v5, merged RNA assays retain per-sample matrices as separate layers such as counts.S89Inf42 and counts.S91Hom42. That is why the previous seminar emphasized layers so heavily, and it is why this merged-object workflow still supports per-sample QC.
Quality Control
The goal of quality control is to keep only high quality cells (i.e., remove low quality cells (dead or dying cells), cell-free RNA, or doublets (droplets containing two cells)). Low quality cells will impact downstream analyses. The most common starting metrics are:
-
nCount_RNA- total UMI count per cell, representing the number of captured/countable transcript molecules detected in that cell.- high total count - potential doublets or multiplets
- low total count - potential ambient mRNA (not real cells)
- Cell Ranger threshold set at 500 UMIs
NoteWhat is a UMI?Unique Molecular Identifiers (UMIs) are short barcode sequences added to each captured RNA molecule before amplification. They are used to identify PCR duplicates so sequencing reads from the same original molecule can be collapsed into one count.
-
nFeature_RNA- number of genes expressed (detected) per cell.- high number of detected genes - potential doublets or multiplets
- low number of detected genes - potential ambient mRNA (not real cells)
- high number of detected genes - potential doublets or multiplets
-
Percent mitochondrial(
percent.mt) - the fraction of reads from mitochondrial genes per cell- high mtDNA - cellular degradation
Broadly:
Low quality cells and empty droplets will have few genes (low n_feature_RNA, low nCount_RNA), whereas doublets and multiplets will exhibit high n_feature_RNA and high nCount_RNA. Low quality or dying cells will also have high percent_mt.
BUT
There are also biological explanations for some observed patterns.
Cells with a relatively high fraction of mitochondrial counts might for example be involved in respiratory processes and should not be filtered out. Whereas, cells with low or high counts might correspond to quiescent cell populations or cells larger in size. — Single Cell Best Practices
Therefore, it is important to know your biology, and have some idea of expected cell types.
Doublets and multiplets are not easy to predict. There are dedicated tools for this purpose (e.g., DoubletFinder, scDblFinder,scrublet (python package)), which you may want to integrate into your workflow. Eliminating ambient RNA is also not as straight forward as it may seem. Ambient RNA can be included in GEMs along with intact cells. There are tools that can be used to predict and remove ambient RNA (e.g., SoupX and DecontX).
Even in a merged object, QC should be evaluated by sample. Different samples can have different count depths or mitochondrial distributions, and some of those differences may reflect biology rather than technical failure.
Yes, it is possible to introduce bias when working with samples independently. There can be variation between the quality of different samples. By not adjusting to the specific sample, you can just as easily introduce bias by keeping or removing different populations of cells for each of the samples. Best case scenario is all samples look similar and you can employ a standard filtering threshold across all samples.
- fraction of ribosomal protein genes - higher due to degradation or contamination
- Complexity (Novelty Score) - ratio of genes detected to UMI - greater gene to UMI count equals higher complexity.
QC metrics are stored as metadata
The quality metric information is stored in the metadata (lung@meta.data or lung[[]]), so we only need to work with the metadata file to get an idea of the thresholds we want to set for quality control. We will extract this information after we calculate the percentage of mitochondrial genes.
nCount_RNA and nFeature_RNA are already available to view, so we need to add the percentage of reads that mapped to the mitochondrial genome.
Take care when filtering cells during QC. Heavy filtering may result in the loss of rare cell populations, while minimal filtering, may impact our ability to annotate cell populations downstream. It’s best to start with conservative QC filtering, which can then be reassessed following downstream steps.
Add percent mitochondrial
To calculate percent.mt, we use PercentageFeatureSet(), which calculates the percentage of all the counts belonging to a subset of the possible features (e.g., mitochondrial genes, ribosomal genes) for each cell.
For this mouse dataset, mitochondrial genes begin with mt-.
lung[["percent.mt"]] <- PercentageFeatureSet(lung, pattern = "^mt-")Extract metadata
Using lung[[]] is the most convenient v5-friendly way to work with cell metadata.
metadata <- lung[[]]
head(metadata) orig.ident nCount_RNA nFeature_RNA
S89Inf42_AAACCCAAGCAACAGC S89Inf42 5341 2214
S89Inf42_AAACCCACAACTAGAA S89Inf42 3896 1513
S89Inf42_AAACCCACATCGCTGG S89Inf42 9012 2617
S89Inf42_AAACCCACATGGAAGC S89Inf42 6752 2595
S89Inf42_AAACCCAGTATGGAGC S89Inf42 1339 201
S89Inf42_AAACCCAGTCATCACA S89Inf42 14257 3162
treatment time_point
S89Inf42_AAACCCAAGCAACAGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCACAACTAGAA Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCACATCGCTGG Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCACATGGAAGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCAGTATGGAGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AAACCCAGTCATCACA Tamoxifen_Influenza Infection_Day42
treatment_tp percent.mt
S89Inf42_AAACCCAAGCAACAGC Tamoxifen_Influenza_Infection_Day42 4.755664
S89Inf42_AAACCCACAACTAGAA Tamoxifen_Influenza_Infection_Day42 1.899384
S89Inf42_AAACCCACATCGCTGG Tamoxifen_Influenza_Infection_Day42 4.627164
S89Inf42_AAACCCACATGGAAGC Tamoxifen_Influenza_Infection_Day42 5.687204
S89Inf42_AAACCCAGTATGGAGC Tamoxifen_Influenza_Infection_Day42 65.123226
S89Inf42_AAACCCAGTCATCACA Tamoxifen_Influenza_Infection_Day42 5.337729
QC assessment
The quality of the cells should be assessed considering the above metrics jointly and not simply in isolation.
QC can be applied manually by subjectively choosing and applying thresholds, which can be arbitrary. Alternatively, QC can be applied automatically using adaptive thresholds like the mean absolute deviation (MAD) (outliers generally 3 or 5 MAD from the median) (See scuttle::isOutlier()). You can find more information on adaptive thresholds here.
Here, we are going to start with applying thresholds based on a subjective visual assessment of quality metrics.
Step 1: Visualize the data
There are several built in functions for visualizing data with Seurat. We can use violin plots and scatter plots to check out the individual distributions and correlations between metrics. We can also use other plot types beyond Seurat (for example, density plots).
Check out scCustomize for more options for visualizing single-cell data.
Set sample colors once so our plots stay consistent:
sample_colors <- c(
S89Inf42 = "#0072B2", # blue
S91Hom42 = "#E69F00", # orange
S93Hom3 = "#009E73", # bluish green
S99Inf42 = "#CC79A7" # reddish purple
)Create factors and set colors by condition:
Rows: 28,494
Columns: 7
$ orig.ident <fct> S89Inf42, S89Inf42, S89Inf42, S89Inf42, S89Inf42, S89Inf4…
$ nCount_RNA <dbl> 5341, 3896, 9012, 6752, 1339, 14257, 8647, 3112, 12509, 1…
$ nFeature_RNA <int> 2214, 1513, 2617, 2595, 201, 3162, 2097, 1152, 3149, 3705…
$ treatment <fct> Tamoxifen_Influenza, Tamoxifen_Influenza, Tamoxifen_Influ…
$ time_point <fct> Infection_Day42, Infection_Day42, Infection_Day42, Infect…
$ treatment_tp <fct> Tamoxifen_Influenza_Infection_Day42, Tamoxifen_Influenza_…
$ percent.mt <dbl> 4.755664, 1.899384, 4.627164, 5.687204, 65.123226, 5.3377…
nCount_RNA
Look at the distribution of nCount_RNA with a Violin plot:
# plot
VlnPlot(lung, features = "nCount_RNA", layer="counts",
group.by="orig.ident", alpha=0.05) +
scale_fill_manual(values=condition_colors) +
scale_y_continuous(breaks=seq(0,150000,by=10000))Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
A violin plot shows the distribution of a numeric variable across categorical variables of interest; it is a hybrid between a box plot and a density plot. The wider sections on the plot are associated with increased probabilities of data points being found at that location in the distribution. VlnPlot, a Seurat function that uses the Seurat object directly, overlays the individual data points to ease the interpretation of the figure.
Here we can see that the greatest density of cells have between 0 and 5000 UMIs, with most of the cells having less than 30,000 total UMIs.
We also see a handful of cells with extremely high nCount_RNA. Potential high nCount_RNA outliers can be evaluated after initial clustering and annotation, since they may represent either genuine high-RNA cell populations or doublets/technical artifacts; this can guide whether stricter thresholds or doublet-detection tools are needed.
Let’s use a density plot to identify a threshold to filter low quality cells.
metadata %>%
ggplot(aes(x = nCount_RNA, fill = orig.ident, color = orig.ident)) +
geom_density(alpha = 0.2) +
theme_classic() +
scale_x_log10() +
scale_color_manual(values = sample_colors) +
scale_fill_manual(values = sample_colors) +
geom_vline(xintercept = 600,color="red",linetype="dotted")The multi-modal distribution could indicate techincal or biological differences between cell types. We can stay pretty conservative for now, and then reevaluate post-clustering.
nFeature_RNA (Number of Genes)
Now, let’s take a look at the number of detected features.
VlnPlot(
lung,
features = "nFeature_RNA",
layer = "counts",
group.by = "orig.ident",
pt.size = 0.05,
alpha=0.2
) +
scale_fill_manual(values = sample_colors) metadata %>%
ggplot(aes(x = nFeature_RNA, fill = orig.ident, color = orig.ident)) +
geom_density(alpha = 0.2) +
theme_classic() +
scale_x_log10() +
scale_fill_manual(values = sample_colors) +
scale_color_manual(values = sample_colors) +
geom_vline(xintercept=c(375,5500), color="red",linetype="dotted")There is a small left-hand shoulder visible, which could indicate poor quality cells, quiescent cells, or cells of low complexity.
percent.mt
VlnPlot(
lung,
features = "percent.mt",
group.by = "orig.ident",
pt.size = 0.05,
alpha= 0.05
) +
scale_fill_manual(values = condition_colors)Warning: Default search for "data" layer in "RNA" assay yielded no results;
utilizing "counts" layer instead.
metadata %>%
ggplot(aes(x = percent.mt, fill = orig.ident, color = orig.ident)) +
geom_density(alpha = 0.2) +
scale_x_log10() +
theme_classic() +
scale_fill_manual(values = sample_colors) +
scale_color_manual(values = sample_colors) +
#facet_wrap(~c(treatment)) +
geom_vline(xintercept=15)Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Warning: Removed 516 rows containing non-finite outside the scale range
(`stat_density()`).
Examine the metrics together
To look at how these metrics correlate, we can use FeatureScatter(), which can be used to visualize feature-feature relationships and also be applied to other data stored in our Seurat object (e.g., metadata columns, PC scores).
FeatureScatter(
lung,
feature1 = "nCount_RNA",
feature2 = "nFeature_RNA",
group.by = "orig.ident",
split.by = "treatment",
log = TRUE
) +
scale_color_manual(values = sample_colors)FeatureScatter(
lung,
feature1 = "percent.mt",
feature2 = "nFeature_RNA",
group.by = "orig.ident",
split.by = "treatment"
) +
scale_color_manual(values = sample_colors)metadata %>%
ggplot(aes(x = nCount_RNA, y = nFeature_RNA, fill = percent.mt > 15)) +
geom_point(shape = 21, alpha = 0.4, size = 1) +
theme_classic() +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~orig.ident, ncol = 2) +
geom_vline(xintercept = 600,color="red",linetype="dotted")+
geom_hline(yintercept=375,color="red", linetype="dotted")Decide on Thresholds
There is no single correct threshold set for every experiment. For this dataset, a reasonable and fairly permissive starting point is:
nFeature_RNA > 375nCount_RNA > 600-
percent.mt < 15for most samples; this could potentially be increased forS89Inf42.
These are only starting values. In a real analysis you would revisit them after clustering and also consider dedicated tools for doublet and ambient RNA detection.
Filter the Object
lung_filt <- subset(lung, cells = keep_cells)
lung_filtAn object of class Seurat
20329 features across 25823 samples within 1 assay
Active assay: RNA (20329 features, 0 variable features)
4 layers present: counts.S89Inf42, counts.S91Hom42, counts.S93Hom3, counts.S99Inf42
Once you have subset your object, it is recommended to run through these plots again. We are skipping this here.
You can quickly compare cell counts before and after filtering:
Save the filtered object before moving on. It is often useful to return to this checkpoint later.
saveRDS(lung_filt, "./outputs/merged_Seurat_lung_qcfiltered.rds")Normalization, Find Variable Features, and Scale
After quality filtering, the next step is to normalize the data and prepare it for dimensionality reduction. In the older Seurat workflow, this often proceeded as:
Why normalization?
The total number of detected transcripts expressed in a cell is dependent on the amount of mRNA in a cell. Cells naturally vary in the total amount of mRNA expressed. However, the chemistry of the assay inserts technical noise that results in variation in the counts of mRNA even between identical cells. Sources of technical variation include differences in cell lysis, reverse transcription efficiency, and stochastic molecular sampling during sequencing, among others. Because differences in observed expression result from a combination of biological and technical variation, we need to minimize or remove the impact of technical effects in order to compare expression levels across cells.
Standard Workflow
The standard normalization method in Seurat is a global-scaling normalization method (?NormalizeData(), default LogNormalize). It divides feature counts for each cell by that cell’s total counts and multiplies by a scale factor, usually 10,000. This is then natural log transformed with a pseudocount of 1.
Following normalization, FindVariableFeatures() identifies genes with high cell-to-cell variation. This focuses downstream steps on features that are likely to carry biological signal.
ScaleData() then centers and scales expression values before PCA. This is standard practice prior to dimensionality reduction. By default, this is applied only to the variable features selected above.
Scaling only the top variable features can lead to warnings later when visualizing genes that were not scaled. It is also possible to apply ScaleData() to the full dataset, but that is memory intensive.
This is included for comparison. We will use SCTransform() for the lung dataset below.
# Normalize
lung_filt <- NormalizeData(
lung_filt,
normalization.method = "LogNormalize",
scale.factor = 10000
)
# Find variable features
lung_filt <- FindVariableFeatures(
lung_filt,
selection.method = "vst",
nfeatures = 2000
)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(lung_filt), 10)
# Plot variable features with and without labels
plot1 <- VariableFeaturePlot(lung_filt)
LabelPoints(plot = plot1, points = top10, repel = TRUE)
# Scale the data
lung_filt <- ScaleData(lung_filt, vars.to.regress = "percent.mt")SCTransform (the new standard)
Rather than relying on the three-step workflow above, we will use SCTransform(), which performs normalization, variance stabilization, and feature selection in one step. This is a common Seurat v5 workflow for RNA data.
SCTransform() uses Pearson residuals for transformation, which better accounts for the distribution of single-cell expression values. It also gives more weight to lower-expressed genes that may show cell-type specific expression, instead of allowing highly expressed genes to dominate the analysis. Read more about the method here.
Things to know:
- Transformed data will be stored in the
SCTassay.
- The
SCTassay is set as the default assay after runningSCTransform().
- During normalization, we can regress out potential confounding variables such as mitochondrial mapping percentage.
- If
glmGamPoiis installed, Seurat can use it to makeSCTransform()substantially faster. SCTransform Vignette
SCTransform returns 3 layers:
-
counts- the regularized, corrected counts for all genes.
-
data- log-normalized version of the corrected counts.
-
scale.data- Pearson residuals for the variable features (3000 by default) (used for PCA)
Let’s run the function. Notice that we are regressing out the effect of percent mitochondrial reads. If mitochondrial expression is biologically meaningful for your system, compare results with and without this regression.
When you specify vars.to.regress in SCTransform(), those variables are included as additional covariates in the regularized negative binomial regression that SCTransform fits for each gene. The residuals it returns are then computed after accounting for those covariates, so their effect is “regressed out” of the resulting scale.data.
lung_filt <- SCTransform(
lung_filt,
vars.to.regress = "percent.mt",
verbose = FALSE
)Warning: package 'future' was built under R version 4.5.2
After running SCTransform(), the default assay becomes SCT.
DefaultAssay(lung_filt)[1] "SCT"
It is important to know which assay is active. You can check or change the default assay as needed:
# Check default assay
DefaultAssay(object = lung_filt)
# Set default assay
DefaultAssay(object = lung_filt) <- "RNA"The 10x Genomics normalization guide is a useful overview, and the SCTransform vignette provides more details about the method.
PCA
The role of cell clustering is to identify cells with similar transcriptomic profiles. Single-cell RNA-seq data are highly dimensional, with each cell associated with thousands of genes. PCA reduces this complexity by summarizing correlated gene-expression patterns into principal components (PCs).
Seurat uses these PCs as the basis for nearest-neighbor graph construction and clustering. The top PCs act like metafeatures that combine information across correlated genes, which helps reduce noise and speed up downstream analysis. Learn more about PCA here and here.
lung_filt <- RunPCA(lung_filt, assay = "SCT", verbose = FALSE)Exploring the PCA Results
After running PCA, we need to decide which PCs to include downstream. We want enough PCs to retain biological signal, but not so many that we mostly include noise.
Two useful visualizations are DimHeatmap() and ElbowPlot().
The elbow plot produces a ranking of principle components based on the percentage of variance explained by each one.
ElbowPlot(lung_filt, ndims = 40)DimHeatmap() allows us to visualize the top genes contributing to each PC. Both cells and genes are sorted by their principal component scores. By default it includes the top 30 genes with the highest and lowest PC loadings.
DimHeatmap(lung_filt, dims = 1:9, cells = 500, balanced = TRUE, ncol = 3)DimHeatmap(lung_filt, dims = 20:30, cells = 500, balanced = TRUE, ncol = 3)For this tutorial, we will start with the first 30 PCs. That is a reasonable default for an SCT workflow, but the exact choice should be checked against the elbow plot, PC heatmaps, and downstream stability.
Clustering and UMAP
Clustering is used to group cells by similar transcriptomic profiles. Seurat performs graph-based clustering:
-
FindNeighbors()builds a KNN graph using euclidean distances in PCA space, refining edge weights between cells using Jaccard similarity.
-
FindClusters()partitions that graph into communities using modularity optimization.
-
RunUMAP()provides a 2D visualization of the resulting structure.
There are multiple algorithm options for FindClusters(). The Louvain algorithm is used by default, which performs well in benchmarks. If you want to explore further, its successor the Leiden algorithm yields better-connected communities and is more computationally efficient, making it worth considering for more rigorous workflows. Both methods use a resolution parameter to control the granularity. Higher resolution values produce more clusters.
It is difficult to know whether clustering is “correct” at first glance. The right resolution depends on your scientific question. Are you interested in broad cell classes, subtypes, rare states, or injury- associated programs? Try several resolutions and compare marker genes, metadata, and biological interpretability.
Here, we keep resolution fairly low to get a preliminary view of broad cell types and states.
lung_filt <- FindNeighbors(lung_filt, dims = 1:30)Computing nearest neighbor graph
Computing SNN
lung_filt <- FindClusters(lung_filt, resolution = 0.2)Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 25823
Number of edges: 870221
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9685
Number of communities: 23
Elapsed time: 2 seconds
lung_filt <- RunUMAP(lung_filt, dims = 1:30)Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:05:20 UMAP embedding parameters a = 0.9922 b = 1.112
15:05:20 Read 25823 rows and found 30 numeric columns
15:05:20 Using Annoy for neighbor search, n_neighbors = 30
15:05:20 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:05:21 Writing NN index file to temp file /var/folders/hv/m8yfpqmn4sgfw9f0j7xfzjhc3fj3st/T//RtmpiAuEyn/fileca4a74c7efb
15:05:21 Searching Annoy index using 1 thread, search_k = 3000
15:05:26 Annoy recall = 100%
15:05:26 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:05:27 Initializing from normalized Laplacian + noise (using RSpectra)
15:05:28 Commencing optimization for 200 epochs, with 1116462 positive edges
15:05:28 Using rng type: pcg
15:05:35 Optimization finished
Visualize the result:
DimPlot(
lung_filt,
reduction = "umap",
group.by = "orig.ident",
split.by = "treatment",
ncol = 2,
label = FALSE,
alpha = 0.2
)Are the high ncount_RNA cells forming distinct cell populations? Do they sit between clusters?
orig.ident nCount_RNA nFeature_RNA
S99Inf42_GGAGGATGTCGCCACA S99Inf42 149774 7983
S93Hom3_TGTCCTGGTAAGCTCT S93Hom3 102501 6695
S89Inf42_GCCGATGAGGCTAGCA S89Inf42 101166 8330
S93Hom3_GGTGTTAAGGGCGAGA S93Hom3 94399 7925
S89Inf42_CTGCAGGCAGGGTCTC S89Inf42 90959 7914
S99Inf42_GACCCAGTCCCTCTCC S99Inf42 84601 9010
S89Inf42_GCCCAGATCGAAGCCC S89Inf42 79785 7787
S99Inf42_TGCTCGTCACCCTGAG S99Inf42 77016 5122
S93Hom3_TTCCTTCCACGTAACT S93Hom3 75400 7102
S91Hom42_CCAAGCGTCCTAGCTC S91Hom42 75025 4609
S99Inf42_AATTTCCCATGTGACT S99Inf42 72140 9500
S93Hom3_ATTCGTTTCCACTGGG S93Hom3 71425 5010
S99Inf42_CTTCCTTCATCGAACT S99Inf42 67695 7355
S93Hom3_GATGAGGAGATGAATC S93Hom3 65741 6315
S93Hom3_TATTTCGTCGCTCATC S93Hom3 65088 5124
S89Inf42_TCACTATGTTACACAC S89Inf42 64915 7060
S93Hom3_CTCAACCGTGGCTGAA S93Hom3 64624 5347
S93Hom3_CTGTCGTCAGGACTAG S93Hom3 64358 5714
S99Inf42_CTGAGCGCACATTCGA S99Inf42 62867 8540
S89Inf42_TAGACCACATCTGGGC S89Inf42 62550 7613
S99Inf42_ACTGTGATCCTTCTGG S99Inf42 62070 7293
S89Inf42_TCGGGACAGCGATGGT S89Inf42 61224 4491
S91Hom42_CGAAGGACAGGATTCT S91Hom42 60948 5866
S89Inf42_CGATGGCTCTGGCCGA S89Inf42 59699 8169
S89Inf42_AAAGGATGTATGGTTC S89Inf42 59611 5522
S99Inf42_GGAATGGGTCAAGCCC S99Inf42 59446 4917
S93Hom3_AGGAAATAGGGAACAA S93Hom3 58894 4056
S93Hom3_CTAAGTGGTGAGACCA S93Hom3 57782 7181
S93Hom3_TCGGGTGTCACAATGC S93Hom3 56289 4616
S89Inf42_GATGACTTCGGATACT S89Inf42 56262 7721
S89Inf42_CTAGGTACACTCGATA S89Inf42 55003 6356
S89Inf42_CTTGAGACACTGTCCT S89Inf42 53988 7637
S99Inf42_CTTTCGGCATTGGCAT S99Inf42 53850 7842
S91Hom42_TTGACCCCATCTTCGC S91Hom42 53626 3531
S99Inf42_TGACCCTTCAGTCTTT S99Inf42 53527 7076
S89Inf42_TGGAGAGGTCTGTAGT S89Inf42 53235 7231
S99Inf42_TGAGTCAAGCAAGTGC S99Inf42 53143 5309
S89Inf42_AGGTTGTAGGGCCCTT S89Inf42 53078 6341
S93Hom3_GTCAAACAGGTGATAT S93Hom3 52932 4344
S89Inf42_GCCTGTTAGGTCTTTG S89Inf42 52766 7807
S99Inf42_ACAGCCGGTACCGTGC S99Inf42 52560 7829
S93Hom3_AGGTAGGAGGGCAAGG S93Hom3 52511 3970
S89Inf42_GACTTCCTCTACACTT S89Inf42 52423 7652
S91Hom42_AAGACTCTCTTTGCGC S91Hom42 52198 5142
S89Inf42_CCTCATGTCAACTGAC S89Inf42 51731 7527
S91Hom42_GAAACCTAGCGGGTAT S91Hom42 51704 3813
S89Inf42_ACATCCCTCCATAGAC S89Inf42 51671 5030
S93Hom3_ATTTACCTCATTTGGG S93Hom3 51639 4329
S99Inf42_GGTTCTCTCAACGCTA S99Inf42 51596 8059
S99Inf42_GTTTGGAGTGTCTAAC S99Inf42 51168 5279
treatment time_point
S99Inf42_GGAGGATGTCGCCACA Tamoxifen_Influenza Infection_Day42
S93Hom3_TGTCCTGGTAAGCTCT Tamoxifen Homeostasis_Day3
S89Inf42_GCCGATGAGGCTAGCA Tamoxifen_Influenza Infection_Day42
S93Hom3_GGTGTTAAGGGCGAGA Tamoxifen Homeostasis_Day3
S89Inf42_CTGCAGGCAGGGTCTC Tamoxifen_Influenza Infection_Day42
S99Inf42_GACCCAGTCCCTCTCC Tamoxifen_Influenza Infection_Day42
S89Inf42_GCCCAGATCGAAGCCC Tamoxifen_Influenza Infection_Day42
S99Inf42_TGCTCGTCACCCTGAG Tamoxifen_Influenza Infection_Day42
S93Hom3_TTCCTTCCACGTAACT Tamoxifen Homeostasis_Day3
S91Hom42_CCAAGCGTCCTAGCTC Tamoxifen Homeostasis_Day42
S99Inf42_AATTTCCCATGTGACT Tamoxifen_Influenza Infection_Day42
S93Hom3_ATTCGTTTCCACTGGG Tamoxifen Homeostasis_Day3
S99Inf42_CTTCCTTCATCGAACT Tamoxifen_Influenza Infection_Day42
S93Hom3_GATGAGGAGATGAATC Tamoxifen Homeostasis_Day3
S93Hom3_TATTTCGTCGCTCATC Tamoxifen Homeostasis_Day3
S89Inf42_TCACTATGTTACACAC Tamoxifen_Influenza Infection_Day42
S93Hom3_CTCAACCGTGGCTGAA Tamoxifen Homeostasis_Day3
S93Hom3_CTGTCGTCAGGACTAG Tamoxifen Homeostasis_Day3
S99Inf42_CTGAGCGCACATTCGA Tamoxifen_Influenza Infection_Day42
S89Inf42_TAGACCACATCTGGGC Tamoxifen_Influenza Infection_Day42
S99Inf42_ACTGTGATCCTTCTGG Tamoxifen_Influenza Infection_Day42
S89Inf42_TCGGGACAGCGATGGT Tamoxifen_Influenza Infection_Day42
S91Hom42_CGAAGGACAGGATTCT Tamoxifen Homeostasis_Day42
S89Inf42_CGATGGCTCTGGCCGA Tamoxifen_Influenza Infection_Day42
S89Inf42_AAAGGATGTATGGTTC Tamoxifen_Influenza Infection_Day42
S99Inf42_GGAATGGGTCAAGCCC Tamoxifen_Influenza Infection_Day42
S93Hom3_AGGAAATAGGGAACAA Tamoxifen Homeostasis_Day3
S93Hom3_CTAAGTGGTGAGACCA Tamoxifen Homeostasis_Day3
S93Hom3_TCGGGTGTCACAATGC Tamoxifen Homeostasis_Day3
S89Inf42_GATGACTTCGGATACT Tamoxifen_Influenza Infection_Day42
S89Inf42_CTAGGTACACTCGATA Tamoxifen_Influenza Infection_Day42
S89Inf42_CTTGAGACACTGTCCT Tamoxifen_Influenza Infection_Day42
S99Inf42_CTTTCGGCATTGGCAT Tamoxifen_Influenza Infection_Day42
S91Hom42_TTGACCCCATCTTCGC Tamoxifen Homeostasis_Day42
S99Inf42_TGACCCTTCAGTCTTT Tamoxifen_Influenza Infection_Day42
S89Inf42_TGGAGAGGTCTGTAGT Tamoxifen_Influenza Infection_Day42
S99Inf42_TGAGTCAAGCAAGTGC Tamoxifen_Influenza Infection_Day42
S89Inf42_AGGTTGTAGGGCCCTT Tamoxifen_Influenza Infection_Day42
S93Hom3_GTCAAACAGGTGATAT Tamoxifen Homeostasis_Day3
S89Inf42_GCCTGTTAGGTCTTTG Tamoxifen_Influenza Infection_Day42
S99Inf42_ACAGCCGGTACCGTGC Tamoxifen_Influenza Infection_Day42
S93Hom3_AGGTAGGAGGGCAAGG Tamoxifen Homeostasis_Day3
S89Inf42_GACTTCCTCTACACTT Tamoxifen_Influenza Infection_Day42
S91Hom42_AAGACTCTCTTTGCGC Tamoxifen Homeostasis_Day42
S89Inf42_CCTCATGTCAACTGAC Tamoxifen_Influenza Infection_Day42
S91Hom42_GAAACCTAGCGGGTAT Tamoxifen Homeostasis_Day42
S89Inf42_ACATCCCTCCATAGAC Tamoxifen_Influenza Infection_Day42
S93Hom3_ATTTACCTCATTTGGG Tamoxifen Homeostasis_Day3
S99Inf42_GGTTCTCTCAACGCTA Tamoxifen_Influenza Infection_Day42
S99Inf42_GTTTGGAGTGTCTAAC Tamoxifen_Influenza Infection_Day42
treatment_tp percent.mt
S99Inf42_GGAGGATGTCGCCACA Tamoxifen_Influenza_Infection_Day42 2.206658
S93Hom3_TGTCCTGGTAAGCTCT Tamoxifen_Homeostasis_Day3 2.756071
S89Inf42_GCCGATGAGGCTAGCA Tamoxifen_Influenza_Infection_Day42 1.200996
S93Hom3_GGTGTTAAGGGCGAGA Tamoxifen_Homeostasis_Day3 2.725664
S89Inf42_CTGCAGGCAGGGTCTC Tamoxifen_Influenza_Infection_Day42 2.951879
S99Inf42_GACCCAGTCCCTCTCC Tamoxifen_Influenza_Infection_Day42 1.320315
S89Inf42_GCCCAGATCGAAGCCC Tamoxifen_Influenza_Infection_Day42 3.154728
S99Inf42_TGCTCGTCACCCTGAG Tamoxifen_Influenza_Infection_Day42 1.890516
S93Hom3_TTCCTTCCACGTAACT Tamoxifen_Homeostasis_Day3 1.133952
S91Hom42_CCAAGCGTCCTAGCTC Tamoxifen_Homeostasis_Day42 2.088637
S99Inf42_AATTTCCCATGTGACT Tamoxifen_Influenza_Infection_Day42 2.263654
S93Hom3_ATTCGTTTCCACTGGG Tamoxifen_Homeostasis_Day3 2.293315
S99Inf42_CTTCCTTCATCGAACT Tamoxifen_Influenza_Infection_Day42 1.236428
S93Hom3_GATGAGGAGATGAATC Tamoxifen_Homeostasis_Day3 1.832951
S93Hom3_TATTTCGTCGCTCATC Tamoxifen_Homeostasis_Day3 1.975787
S89Inf42_TCACTATGTTACACAC Tamoxifen_Influenza_Infection_Day42 2.871447
S93Hom3_CTCAACCGTGGCTGAA Tamoxifen_Homeostasis_Day3 2.217442
S93Hom3_CTGTCGTCAGGACTAG Tamoxifen_Homeostasis_Day3 1.532055
S99Inf42_CTGAGCGCACATTCGA Tamoxifen_Influenza_Infection_Day42 1.560437
S89Inf42_TAGACCACATCTGGGC Tamoxifen_Influenza_Infection_Day42 3.283773
S99Inf42_ACTGTGATCCTTCTGG Tamoxifen_Influenza_Infection_Day42 3.657161
S89Inf42_TCGGGACAGCGATGGT Tamoxifen_Influenza_Infection_Day42 1.246243
S91Hom42_CGAAGGACAGGATTCT Tamoxifen_Homeostasis_Day42 4.210146
S89Inf42_CGATGGCTCTGGCCGA Tamoxifen_Influenza_Infection_Day42 2.299871
S89Inf42_AAAGGATGTATGGTTC Tamoxifen_Influenza_Infection_Day42 2.316687
S99Inf42_GGAATGGGTCAAGCCC Tamoxifen_Influenza_Infection_Day42 2.284426
S93Hom3_AGGAAATAGGGAACAA Tamoxifen_Homeostasis_Day3 1.292152
S93Hom3_CTAAGTGGTGAGACCA Tamoxifen_Homeostasis_Day3 2.390018
S93Hom3_TCGGGTGTCACAATGC Tamoxifen_Homeostasis_Day3 1.176073
S89Inf42_GATGACTTCGGATACT Tamoxifen_Influenza_Infection_Day42 1.942697
S89Inf42_CTAGGTACACTCGATA Tamoxifen_Influenza_Infection_Day42 2.259877
S89Inf42_CTTGAGACACTGTCCT Tamoxifen_Influenza_Infection_Day42 2.141217
S99Inf42_CTTTCGGCATTGGCAT Tamoxifen_Influenza_Infection_Day42 2.378830
S91Hom42_TTGACCCCATCTTCGC Tamoxifen_Homeostasis_Day42 1.131914
S99Inf42_TGACCCTTCAGTCTTT Tamoxifen_Influenza_Infection_Day42 4.173595
S89Inf42_TGGAGAGGTCTGTAGT Tamoxifen_Influenza_Infection_Day42 2.783883
S99Inf42_TGAGTCAAGCAAGTGC Tamoxifen_Influenza_Infection_Day42 2.022844
S89Inf42_AGGTTGTAGGGCCCTT Tamoxifen_Influenza_Infection_Day42 3.619202
S93Hom3_GTCAAACAGGTGATAT Tamoxifen_Homeostasis_Day3 1.209098
S89Inf42_GCCTGTTAGGTCTTTG Tamoxifen_Influenza_Infection_Day42 3.684191
S99Inf42_ACAGCCGGTACCGTGC Tamoxifen_Influenza_Infection_Day42 1.953957
S93Hom3_AGGTAGGAGGGCAAGG Tamoxifen_Homeostasis_Day3 2.237626
S89Inf42_GACTTCCTCTACACTT Tamoxifen_Influenza_Infection_Day42 1.287603
S91Hom42_AAGACTCTCTTTGCGC Tamoxifen_Homeostasis_Day42 1.312311
S89Inf42_CCTCATGTCAACTGAC Tamoxifen_Influenza_Infection_Day42 1.650848
S91Hom42_GAAACCTAGCGGGTAT Tamoxifen_Homeostasis_Day42 1.522126
S89Inf42_ACATCCCTCCATAGAC Tamoxifen_Influenza_Infection_Day42 2.254650
S93Hom3_ATTTACCTCATTTGGG Tamoxifen_Homeostasis_Day3 1.413660
S99Inf42_GGTTCTCTCAACGCTA Tamoxifen_Influenza_Infection_Day42 1.876114
S99Inf42_GTTTGGAGTGTCTAAC Tamoxifen_Influenza_Infection_Day42 4.065041
nCount_SCT nFeature_SCT SCT_snn_res.0.2
S99Inf42_GGAGGATGTCGCCACA 7473 720 17
S93Hom3_TGTCCTGGTAAGCTCT 8595 697 13
S89Inf42_GCCGATGAGGCTAGCA 8599 2738 13
S93Hom3_GGTGTTAAGGGCGAGA 7869 1303 17
S89Inf42_CTGCAGGCAGGGTCTC 10584 1912 17
S99Inf42_GACCCAGTCCCTCTCC 5368 2663 14
S89Inf42_GCCCAGATCGAAGCCC 9425 2063 13
S99Inf42_TGCTCGTCACCCTGAG 7640 527 13
S93Hom3_TTCCTTCCACGTAACT 7852 1173 17
S91Hom42_CCAAGCGTCCTAGCTC 7653 482 13
S99Inf42_AATTTCCCATGTGACT 5643 3201 17
S93Hom3_ATTCGTTTCCACTGGG 8598 511 13
S99Inf42_CTTCCTTCATCGAACT 6574 1938 20
S93Hom3_GATGAGGAGATGAATC 7962 917 17
S93Hom3_TATTTCGTCGCTCATC 8347 609 13
S89Inf42_TCACTATGTTACACAC 8690 2441 13
S93Hom3_CTCAACCGTGGCTGAA 8207 645 13
S93Hom3_CTGTCGTCAGGACTAG 8373 695 17
S99Inf42_CTGAGCGCACATTCGA 5556 2894 14
S89Inf42_TAGACCACATCTGGGC 8178 2702 13
S99Inf42_ACTGTGATCCTTCTGG 6748 1477 17
S89Inf42_TCGGGACAGCGATGGT 11718 698 13
S91Hom42_CGAAGGACAGGATTCT 8013 1287 13
S89Inf42_CGATGGCTCTGGCCGA 7406 3427 14
S89Inf42_AAAGGATGTATGGTTC 8622 1873 8
S99Inf42_GGAATGGGTCAAGCCC 7521 455 13
S93Hom3_AGGAAATAGGGAACAA 8810 374 13
S93Hom3_CTAAGTGGTGAGACCA 7268 1728 17
S93Hom3_TCGGGTGTCACAATGC 8349 563 13
S89Inf42_GATGACTTCGGATACT 7363 3071 1
S89Inf42_CTAGGTACACTCGATA 8885 2094 13
S89Inf42_CTTGAGACACTGTCCT 7228 3188 14
S99Inf42_CTTTCGGCATTGGCAT 6112 2759 4
S91Hom42_TTGACCCCATCTTCGC 7838 433 13
S99Inf42_TGACCCTTCAGTCTTT 6292 1988 13
S89Inf42_TGGAGAGGTCTGTAGT 8375 2608 13
S99Inf42_TGAGTCAAGCAAGTGC 7202 861 17
S89Inf42_AGGTTGTAGGGCCCTT 8090 2001 13
S93Hom3_GTCAAACAGGTGATAT 8413 585 13
S89Inf42_GCCTGTTAGGTCTTTG 7763 3379 4
S99Inf42_ACAGCCGGTACCGTGC 5647 2980 14
S93Hom3_AGGTAGGAGGGCAAGG 8546 453 13
S89Inf42_GACTTCCTCTACACTT 7807 3447 12
S91Hom42_AAGACTCTCTTTGCGC 7736 978 13
S89Inf42_CCTCATGTCAACTGAC 7300 3191 14
S91Hom42_GAAACCTAGCGGGTAT 7729 545 13
S89Inf42_ACATCCCTCCATAGAC 9279 1009 13
S93Hom3_ATTTACCTCATTTGGG 8297 562 13
S99Inf42_GGTTCTCTCAACGCTA 5679 3021 12
S99Inf42_GTTTGGAGTGTCTAAC 7136 949 17
seurat_clusters
S99Inf42_GGAGGATGTCGCCACA 17
S93Hom3_TGTCCTGGTAAGCTCT 13
S89Inf42_GCCGATGAGGCTAGCA 13
S93Hom3_GGTGTTAAGGGCGAGA 17
S89Inf42_CTGCAGGCAGGGTCTC 17
S99Inf42_GACCCAGTCCCTCTCC 14
S89Inf42_GCCCAGATCGAAGCCC 13
S99Inf42_TGCTCGTCACCCTGAG 13
S93Hom3_TTCCTTCCACGTAACT 17
S91Hom42_CCAAGCGTCCTAGCTC 13
S99Inf42_AATTTCCCATGTGACT 17
S93Hom3_ATTCGTTTCCACTGGG 13
S99Inf42_CTTCCTTCATCGAACT 20
S93Hom3_GATGAGGAGATGAATC 17
S93Hom3_TATTTCGTCGCTCATC 13
S89Inf42_TCACTATGTTACACAC 13
S93Hom3_CTCAACCGTGGCTGAA 13
S93Hom3_CTGTCGTCAGGACTAG 17
S99Inf42_CTGAGCGCACATTCGA 14
S89Inf42_TAGACCACATCTGGGC 13
S99Inf42_ACTGTGATCCTTCTGG 17
S89Inf42_TCGGGACAGCGATGGT 13
S91Hom42_CGAAGGACAGGATTCT 13
S89Inf42_CGATGGCTCTGGCCGA 14
S89Inf42_AAAGGATGTATGGTTC 8
S99Inf42_GGAATGGGTCAAGCCC 13
S93Hom3_AGGAAATAGGGAACAA 13
S93Hom3_CTAAGTGGTGAGACCA 17
S93Hom3_TCGGGTGTCACAATGC 13
S89Inf42_GATGACTTCGGATACT 1
S89Inf42_CTAGGTACACTCGATA 13
S89Inf42_CTTGAGACACTGTCCT 14
S99Inf42_CTTTCGGCATTGGCAT 4
S91Hom42_TTGACCCCATCTTCGC 13
S99Inf42_TGACCCTTCAGTCTTT 13
S89Inf42_TGGAGAGGTCTGTAGT 13
S99Inf42_TGAGTCAAGCAAGTGC 17
S89Inf42_AGGTTGTAGGGCCCTT 13
S93Hom3_GTCAAACAGGTGATAT 13
S89Inf42_GCCTGTTAGGTCTTTG 4
S99Inf42_ACAGCCGGTACCGTGC 14
S93Hom3_AGGTAGGAGGGCAAGG 13
S89Inf42_GACTTCCTCTACACTT 12
S91Hom42_AAGACTCTCTTTGCGC 13
S89Inf42_CCTCATGTCAACTGAC 14
S91Hom42_GAAACCTAGCGGGTAT 13
S89Inf42_ACATCCCTCCATAGAC 13
S93Hom3_ATTTACCTCATTTGGG 13
S99Inf42_GGTTCTCTCAACGCTA 12
S99Inf42_GTTTGGAGTGTCTAAC 17
# Or define a high-count subset, for example top 1%
cutoff <- quantile(lung_filt$nCount_RNA, 0.99)
high_count_cells <- WhichCells(lung_filt, expression = nCount_RNA > cutoff)
length(high_count_cells)[1] 259
head(high_count_cells)[1] "S89Inf42_AAAGAACTCGGAGATG" "S89Inf42_AAAGGATGTATGGTTC"
[3] "S89Inf42_AATTCCTCAACAAGTA" "S89Inf42_ACATCCCTCCATAGAC"
[5] "S89Inf42_ACCTACCGTTATGGTC" "S89Inf42_ACCTGTCAGCACAAAT"
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
3 2 38 0 16 3 3 4 6 0 0 1 9 93 48 0 0 22 0 0 9 2 0
Do you have canonical markers?
Canonical markers are curated, biologically validated genes known to define specific cell types (e.g., PTPRC for immune cells, EPCAM for epithelial cells). You can use these directly to annotate clusters by visualizing expression with FeaturePlot(), VlnPlot(), or DotPlot(). Look for which clusters highly express your markers of interest. You can combine this with FindAllMarkers to statistically confirm annotations as well as other tools to annotate your clusters.
FeaturePlot(lung_filt, features = c( "Epcam","Ptprc"))Finding Cluster Biomarkers
Once we have clusters, we can use differential expression analysis to find potential marker genes that define each cluster. These marker genes are useful for assigning broad cell identities and for checking whether the clustering resolution is biologically sensible.
Because we are working with an SCT assay, run PrepSCTFindMarkers() before differential expression testing. This corrects the counts and joins the data when the seurat object includes multiple samples and multiple SCT models.
lung_filt <- PrepSCTFindMarkers(lung_filt, verbose = TRUE)Found 4 SCT models. Recorrecting SCT counts using minimum median counts: 5378
Now test each cluster against all other cells using FindAllMarkers. only.pos = TRUE keeps genes that are enriched in the cluster.
There are many options for differential expression testing. See the test.use parameter in ?FindAllMarkers(). By default, Seurat uses a Wilcoxon Rank Sum test.
The R package, presto is used to speed up the DE analysis; it is not required but is recommended. You may encounter issues when installing presto on Biowulf due to dependency requirements. You may find that downgrading RcppArmadillo to an earlier version, for example, “14.0.2-1” may help (remotes::install_version("RcppArmadillo", version = "14.0.2-1", repos = "https://cran.r-project.org")). If not, feel free to email us at ncibtep@nih.gov.
lung_filt_markers <- FindAllMarkers(lung_filt, only.pos = TRUE)Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16
Calculating cluster 17
Calculating cluster 18
Calculating cluster 19
Calculating cluster 20
Calculating cluster 21
Calculating cluster 22
# A tibble: 230 × 7
p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 2.09e-101 3.19 0.043 0.006 4.26e- 97 0 Cd1d2
2 3.27e-118 3.03 0.055 0.008 6.64e-114 0 Gm47625
3 8.52e- 24 2.98 0.01 0.001 1.73e- 19 0 Gm13715
4 5.68e- 50 2.94 0.025 0.004 1.15e- 45 0 Slco1c1
5 4.80e- 24 2.80 0.012 0.002 9.76e- 20 0 Gm50242
6 1.70e- 21 2.73 0.01 0.002 3.45e- 17 0 Lalba
7 9.38e- 82 2.69 0.042 0.007 1.91e- 77 0 Gm14221
8 1.73e- 73 2.59 0.042 0.008 3.51e- 69 0 Gm43504
9 9.74e- 53 2.58 0.027 0.005 1.98e- 48 0 Slc26a10
10 6.76e- 40 2.57 0.023 0.005 1.38e- 35 0 Gm31834
# ℹ 220 more rows
In the marker table, pct.1 is the percentage of cells in the cluster where the gene is detected, while pct.2 is the percentage of cells on average in all the other clusters where the gene is detected. A strong cluster marker is usually enriched in one cluster and low in others. You should also see a high avg_log2FC between your cluster of interest and all other clusters.
You can combine these results with visualizations such as heat maps (DoHeatmap()), violin plots (VlnPlot()), feature scatter plots (FeaturePlot()), and dot plots (DotPlot()).
These markers are only a starting point. Final annotation should combine marker expression, known biology, and results from reference-based annotation tools.
Cell Annotation
There are many tools and strategies available for cell annotation. A few commonly used options include:
Automated annotation should be treated as decision support, not ground truth. Always validate proposed labels against marker genes, QC metrics, metadata, and the biological context of the lung regeneration dataset.
Save the Clustered Object
saveRDS(lung_filt, "./outputs/merged_Seurat_lung_sct_clustered.rds")Next Steps
At this point, ask:
- Do clusters mix samples reasonably well?
- Are cells grouping by biology or mostly by sample?
- Do treatment and time point seem to explain expected structure?
- Do known lung marker genes help explain major clusters?
If cells cluster mainly by sample rather than biological state, that is when it becomes time to consider an integration workflow.
At this stage, we have completed a standard preprocessing workflow: quality control, normalization, dimensional reduction, clustering, visualization, and an initial marker-gene workflow. This gives us a useful first view of the structure in the dataset, but it is only the beginning of a full single-cell analysis.
Typical next steps may include:
-
Integration - if cells cluster primarily by sample or batch rather than biology, consider an integration workflow.
-
Cluster annotation - identify the major cell populations using marker genes, reference datasets, and annotation tools.
-
Subsetting and re-clustering - once broad cell classes are identified, subset populations of interest and repeat clustering at higher resolution.
-
Differential expression - compare clusters, conditions, or time points to identify genes associated with particular cell states or experimental effects.
- Cell composition analysis - test whether the proportions of major cell types differ across treatment groups or time points.
- Trajectory or state-transition analysis - for dynamic systems such as injury and regeneration, model transitions between cell states.
There is no single correct order for all downstream analyses. The right next step depends on your biological question, the quality of the clustering, and whether technical effects appear to be influencing the results.
References
The following resources were particularly helpful for this lesson:
- Seurat v5 getting started
- Seurat v5 announcements
- Seurat guided clustering tutorial
- SCTransform vignette
- Single-cell best practices
- HBC Training scRNA-seq lessons.
-
Luecken MD, Theis FJ. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol. 2019 Jun 19;15(6):e8746. doi: 10.15252/msb.20188746
- Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet 24, 550–572 (2023). https://doi.org/10.1038/s41576-023-00586-w