install.packages("Seurat")scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat
Learning Objectives
- Learn about options for analyzing your scRNA-Seq data.
- Learn about resources for learning R / Python programming.
- Learn about AI resources
- Learn how to import your data for working with R.
- Learn about Seurat and the Seurat object including how to create the object and access and manipulate stored data.
Exploratory Data Analysis: scRNA-Seq
This tutorial assumes that all pre-processing steps (read demultiplexing, FASTQ QC, reference based alignment, error correction) have been completed. At this stage of the analysis, a gene-by-cell count matrix has been generated for each sample.
Analysis Ecosystems
Following Cell Ranger or other preprocessing tools, you will typically obtain a gene-by-cell count matrix for each sample. The three dominant open-source ecosystems for downstream single-cell analysis are:
1. R — Seurat
https://satijalab.org/seurat/articles/get_started_v5_new
Seurat (Satija Lab) is a unified, high-level framework for single-cell transcriptomics, multimodal data (e.g., RNA + ATAC), and spatial transcriptomics. It has a large user base and extensive documentation, making it an excellent starting point for new analysts.
Seurat v5 introduced a layered assay structure and improved support for large-scale datasets, including disk-backed matrices and sketch-based workflows for million-cell analyses. The ecosystem also supports reference mapping (e.g., Azimuth) and integrates well with multimodal workflows.
2. R — Bioconductor (SingleCellExperiment ecosystem)
https://bioconductor.org/books/release/OSCA/
Bioconductor provides a modular, statistically rigorous ecosystem for analyzing biological data. Single-cell workflows are built around the SingleCellExperiment object and leverage highly interoperable packages (70+) including scater, scran, scuttle, and batchelor.
Rather than a single monolithic framework, Bioconductor emphasizes composability and statistical transparency.
Resources:
-
Orchestrating Single-Cell Analysis with Bioconductor - focuses on scRNA-Seq
- Orchestrating Spatial Transcriptomics Analysis with Bioconductor - focuses on spatial transcriptomics
3. Python — scverse
The scverse ecosystem is a collection of interoperable Python packages centered around the AnnData object. Core tools include:
-
scanpy(core analysis workflows) -
scvi-tools(deep generative modeling, batch correction, multimodal integration) -
muon/mudata(multimodal analysis) -
scirpy(immune repertoire) -
squidpy(spatial transcriptomics)
scverse is widely adopted in large atlas-scale projects and is particularly strong for machine learning–based workflows and scalable analysis.
Resources:
- Single-cell best practices - covers different analysis ecosystems and workflows.
These ecosystems require familiarity with either R or Python. Your choice will likely depend on your background, your lab’s preferences, and your scientific goals.
There are also tools outside of R and Python.
In practice, you may benefit from using tools across multiple ecosystems. Seurat, Bioconductor, and scverse use different object classes, but they store many of the same core components: expression matrices, cell metadata, feature metadata, dimensional reductions, graphs, and other analysis results.
| Concept | Seurat |
SingleCellExperiment |
AnnData |
|---|---|---|---|
| Main expression data |
assays, such as the RNA assay |
assays, such as counts or logcounts
|
X and layers
|
| Cell metadata | meta.data |
colData |
obs |
| Gene / feature metadata | stored within assays | rowData |
var |
| Dimensional reductions | reductions |
reducedDims |
obsm |
| Graphs / cell-cell relationships |
graphs, neighbors
|
colPairs |
obsp |
| Multimodal support | Multiple assays, e.g., RNA, ATAC, ADT |
altExp() for alternative feature sets; MultiAssayExperiment for broader multi-omics |
Usually one modality per AnnData; MuData object for multimodal |
| Miscellaneous results |
misc, tools, commands
|
metadata |
uns |
One important difference is matrix orientation. In Seurat and SingleCellExperiment, expression matrices are typically stored as features x cells. In AnnData, the primary matrix is stored as observations x variables, which for scRNA-seq usually means cells x genes. Conversion tools handle this transpose, but it is a common source of confusion when manually inspecting objects.
A useful mental map is:
-
Seurat@meta.data,colData(sce), andadata.obsall describe cells. - Seurat assay-level metadata describes genes or features, although some user- or tool-generated feature information may also appear in
misc; the closest equivalents arerowData(sce)andadata.var. - Seurat assay
layers, SCEassays, and AnnDatalayerscan store different versions of the expression matrix, such as raw counts, normalized data, or scaled data. - PCA, UMAP, and t-SNE coordinates live in Seurat
reductions, SCEreducedDims, and AnnDataobsm.
AnnData (.h5ad) remains the most useful cross-platform exchange format for moving single-cell data between Python, Seurat, and Bioconductor workflows. For R-based conversion, anndataR is now our preferred starting point because it can read and write .h5ad files and convert to/from both Seurat and SingleCellExperiment objects. zellkonverter remains useful for SingleCellExperiment <-> AnnData workflows, especially in Bioconductor-centered analyses.
Check out the anndataR documentation for helpful diagrams showing how AnnData maps to Seurat and SingleCellExperiment.
You may still see older tutorials recommending SeuratDisk for .h5Seurat <-> .h5ad conversion, but it does not appear to be actively maintained for current Seurat workflows and is not a reliable choice for Seurat v5 objects.
Even with better tooling, conversion can introduce edge cases. After conversion, verify that cell and gene identifiers, count layers, metadata, embeddings, and dimensional reductions are where you expect them. Older or non-standard .h5ad files can also cause problems; in those cases, it may be easiest to open the file in a Python anndata / scanpy environment and write out an updated .h5ad before converting again.
If you lack programming skills and learning to code seems daunting, there are GUI based alternatives including, but not limited to the following:
- Partek Flow (License available to NCI researchers)
- Qlucore Omics Explorer (License available to NCI-CCR researchers).
Python Learning Resources
R Learning Resources
-
R Cheat Sheets.
-
R Crash Course
- Free tutorials. There are at least 230 Git repositories that focus on R training related to bioinformatics. Check out Glittr.
BTEP lessons / courses
Python Vs R
The argument isn’t necessarily about which language is better, but which one you are most comfortable with and can get the job done.
For the remainder of this tutorial, we will focus on the Seurat ecosystem.
What is Seurat?
Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data. —https://satijalab.org/seurat/
Advantages:
- beginning-to-end workflows
- Nice vignettes
- Regularly updated
- Somewhat extensible
- Scalable
Disadvantages:
- Updates break some functionality
- Interoperability is lacking
- Maintained by a single group
- Many functions are poorly documented
Updates with Seurat v5
You can read about major changes between Seurat v5 and v4 here.
Seurat v5 introduced the following new features:
- Integrative multi-modal analysis with bridge integration
- ‘Sketch’-based analysis of large data sets
- methods for spatial transcriptomics
- assay layers
How much R programming do I need to know to use Seurat?
The answer to this question will largely depend on the user. While some will be able to learn R on the go, others will need to know some amount of R programming prior to beginning their analysis. If needed, check out this quick R Refresher. There are also several R courses available through BTEP with self-contained, asynchronous content that can be used to learn independently of live sessions.
The most important thing to know is how to find help!
How to find help?
ALWAYS, ALWAYS, ALWAYS read the documentation.
Use the help pane in the lower right of RStudio or the functions, help() and help.search() or ? and ??.
Check out package vignettes (vignette()).
Check the Github site if one is available for help with a specific package. There may also be known issues listed under the Github Issues tab (Seurat issues).
Google for help!
Use Google or other internet browser to troubleshoot error message or find help performing a specific task. But, make sure you are precise and informative in your search. Stack Overflow is a particularly great resource. A good Google search includes 3 elements:
- The specific action (e.g., how to rename a column).
- The programming language (e.g., R programming).
- The specific style/technique for coding (e.g., dplyr or tidyverse package) Example: “How to rename a column in R with dplyr/tidyverse”. — https://crd230.github.io/tips.html.
If this fails, feel free to email us at BTEP.
Use Generative AI
Use Generative AI for help. For a list of generative AI resources available to NIH researchers, see our BTEP GenAI Resource Guide.
GenAI and Single Cell
Generative AI is beginning to support single-cell analysis by helping with workflow guidance, cluster annotation, marker interpretation, and summarizing results.
* SCassist is one example in the Seurat ecosystem; it uses large language models to help guide parameter choices and interpret scRNA-seq results.
- CellWhisperer is a newer natural-language interface for exploring single-cell datasets, making it easier to query and interpret results conversationally.
- scGPT and Geneformer are examples of single-cell foundation models trained on large transcriptomic datasets. These are primarily research tools and may support tasks such as cell annotation, representation learning, and prediction.
- Although these tools are promising, AI-generated suggestions should be treated as decision support rather than ground truth and should always be checked against QC metrics, marker genes, metadata, and biological context.
Accessing R / RStudio
Using R locally
Single cell analysis can be computationally demanding. Running the analysis locally will likely only be possible with smaller sample sizes ( <= 8 samples).
Mac or Windows (local installation): Follow instructions here: https://posit.co/download/rstudio-desktop/.
NIH-managed machines: Follow your institute’s IT installation policies or submit a service request if administrative privileges are required.
BPCells
The package BPCells can be used to manage SeuratObjects and make your analysis more computationally efficient. BPCells is an R package that stores single-cell count matrices on disk in a compressed format so very large datasets can be analyzed with much lower memory use. However, if your Seurat object fits comfortably in memory, don’t use BPCells just because it exists. Once you are in the 100k+ cell range, merging many samples, or starting to hit RAM limits, BPCells becomes worthwhile. For atlas-scale or million-cell work, it is the recommended direction, but expect some compatibility tradeoffs with plotting and third-party tools.
See this vignette for more information.
Sketch-Based Assays with Seurat v5
A sketch-based assay in Seurat v5 is a smaller, in-memory assay built from a representative subset of cells from a much larger dataset, typically using leverage-score sampling so that both common and rare cell states are better preserved than with simple random subsampling. Consider using a sketch when your dataset is too large for convenient in-memory analysis, such as large multi-sample or million-cell studies, but remember that the sketch is still an approximation: important populations should be checked for stability across sketch sizes and validated by projecting results back to the full dataset. Useful references: Seurat v5 sketch analysis vignette, SketchData reference, LeverageScore reference, and BPCells with Seurat Objects.
HPC Open OnDemand
Due to limits on computational resources, you may be interested in running your analysis on an HPC. Biowulf is the NIH high performance compute cluster. It has greater than 90k processors, and can easily perform large numbers of simultaneous jobs. Biowulf also includes greater than 600 pre-installed scientific software and databases. To use the NIH HPC Systems, you must have an NIH HPC account.
When using R interactively, the easiest way to access R on Biowulf is via an integrated development environment (e.g. RStudio, VSCode, etc.). These interactive applications are available on Biowulf through NIH HPC Open OnDemand.
To Connect to R with HPC Open OnDemand
Connect to the NIH network and use the following link to access Open OnDemand: https://hpcondemand.nih.gov.
Login with your PIV Card.
Select the interactive app of interest (e.g., RStudio).
-
Adjust computational resources for your session (CPUs, memory, local scratch, etc.) and Launch.
- A good starting point for scRNA-Seq:
- Number of Hours - 8
- Number of CPUs - 8
- Allocated Memory (GB) - 100
- Allocated Local Scratch (GB) - 50
- Number of Hours - 8
- A good starting point for scRNA-Seq:
Start the session.
If you do not already have a Biowulf account, you can obtain one by following the instructions here. NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory. Obtaining an account requires PI approval and a nominal fee of $40 per month. Accounts are renewed annually contingent upon PI approval.
When you apply for a Biowulf account you will be issued two primary storage spaces:/home/$USER (16 GB)/data/$USER (100 GB)
You may request more space in /data/$USER by filing an online storage request.
NIH HPC Documentation
If you need help with Biowulf, the NIH HPC systems are well-documented at hpc.nih.gov. The User guides, Training documentation, and How To docs are also fantastic resources for getting help with most HPC tasks.
For all other questions, consider emailing the HPC team directly at staff@hpc.nih.gov.
Getting Started with Seurat
The Data
In this tutorial, we are using a subset of data from Niethamer et al. 2025, Longitudinal single-cell profiles of lung regeneration after viral infection reveal persistent injury-associated cell states, published in Cell Stem Cell. Count matrices are available in GEO here, and the raw fastq files are available in the SRA.
A project directory containing these data and associated files can be downloaded here.
This study examined lung regeneration following influenza viral infection using single-cell RNA sequencing to track cell states longitudinally. The full study profiled multiple genotypes; however, the four samples used in this tutorial were all generated from mice carrying the Ki67-CreERT2; ROSA26-LSL-tdTomato reporter system, which enables lineage tracing of proliferating cells following tamoxifen induction. Our subset includes lung tissue collected at two homeostasis time points (day 3 and day 42) and two samples from the post-infection condition at day 42. The two post-infection samples (GSM8181589 and GSM8181599) serve as biological replicates of the infected condition at day 42, enabling comparison of injury-associated lung cell states against homeostatic controls.
| Sample | Species | Genotype | Treatment | Time Point | Status |
|---|---|---|---|---|---|
GSM8181589 (S89Inf42) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen + Influenza | Day 42 post-infection | Lung (post-infection) |
GSM8181591 (S91Hom42) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen | Homeostasis Day 42 | Lung (homeostasis) |
GSM8181593 (S93Hom3) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen | Homeostasis Day 3 | Lung (homeostasis) |
GSM8181599 (S99Inf42) |
Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen + Influenza | Day 42 post-infection | Lung (post-infection) |
Installation
Seurat can be installed directly from CRAN.
If you would like to install the development version or previous versions, see the installation instructions available here.
The version of R on Biowulf used for this tutorial is 4.5.2 (module -r avail '^R$'), and the version of Seurat that complements this installation is v5.4.0. Seurat does not need to be installed if using R on Biowulf.
This tutorial does not use renv for package management. Instead, we rely on the HPC’s centrally managed R library, which ensures that all packages are compiled against the correct system libraries. If you are running this tutorial on a personal machine or a different computing environment, you may need to install the required packages manually and may encounter system dependency issues not covered here.
Load the package from your R library. We will also go ahead and load packages from the tidyverse.
Seurat Pre-processing Overview
The steps below outline the standard Seurat preprocessing workflow, from raw count matrices to clustered cells. Before diving into the code, it is helpful to understand both the overall sequence of steps and the key decision you will make early on: Your analysis strategy (See Note 1).
Core preprocessing steps
Regardless of strategy, the major steps are the same:
Import data — read count matrices from Cell Ranger / STARsolo output or other (h5, mtx, or similar formats).
Quality control — filter low-quality cells based on the number of detected genes (
nFeature_RNA), UMI counts (nCount_RNA), the percentage of mitochondrial reads, and other factors.Normalization — scale gene expression values to account for differences in sequencing depth across cells (e.g., log-normalization) (
NormalizeData()).Feature selection — identify highly variable genes that drive biological differences, reducing noise from lowly or uniformly expressed genes (
FindVariableFeatures()). Handled automatically bySCTransform().-
Scaling — center and scale each gene’s expression across cells so that highly expressed genes do not dominate downstream analysis (
ScaleData()). This also allows regression of unwanted sources of variation (e.g. cell cycle, mitochondrial content).Note:
SCTransform()replaces both normalization and scaling (steps 3–4) in a single step using regularized negative binomial regression. It simultaneously corrects for sequencing depth and scales the data, and also performs feature selection (step 5). Dimensionality reduction — use PCA to compress the data, followed by UMAP or t-SNE for visualization.
Clustering — build a shared nearest-neighbor graph and apply a community detection algorithm to group cells with similar expression profiles.
Downstream analysis — identify marker genes, annotate cells, and compare conditions.
A key early decision is whether to work with a single merged object or individual per-sample objects. Both approaches support per-sample QC — the difference is primarily one of workflow style and how interactively you want to inspect and process each sample.
Strategy A: Merge early, use layers (this tutorial)
In Seurat v5, you can merge multiple samples into a single object while retaining per-sample count data in separate layers (e.g., counts.GSM8181589, counts.GSM8181591). QC metrics can be visualized across all samples simultaneously, and thresholds can still be applied on a per-sample basis by filtering on orig.ident. This approach works well for interactive, exploratory analysis where you want to compare samples side by side before deciding on cutoffs.
Best for: interactive workflows where you want to inspect all samples together, compare QC distributions across samples, and make informed filtering decisions before proceeding.
Considerations: Requires more hands-on decision-making at the QC stage, as thresholds are applied manually after inspecting the merged object.
Strategy B: Per-sample objects, process and merge later
Alternatively, you can create a separate Seurat object for each sample, run QC and normalization independently, and then merge or integrate the filtered objects. This approach lends itself to scripted, automated pipelines where each sample is processed with fixed thresholds from QC through to clustering without manual intervention between steps.
Best for: automated or reproducible pipelines where the same workflow is applied to each sample in sequence, or when samples need to be processed independently before any comparison.
Considerations: More objects to manage, but the workflow can be more easily scripted end-to-end. Samples can later be merged with merge().
If you want to process each sample as an individual object, you can create them using a for loop (see the collapsible tip in the Creating a Seurat Object section below), run QC and normalization per object, and merge the filtered objects before proceeding.
Whether you ultimately merge samples into a single object or keep them separate, an integration step is likely necessary when comparing across conditions or batches.
Importing Data
Now that we have loaded the package, we can import our data.
Generally, in R programming, functions that involve data import begin with “read / Read”. Seurat includes a number of read functions for different data / platform types.
For 10X scRNA-Seq data, the following functions will be most relevant:
-
Read10X()- primary argument is a directory from CellRanger containing the matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv files.
-
Read10X_h5()- reads the hdf5 file from 10X CellRanger (scATAC-Seq or scRNA-Seq).
-
ReadMtx()- read from local or remote (arguments for the .mtx file, cells/barcodes file, and features/genes file). This option is not just for 10X data, but is useful for most platforms and pipelines.
The function you choose will depend on the format of your data.
There are a number of other read and load functions available for different platforms and data types (e.g., Akoya CODEX, Nanostring SMI, SlideSeq, 10X Visium, 10X Xenium, Vizgen MERFISH). You can find relevant functions here.
If you are trying to import publicly available data, and the available count data was saved as a .txt, .tsv, or .csv file, you can import these data using regular base R import functions (e.g., read.table(), read.csv()). The data does not need to be in a sparse format to create a Seurat object. However, if you would like to save memory upon import by loading as a sparse data matrix, see readSparseCounts() from the Bioconductor package scuttle.
Should I use the filtered for unfiltered data from Cell Ranger?
The raw / unfiltered matrix includes all valid cell barcodes from GEMs (Gel Bead-In EMulsions). This includes GEMs with free-floating mRNA from lysed or dead cells. Cell Ranger uses a barcode vs UMI count rank plot (knee plot) to eliminate such background noise.
While using the raw files will give you greater control over filtering, these files are also much larger. You can visualize the web summary output from Cell Ranger to see what was filtered. If you do not see a clear inflection point, you may want to use the raw files and apply other techniques for filtering.
If you use STARsolo, an alternative to 10X CellRanger for gene quantification output, you can use ReadSTARsolo() to import the data with Seurat.
Loading hdf5 files
To load a single file, we use
S89Inf42 <- Read10X_h5("./data/GSM8181589_EEM-scRNA-126.h5")The only required argument is the path to the 10X h5 file.
To read multiple files, you can use a for loop or a function(s) from the apply family. For example,
# List files
files <- list.files(path = "./data/", pattern = "*.h5")
# Create a list of count matrices
#lapply returns a list
h5_read <- lapply(paste0("./data/", files), Read10X_h5)
# Automated way to assign names using the GSM accession number
#sapply returns a vector
# names(h5_read) <- sapply(files,
# function(x) { str_split_1(x, "_")[1] },
# USE.NAMES = FALSE)
# Or assign names manually
# Useful for new names
names(h5_read) <- c("S89Inf42", "S91Hom42", "S93Hom3", "S99Inf42")Let’s take a look at S89Inf42.
S89Inf42[1:5,1:5]5 x 5 sparse Matrix of class "dgCMatrix"
AAACCCAAGCAACAGC AAACCCAAGGTCGTAG AAACCCACAACTAGAA
4933401J01Rik . . .
Gm26206 . . .
Xkr4 . . .
Gm18956 . . .
Gm37180 . . .
AAACCCACATCGCTGG AAACCCACATGGAAGC
4933401J01Rik . .
Gm26206 . .
Xkr4 . .
Gm18956 . .
Gm37180 . .
The data is stored as a sparse matrix to save CPU, time, and memory. Each . is actually a 0.
Creating a Seurat Object
Once we have read in the matrices, the next step is to create a Seurat object. The Seurat object will be used to store the raw count matrices, sample information, and processed data (normalized counts, reductions, etc.).
CreateSeuratObject() is used to create the object. This requires the matrix of counts for the first argument (counts = S89Inf42). We can also include a project name (e.g., the sample name). Other useful arguments include min.cells and min.features, which allow some initial filtering. For example, min.cells = 10 will filter genes / features that are not present across a minimum of 10 cells, while min.features = 200 will filter cells that do not contain a minimum of 200 genes / features.
# Single sample
S89Inf42 <- CreateSeuratObject(counts = S89Inf42, project = "S89Inf42",
min.cells = 10, min.features = 200)Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')
# View S89Inf42
S89Inf42An object of class Seurat
18500 features across 5797 samples within 1 assay
Active assay: RNA (18500 features, 0 variable features)
1 layer present: counts
For all samples, we can use mapply to loop through the h5_read list and the list names (names(h5_read)).
$S89Inf42
An object of class Seurat
18500 features across 5797 samples within 1 assay
Active assay: RNA (18500 features, 0 variable features)
1 layer present: counts
$S91Hom42
An object of class Seurat
18821 features across 7329 samples within 1 assay
Active assay: RNA (18821 features, 0 variable features)
1 layer present: counts
$S93Hom3
An object of class Seurat
18965 features across 7142 samples within 1 assay
Active assay: RNA (18965 features, 0 variable features)
1 layer present: counts
$S99Inf42
An object of class Seurat
19207 features across 8226 samples within 1 assay
Active assay: RNA (19207 features, 0 variable features)
1 layer present: counts
# Remove the original sparse matrices
rm(h5_read)We can merge our Seurat objects, which are now in a list, in order to view the QC metrics and compare across samples. If samples are fairly different, QC thresholds should be applied on a per sample basis.
The cell IDs can overlap between samples, so here we append the sample ID as a prefix to each cell ID using add.cell.ids. If this isn’t done “and any cell names are duplicated, cell names will be appended with _X, where X is the numeric index of the object in c(x, y)” (See merge.Seurat()).
We could have created a merged object using h5_read, as a list can be provided for the argument counts (DO NOT RUN THIS STEP):
lung2 <- CreateSeuratObject(counts = h5_read, min.cells = 3,
min.features = 200)In Seurat 5, passing a list to CreateSeuratObject() is better supported than in earlier versions. However, this approach still does not give us control over cell IDs, and it bypasses the ability to set per-sample project names, which are useful for tracking samples during QC. For these reasons, we prefer the mapply + merge() approach used above.
If you want a separate object for each object, you can use this for loop:
# Create a Seurat object for each sample
for (file in files){
seurat_data <- Read10X_h5(paste0("./data/", file))
seurat_obj <- CreateSeuratObject(counts = seurat_data,
min.features = 200,
min.cells = 3,
project = file)
assign(file, seurat_obj)
}See here for an explanation of the code.
Inspect and work with a Seurat Object
The Seurat Object is a data container for single cell RNA-Seq and related data. It is an S4 object, which is a type of data structure that stores complex information (e.g., scRNA-Seq count matrix, associated sample information, and data / results generated from downstream analyses) in one or more slots. It is extensible and can interact with many different methods, including those from other developers.
You can see the primary slots using glimpse().
glimpse(S89Inf42)Formal class 'Seurat' [package "SeuratObject"] with 13 slots
..@ assays :List of 1
.. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
..@ meta.data :'data.frame': 5797 obs. of 3 variables:
.. ..$ orig.ident : Factor w/ 1 level "S89Inf42": 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ nCount_RNA : num [1:5797] 5341 3896 9012 6752 1339 ...
.. ..$ nFeature_RNA: int [1:5797] 2214 1513 2617 2595 201 3162 2097 1152 3149 3705 ...
..@ active.assay: chr "RNA"
..@ active.ident: Factor w/ 1 level "S89Inf42": 1 1 1 1 1 1 1 1 1 1 ...
.. ..- attr(*, "names")= chr [1:5797] "AAACCCAAGCAACAGC" "AAACCCACAACTAGAA" "AAACCCACATCGCTGG" "AAACCCACATGGAAGC" ...
..@ graphs : list()
..@ neighbors : list()
..@ reductions : list()
..@ images : list()
..@ project.name: chr "S89Inf42"
..@ misc : list()
..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
.. ..$ : int [1:3] 5 4 0
..@ commands : list()
..@ tools : list()
S89Inf42 has 13 slots: assays, meta.data, active.assay, active.ident, graphs, neighbors, reductions, images, misc, version, commands, and tools. See a description of the slots here. For guidance on accessing data stored in the SeuratObject, see this vignette.
Unlike previous iterations of Seurat, Seurat v5 contains assays in layers to accommodate different assays and data types. “For example, these layers can store: raw counts (layer=‘counts’), normalized data (layer=‘data’), or z-scored/variance-stabilized data (layer=‘scale.data’).”
The diagrams below summarize a typical Seurat v5 object structure. The first shows the top-level slots in the object, and the second zooms in on the RNA assay to show how Seurat v5 stores expression matrices in layers.
flowchart TB
A["Seurat v5 object"]
A --> B["assays"]
A --> C["meta.data"]
A --> D["active.assay"]
A --> E["active.ident"]
A --> F["reductions"]
A --> G["graphs"]
A --> H["neighbors"]
A --> I["images"]
A --> J["misc"]
A --> K["version"]
A --> L["commands"]
A --> M["tools"]
classDef main fill:#143642,color:#ffffff,stroke:#0b1f26,stroke-width:2px;
classDef slot fill:#d7efe9,color:#102a2e,stroke:#4f8a7b,stroke-width:1.5px;
class A main;
class B,C,D,E,F,G,H,I,J,K,L,M slot;
flowchart TB
A["Seurat object"]
A --> B["assays"]
B --> C["RNA assay"]
C --> D["layers"]
D --> E["counts<br/>raw counts"]
D --> F["data<br/>normalized expression"]
D --> G["scale.data<br/>scaled expression"]
E --> H["counts.S89Inf42"]
E --> I["counts.S91Hom42"]
E --> J["counts.S93Hom3"]
E --> K["counts.S99Inf42"]
C --> L["feature metadata"]
C --> M["variable features"]
classDef main fill:#143642,color:#ffffff,stroke:#0b1f26,stroke-width:2px;
classDef slot fill:#d7efe9,color:#102a2e,stroke:#4f8a7b,stroke-width:1.5px;
classDef assay fill:#f7d08a,color:#3b2c18,stroke:#b8871a,stroke-width:1.5px;
classDef layer fill:#f5f1e8,color:#2f2a1f,stroke:#a08d62,stroke-width:1px;
classDef note fill:#f3f7fb,color:#213547,stroke:#9db4c7,stroke-width:1px;
class A main;
class B slot;
class C,D assay;
class E,F,G,H,I,J,K layer;
class L,M note;
In a newly created object, many slots are still empty. As you run preprocessing, clustering, dimensionality reduction, and annotation, Seurat progressively fills these slots with new results.
To return the names of the layers:
Layers(S89Inf42)[1] "counts"
To access a specific count layer use:
Because we have just created this object, most of the slots are empty. We will save results back to the object and fill the slots as we work.
You can always check the version of Seurat that was used to build the object with:
S89Inf42@version[1] '5.4.0'
And check commands and parameters used to generate results with:
S89Inf42@commandsWe haven’t worked on the object, so there is nothing here. Compare with pbmc_small, a built-in dataset in the SeuratObject package.
head(pbmc_small@commands, 1)$NormalizeData.RNA
Command: NormalizeData(object = pbmc_small)
Time: 2018-08-27 22:32:17.136699
assay : RNA
normalization.method : LogNormalize
scale.factor : 10000
verbose : TRUE
This information is useful for documenting your analysis.
Examine metadata
The metadata in the Seurat object is located in lung@meta.data and contains the information associated with each cell.
We can access the metadata using:
head(lung@meta.data) # using head to return only the first 6 rows orig.ident nCount_RNA nFeature_RNA
S89Inf42_AAACCCAAGCAACAGC S89Inf42 5341 2214
S89Inf42_AAACCCACAACTAGAA S89Inf42 3896 1513
S89Inf42_AAACCCACATCGCTGG S89Inf42 9012 2617
S89Inf42_AAACCCACATGGAAGC S89Inf42 6752 2595
S89Inf42_AAACCCAGTATGGAGC S89Inf42 1339 201
S89Inf42_AAACCCAGTCATCACA S89Inf42 14257 3162
or
head(lung[[]]) orig.ident nCount_RNA nFeature_RNA
S89Inf42_AAACCCAAGCAACAGC S89Inf42 5341 2214
S89Inf42_AAACCCACAACTAGAA S89Inf42 3896 1513
S89Inf42_AAACCCACATCGCTGG S89Inf42 9012 2617
S89Inf42_AAACCCACATGGAAGC S89Inf42 6752 2595
S89Inf42_AAACCCAGTATGGAGC S89Inf42 1339 201
S89Inf42_AAACCCAGTCATCACA S89Inf42 14257 3162
Accessing a column or columns:
# Access a single column
head(lung$orig.ident)S89Inf42_AAACCCAAGCAACAGC S89Inf42_AAACCCACAACTAGAA S89Inf42_AAACCCACATCGCTGG
"S89Inf42" "S89Inf42" "S89Inf42"
S89Inf42_AAACCCACATGGAAGC S89Inf42_AAACCCAGTATGGAGC S89Inf42_AAACCCAGTCATCACA
"S89Inf42" "S89Inf42" "S89Inf42"
head(lung[["orig.ident"]]) orig.ident
S89Inf42_AAACCCAAGCAACAGC S89Inf42
S89Inf42_AAACCCACAACTAGAA S89Inf42
S89Inf42_AAACCCACATCGCTGG S89Inf42
S89Inf42_AAACCCACATGGAAGC S89Inf42
S89Inf42_AAACCCAGTATGGAGC S89Inf42
S89Inf42_AAACCCAGTCATCACA S89Inf42
orig.ident nCount_RNA
S89Inf42_AAACCCAAGCAACAGC S89Inf42 5341
S89Inf42_AAACCCACAACTAGAA S89Inf42 3896
S89Inf42_AAACCCACATCGCTGG S89Inf42 9012
As we can see, there are different ways to access and work with the metadata. In addition, we can see that we begin with built-in information such as the orig.ident, nCount_RNA, and nFeature_RNA.
orig.ident - defaults to the project labelsnCount_RNA - number of UMIs per cellnFeature_RNA - number of genes / features per cell
This information, along with other metrics, will be used for quality filtering.
Each cell in a Seurat object can be assigned an identity label, and Seurat stores one set of these labels as the active Ident (active.ident). These identity labels usually come from metadata categories such as sample ID, cluster membership, condition, or cell type annotation. Many Seurat functions use the active.ident automatically when grouping cells for visualization or analysis.See ?Idents() and this vignette for working with idents.
We can subset by features or cells by lung[1:10,] or lung[,1:10] respectively. To subset using metadata, use subset().
For example, if we want to subset to the infected samples only, we can use:
An object of class Seurat
19736 features across 14023 samples within 1 assay
Active assay: RNA (19736 features, 0 variable features)
2 layers present: counts.S89Inf42, counts.S99Inf42
We will look at subsetting more when filtering cells based on quality.
Add to metadata
We can add information to our metadata by accessing and assigning to metadata columns or using ?AddMetaData().
Add treatment condition to the metadata (Tamoxifen only vs. Tamoxifen + Influenza):
lung$treatment <- ifelse(str_detect(lung@meta.data$orig.ident,
"S91Hom42|S93Hom3"),
"Tamoxifen", "Tamoxifen_Influenza")Add time point information to the metadata:
lung$time_point <- case_when(
str_detect(lung@meta.data$orig.ident, "S93Hom3") ~ "Homeostasis_Day3",
str_detect(lung@meta.data$orig.ident, "S91Hom42") ~ "Homeostasis_Day42",
.default = "Infection_Day42"
)Add treatment + time point to the metadata:
lung$treatment_tp <- paste(lung$treatment, lung$time_point, sep="_")Other useful information from the Seurat object
The Seurat v5 object doesn’t require all assays have the same cells. In this case, Cells() can be used to return the cell names of the default assay while colnames() can be used to return the cell names of the entire object.
[1] "S89Inf42_AAACCCAAGCAACAGC" "S89Inf42_AAACCCACAACTAGAA"
[3] "S89Inf42_AAACCCACATCGCTGG" "S89Inf42_AAACCCACATGGAAGC"
[5] "S89Inf42_AAACCCAGTATGGAGC" "S89Inf42_AAACCCAGTCATCACA"
[1] "S89Inf42_AAACCCAAGCAACAGC" "S89Inf42_AAACCCACAACTAGAA"
[3] "S89Inf42_AAACCCACATCGCTGG" "S89Inf42_AAACCCACATGGAAGC"
[5] "S89Inf42_AAACCCAGTATGGAGC" "S89Inf42_AAACCCAGTCATCACA"
To return feature / gene names:
[1] "Xkr4" "Gm19938" "Rp1" "Sox17" "Gm37587" "Gm7357"
[1] "Xkr4" "Gm19938" "Rp1" "Sox17" "Gm37587" "Gm7357"
For multimodal data, list the assays using:
Assays(lung)[1] "RNA"
To return the number of cells or features use:
[1] 20329 28494
ncol(lung) will return the number of cells across all layers. If you want a specific layer, use ncol(lung[["RNA"]]$counts.S89Inf42). This information is also quickly accessible via the Global Environment when using RStudio or by using str().
We can also quickly plot the number of cells from the metadata:
# colors
sample_colors <- c(
S89Inf42 = "#0072B2", # blue
S91Hom42 = "#E69F00", # orange
S93Hom3 = "#009E73", # bluish green
S99Inf42 = "#CC79A7" # reddish purple
)
# Visualize the number of cell counts per sample
lung@meta.data %>%
ggplot(aes(x = orig.ident, fill = orig.ident)) +
geom_bar(color = "black") +
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = after_stat(count)),
position = position_stack(vjust = 0.5)) +
scale_fill_manual(values=sample_colors)+
theme_classic() +
theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
ggtitle("Number of Cells per Sample")Save the object
At this point, our Seurat object may be fairly large. It is wise to save this for downstream applications using saveRDS().
saveRDS(lung, "./outputs/merged_Seurat_lung.rds")Other Tutorials of Interest
References
The following resources were instrumental in designing this lesson: