scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat

Modified

May 10, 2026

Learning Objectives

Learn about options for analyzing your scRNA-Seq data.
Learn about resources for learning R / Python programming.
Learn about AI resources
Learn how to import your data for working with R.
Learn about Seurat and the Seurat object including how to create the object and access and manipulate stored data.

Exploratory Data Analysis: scRNA-Seq

This tutorial assumes that all pre-processing steps (read demultiplexing, FASTQ QC, reference based alignment, error correction) have been completed. At this stage of the analysis, a gene-by-cell count matrix has been generated for each sample.

Analysis Ecosystems

Following Cell Ranger or other preprocessing tools, you will typically obtain a gene-by-cell count matrix for each sample. The three dominant open-source ecosystems for downstream single-cell analysis are:

1. R — Seurat

https://satijalab.org/seurat/articles/get_started_v5_new

Seurat (Satija Lab) is a unified, high-level framework for single-cell transcriptomics, multimodal data (e.g., RNA + ATAC), and spatial transcriptomics. It has a large user base and extensive documentation, making it an excellent starting point for new analysts.

Seurat v5 introduced a layered assay structure and improved support for large-scale datasets, including disk-backed matrices and sketch-based workflows for million-cell analyses. The ecosystem also supports reference mapping (e.g., Azimuth) and integrates well with multimodal workflows.

2. R — Bioconductor (SingleCellExperiment ecosystem)

https://bioconductor.org/books/release/OSCA/

Bioconductor provides a modular, statistically rigorous ecosystem for analyzing biological data. Single-cell workflows are built around the SingleCellExperiment object and leverage highly interoperable packages (70+) including scater, scran, scuttle, and batchelor.

Rather than a single monolithic framework, Bioconductor emphasizes composability and statistical transparency.

Resources:

Orchestrating Single-Cell Analysis with Bioconductor - focuses on scRNA-Seq
Orchestrating Spatial Transcriptomics Analysis with Bioconductor - focuses on spatial transcriptomics

3. Python — scverse

https://scverse.org/learn/

The scverse ecosystem is a collection of interoperable Python packages centered around the AnnData object. Core tools include:

scanpy (core analysis workflows)
scvi-tools (deep generative modeling, batch correction, multimodal integration)
muon / mudata (multimodal analysis)
scirpy (immune repertoire)
squidpy (spatial transcriptomics)

scverse is widely adopted in large atlas-scale projects and is particularly strong for machine learning–based workflows and scalable analysis.

Resources:

Single-cell best practices - covers different analysis ecosystems and workflows.

These ecosystems require familiarity with either R or Python. Your choice will likely depend on your background, your lab’s preferences, and your scientific goals.

Note

There are also tools outside of R and Python.

Interoperability Across Ecosystems

In practice, you may benefit from using tools across multiple ecosystems. Seurat, Bioconductor, and scverse use different object classes, but they store many of the same core components: expression matrices, cell metadata, feature metadata, dimensional reductions, graphs, and other analysis results.

Concept	`Seurat`	`SingleCellExperiment`	`AnnData`
Main expression data	`assays`, such as the `RNA` assay	`assays`, such as `counts` or `logcounts`	`X` and `layers`
Cell metadata	`meta.data`	`colData`	`obs`
Gene / feature metadata	stored within assays	`rowData`	`var`
Dimensional reductions	`reductions`	`reducedDims`	`obsm`
Graphs / cell-cell relationships	`graphs`, `neighbors`	`colPairs`	`obsp`
Multimodal support	Multiple assays, e.g., RNA, ATAC, ADT	`altExp()` for alternative feature sets; `MultiAssayExperiment` for broader multi-omics	Usually one modality per `AnnData`; `MuData` object for multimodal
Miscellaneous results	`misc`, `tools`, `commands`	`metadata`	`uns`

One important difference is matrix orientation. In Seurat and SingleCellExperiment, expression matrices are typically stored as features x cells. In AnnData, the primary matrix is stored as observations x variables, which for scRNA-seq usually means cells x genes. Conversion tools handle this transpose, but it is a common source of confusion when manually inspecting objects.

A useful mental map is:

Seurat@meta.data, colData(sce), and adata.obs all describe cells.
Seurat assay-level metadata describes genes or features, although some user- or tool-generated feature information may also appear in misc; the closest equivalents are rowData(sce) and adata.var.
Seurat assay layers, SCE assays, and AnnData layers can store different versions of the expression matrix, such as raw counts, normalized data, or scaled data.
PCA, UMAP, and t-SNE coordinates live in Seurat reductions, SCE reducedDims, and AnnData obsm.

AnnData (.h5ad) remains the most useful cross-platform exchange format for moving single-cell data between Python, Seurat, and Bioconductor workflows. For R-based conversion, anndataR is now our preferred starting point because it can read and write .h5ad files and convert to/from both Seurat and SingleCellExperiment objects. zellkonverter remains useful for SingleCellExperiment <-> AnnData workflows, especially in Bioconductor-centered analyses.

Check out the anndataR documentation for helpful diagrams showing how AnnData maps to Seurat and SingleCellExperiment.

You may still see older tutorials recommending SeuratDisk for .h5Seurat <-> .h5ad conversion, but it does not appear to be actively maintained for current Seurat workflows and is not a reliable choice for Seurat v5 objects.

Even with better tooling, conversion can introduce edge cases. After conversion, verify that cell and gene identifiers, count layers, metadata, embeddings, and dimensional reductions are where you expect them. Older or non-standard .h5ad files can also cause problems; in those cases, it may be easiest to open the file in a Python anndata / scanpy environment and write out an updated .h5ad before converting again.

Point-and-click alternatives

If you lack programming skills and learning to code seems daunting, there are GUI based alternatives including, but not limited to the following:

Partek Flow (License available to NCI researchers)
- Relevant trainings:
Qlucore Omics Explorer (License available to NCI-CCR researchers).

Python Learning Resources

R Learning Resources

R Cheat Sheets.
R Crash Course
Free tutorials. There are at least 230 Git repositories that focus on R training related to bioinformatics. Check out Glittr.

BTEP lessons / courses

Python Vs R

The argument isn’t necessarily about which language is better, but which one you are most comfortable with and can get the job done.

For the remainder of this tutorial, we will focus on the Seurat ecosystem.

What is Seurat?

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data. —https://satijalab.org/seurat/

Advantages:

beginning-to-end workflows
Nice vignettes
Regularly updated
Somewhat extensible
Scalable

Disadvantages:

Updates break some functionality
Interoperability is lacking
Maintained by a single group
Many functions are poorly documented

Updates with Seurat v5

You can read about major changes between Seurat v5 and v4 here.

Seurat v5 introduced the following new features:

Integrative multi-modal analysis with bridge integration
‘Sketch’-based analysis of large data sets
methods for spatial transcriptomics
assay layers

How much R programming do I need to know to use Seurat?

The answer to this question will largely depend on the user. While some will be able to learn R on the go, others will need to know some amount of R programming prior to beginning their analysis. If needed, check out this quick R Refresher. There are also several R courses available through BTEP with self-contained, asynchronous content that can be used to learn independently of live sessions.

The most important thing to know is how to find help!

How to find help?

ALWAYS, ALWAYS, ALWAYS read the documentation.

Use the help pane in the lower right of RStudio or the functions, help() and help.search() or ? and ??.

Check out package vignettes (vignette()).

Check the Github site if one is available for help with a specific package. There may also be known issues listed under the Github Issues tab (Seurat issues).

Google for help!

Use Google or other internet browser to troubleshoot error message or find help performing a specific task. But, make sure you are precise and informative in your search. Stack Overflow is a particularly great resource. A good Google search includes 3 elements:

The specific action (e.g., how to rename a column).
The programming language (e.g., R programming).
The specific style/technique for coding (e.g., dplyr or tidyverse package) Example: “How to rename a column in R with dplyr/tidyverse”. — https://crd230.github.io/tips.html.

If this fails, feel free to email us at BTEP.

Use Generative AI

Use Generative AI for help. For a list of generative AI resources available to NIH researchers, see our BTEP GenAI Resource Guide.

GenAI and Single Cell

Generative AI is beginning to support single-cell analysis by helping with workflow guidance, cluster annotation, marker interpretation, and summarizing results.
* SCassist is one example in the Seurat ecosystem; it uses large language models to help guide parameter choices and interpret scRNA-seq results.

CellWhisperer is a newer natural-language interface for exploring single-cell datasets, making it easier to query and interpret results conversationally.
scGPT and Geneformer are examples of single-cell foundation models trained on large transcriptomic datasets. These are primarily research tools and may support tasks such as cell annotation, representation learning, and prediction.
Although these tools are promising, AI-generated suggestions should be treated as decision support rather than ground truth and should always be checked against QC metrics, marker genes, metadata, and biological context.

Accessing R / RStudio

Using R locally

Single cell analysis can be computationally demanding. Running the analysis locally will likely only be possible with smaller sample sizes ( <= 8 samples).

R Installation

Mac or Windows (local installation): Follow instructions here: https://posit.co/download/rstudio-desktop/.

NIH-managed machines: Follow your institute’s IT installation policies or submit a service request if administrative privileges are required.

Scalability with Seurat

BPCells

The package BPCells can be used to manage SeuratObjects and make your analysis more computationally efficient. BPCells is an R package that stores single-cell count matrices on disk in a compressed format so very large datasets can be analyzed with much lower memory use. However, if your Seurat object fits comfortably in memory, don’t use BPCells just because it exists. Once you are in the 100k+ cell range, merging many samples, or starting to hit RAM limits, BPCells becomes worthwhile. For atlas-scale or million-cell work, it is the recommended direction, but expect some compatibility tradeoffs with plotting and third-party tools.

See this vignette for more information.

Sketch-Based Assays with Seurat v5

A sketch-based assay in Seurat v5 is a smaller, in-memory assay built from a representative subset of cells from a much larger dataset, typically using leverage-score sampling so that both common and rare cell states are better preserved than with simple random subsampling. Consider using a sketch when your dataset is too large for convenient in-memory analysis, such as large multi-sample or million-cell studies, but remember that the sketch is still an approximation: important populations should be checked for stability across sketch sizes and validated by projecting results back to the full dataset. Useful references: Seurat v5 sketch analysis vignette, SketchData reference, LeverageScore reference, and BPCells with Seurat Objects.

HPC Open OnDemand

Due to limits on computational resources, you may be interested in running your analysis on an HPC. Biowulf is the NIH high performance compute cluster. It has greater than 90k processors, and can easily perform large numbers of simultaneous jobs. Biowulf also includes greater than 600 pre-installed scientific software and databases. To use the NIH HPC Systems, you must have an NIH HPC account.

When using R interactively, the easiest way to access R on Biowulf is via an integrated development environment (e.g. RStudio, VSCode, etc.). These interactive applications are available on Biowulf through NIH HPC Open OnDemand.

To Connect to R with HPC Open OnDemand

Connect to the NIH network and use the following link to access Open OnDemand: https://hpcondemand.nih.gov.
Login with your PIV Card.
Select the interactive app of interest (e.g., RStudio).
Adjust computational resources for your session (CPUs, memory, local scratch, etc.) and Launch.
- A good starting point for scRNA-Seq:
  - Number of Hours - 8
  - Number of CPUs - 8
  - Allocated Memory (GB) - 100
  - Allocated Local Scratch (GB) - 50
Start the session.

Need a Biowulf Account?

If you do not already have a Biowulf account, you can obtain one by following the instructions here. NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory. Obtaining an account requires PI approval and a nominal fee of $40 per month. Accounts are renewed annually contingent upon PI approval.

When you apply for a Biowulf account you will be issued two primary storage spaces:
/home/$USER (16 GB)
/data/$USER (100 GB)

You may request more space in /data/$USER by filing an online storage request.

NIH HPC Documentation

If you need help with Biowulf, the NIH HPC systems are well-documented at hpc.nih.gov. The User guides, Training documentation, and How To docs are also fantastic resources for getting help with most HPC tasks.

For all other questions, consider emailing the HPC team directly at staff@hpc.nih.gov.

Getting Started with Seurat

The Data

In this tutorial, we are using a subset of data from Niethamer et al. 2025, Longitudinal single-cell profiles of lung regeneration after viral infection reveal persistent injury-associated cell states, published in Cell Stem Cell. Count matrices are available in GEO here, and the raw fastq files are available in the SRA.

A project directory containing these data and associated files can be downloaded here.

This study examined lung regeneration following influenza viral infection using single-cell RNA sequencing to track cell states longitudinally. The full study profiled multiple genotypes; however, the four samples used in this tutorial were all generated from mice carrying the Ki67-CreERT2; ROSA26-LSL-tdTomato reporter system, which enables lineage tracing of proliferating cells following tamoxifen induction. Our subset includes lung tissue collected at two homeostasis time points (day 3 and day 42) and two samples from the post-infection condition at day 42. The two post-infection samples (GSM8181589 and GSM8181599) serve as biological replicates of the infected condition at day 42, enabling comparison of injury-associated lung cell states against homeostatic controls.

Sample	Species	Genotype	Treatment	Time Point	Status
GSM8181589 (`S89Inf42`)	Mouse	Ki67-CreERT2; ROSA26-LSL-tdTomato	Tamoxifen + Influenza	Day 42 post-infection	Lung (post-infection)
GSM8181591 (`S91Hom42`)	Mouse	Ki67-CreERT2; ROSA26-LSL-tdTomato	Tamoxifen	Homeostasis Day 42	Lung (homeostasis)
GSM8181593 (`S93Hom3`)	Mouse	Ki67-CreERT2; ROSA26-LSL-tdTomato	Tamoxifen	Homeostasis Day 3	Lung (homeostasis)
GSM8181599 (`S99Inf42`)	Mouse	Ki67-CreERT2; ROSA26-LSL-tdTomato	Tamoxifen + Influenza	Day 42 post-infection	Lung (post-infection)

Installation

Seurat can be installed directly from CRAN.

install.packages("Seurat")

If you would like to install the development version or previous versions, see the installation instructions available here.

Note

The version of R on Biowulf used for this tutorial is 4.5.2 (module -r avail '^R$'), and the version of Seurat that complements this installation is v5.4.0. Seurat does not need to be installed if using R on Biowulf.

This tutorial does not use renv for package management. Instead, we rely on the HPC’s centrally managed R library, which ensures that all packages are compiled against the correct system libraries. If you are running this tutorial on a personal machine or a different computing environment, you may need to install the required packages manually and may encounter system dependency issues not covered here.

Load the package from your R library. We will also go ahead and load packages from the tidyverse.

library(Seurat)
library(tidyverse)

Seurat Pre-processing Overview

The steps below outline the standard Seurat preprocessing workflow, from raw count matrices to clustered cells. Before diving into the code, it is helpful to understand both the overall sequence of steps and the key decision you will make early on: Your analysis strategy (See Note 1).

Core preprocessing steps

Regardless of strategy, the major steps are the same:

Import data — read count matrices from Cell Ranger / STARsolo output or other (h5, mtx, or similar formats).
Quality control — filter low-quality cells based on the number of detected genes (nFeature_RNA), UMI counts (nCount_RNA), the percentage of mitochondrial reads, and other factors.
Normalization — scale gene expression values to account for differences in sequencing depth across cells (e.g., log-normalization) (NormalizeData()).
Feature selection — identify highly variable genes that drive biological differences, reducing noise from lowly or uniformly expressed genes (FindVariableFeatures()). Handled automatically by SCTransform().
Scaling — center and scale each gene’s expression across cells so that highly expressed genes do not dominate downstream analysis (ScaleData()). This also allows regression of unwanted sources of variation (e.g. cell cycle, mitochondrial content).

Note: SCTransform() replaces both normalization and scaling (steps 3–4) in a single step using regularized negative binomial regression. It simultaneously corrects for sequencing depth and scales the data, and also performs feature selection (step 5).
Dimensionality reduction — use PCA to compress the data, followed by UMAP or t-SNE for visualization.
Clustering — build a shared nearest-neighbor graph and apply a community detection algorithm to group cells with similar expression profiles.
Downstream analysis — identify marker genes, annotate cells, and compare conditions.

Note 1: Choosing an analysis strategy

A key early decision is whether to work with a single merged object or individual per-sample objects. Both approaches support per-sample QC — the difference is primarily one of workflow style and how interactively you want to inspect and process each sample.

Strategy A: Merge early, use layers (this tutorial)

In Seurat v5, you can merge multiple samples into a single object while retaining per-sample count data in separate layers (e.g., counts.GSM8181589, counts.GSM8181591). QC metrics can be visualized across all samples simultaneously, and thresholds can still be applied on a per-sample basis by filtering on orig.ident. This approach works well for interactive, exploratory analysis where you want to compare samples side by side before deciding on cutoffs.

Best for: interactive workflows where you want to inspect all samples together, compare QC distributions across samples, and make informed filtering decisions before proceeding.

Considerations: Requires more hands-on decision-making at the QC stage, as thresholds are applied manually after inspecting the merged object.

Strategy B: Per-sample objects, process and merge later

Alternatively, you can create a separate Seurat object for each sample, run QC and normalization independently, and then merge or integrate the filtered objects. This approach lends itself to scripted, automated pipelines where each sample is processed with fixed thresholds from QC through to clustering without manual intervention between steps.

Best for: automated or reproducible pipelines where the same workflow is applied to each sample in sequence, or when samples need to be processed independently before any comparison.

Considerations: More objects to manage, but the workflow can be more easily scripted end-to-end. Samples can later be merged with merge().

Working with per-sample objects

If you want to process each sample as an individual object, you can create them using a for loop (see the collapsible tip in the Creating a Seurat Object section below), run QC and normalization per object, and merge the filtered objects before proceeding.

Whether you ultimately merge samples into a single object or keep them separate, an integration step is likely necessary when comparing across conditions or batches.

Importing Data

Now that we have loaded the package, we can import our data.

Generally, in R programming, functions that involve data import begin with “read / Read”. Seurat includes a number of read functions for different data / platform types.

For 10X scRNA-Seq data, the following functions will be most relevant:

Read10X() - primary argument is a directory from CellRanger containing the matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv files.
Read10X_h5() - reads the hdf5 file from 10X CellRanger (scATAC-Seq or scRNA-Seq).
ReadMtx() - read from local or remote (arguments for the .mtx file, cells/barcodes file, and features/genes file). This option is not just for 10X data, but is useful for most platforms and pipelines.

The function you choose will depend on the format of your data.

Note

There are a number of other read and load functions available for different platforms and data types (e.g., Akoya CODEX, Nanostring SMI, SlideSeq, 10X Visium, 10X Xenium, Vizgen MERFISH). You can find relevant functions here.

Importing data in a .tsv or .csv

If you are trying to import publicly available data, and the available count data was saved as a .txt, .tsv, or .csv file, you can import these data using regular base R import functions (e.g., read.table(), read.csv()). The data does not need to be in a sparse format to create a Seurat object. However, if you would like to save memory upon import by loading as a sparse data matrix, see readSparseCounts() from the Bioconductor package scuttle.

Should I use the filtered for unfiltered data from Cell Ranger?

The raw / unfiltered matrix includes all valid cell barcodes from GEMs (Gel Bead-In EMulsions). This includes GEMs with free-floating mRNA from lysed or dead cells. Cell Ranger uses a barcode vs UMI count rank plot (knee plot) to eliminate such background noise.

While using the raw files will give you greater control over filtering, these files are also much larger. You can visualize the web summary output from Cell Ranger to see what was filtered. If you do not see a clear inflection point, you may want to use the raw files and apply other techniques for filtering.

STARsolo

If you use STARsolo, an alternative to 10X CellRanger for gene quantification output, you can use ReadSTARsolo() to import the data with Seurat.

Loading hdf5 files

To load a single file, we use

S89Inf42 <- Read10X_h5("./data/GSM8181589_EEM-scRNA-126.h5")

The only required argument is the path to the 10X h5 file.

To read multiple files, you can use a for loop or a function(s) from the apply family. For example,

# List files 
files <- list.files(path = "./data/", pattern = "*.h5")

# Create a list of count matrices
#lapply returns a list
h5_read <- lapply(paste0("./data/", files), Read10X_h5)

# Automated way to assign names using the GSM accession number
#sapply returns a vector
# names(h5_read) <- sapply(files,
#                         function(x) { str_split_1(x, "_")[1] },
#                         USE.NAMES = FALSE)

# Or assign names manually
# Useful for new names
 names(h5_read) <- c("S89Inf42", "S91Hom42", "S93Hom3", "S99Inf42")

Let’s take a look at S89Inf42.

S89Inf42[1:5,1:5]

5 x 5 sparse Matrix of class "dgCMatrix"
              AAACCCAAGCAACAGC AAACCCAAGGTCGTAG AAACCCACAACTAGAA
4933401J01Rik                .                .                .
Gm26206                      .                .                .
Xkr4                         .                .                .
Gm18956                      .                .                .
Gm37180                      .                .                .
              AAACCCACATCGCTGG AAACCCACATGGAAGC
4933401J01Rik                .                .
Gm26206                      .                .
Xkr4                         .                .
Gm18956                      .                .
Gm37180                      .                .

The data is stored as a sparse matrix to save CPU, time, and memory. Each . is actually a 0.

Creating a Seurat Object

Once we have read in the matrices, the next step is to create a Seurat object. The Seurat object will be used to store the raw count matrices, sample information, and processed data (normalized counts, reductions, etc.).

CreateSeuratObject() is used to create the object. This requires the matrix of counts for the first argument (counts = S89Inf42). We can also include a project name (e.g., the sample name). Other useful arguments include min.cells and min.features, which allow some initial filtering. For example, min.cells = 10 will filter genes / features that are not present across a minimum of 10 cells, while min.features = 200 will filter cells that do not contain a minimum of 200 genes / features.

# Single sample
S89Inf42 <- CreateSeuratObject(counts = S89Inf42, project = "S89Inf42",
                                 min.cells = 10, min.features = 200)

Warning: Feature names cannot have underscores ('_'), replacing with dashes
('-')

# View S89Inf42
S89Inf42

An object of class Seurat 
18500 features across 5797 samples within 1 assay 
Active assay: RNA (18500 features, 0 variable features)
 1 layer present: counts

For all samples, we can use mapply to loop through the h5_read list and the list names (names(h5_read)).

# All samples
lung <- mapply(CreateSeuratObject, counts = h5_read,  
               project = names(h5_read),
               MoreArgs = list(min.cells = 10, min.features = 200))
# View lung 
lung

$S89Inf42
An object of class Seurat 
18500 features across 5797 samples within 1 assay 
Active assay: RNA (18500 features, 0 variable features)
 1 layer present: counts

$S91Hom42
An object of class Seurat 
18821 features across 7329 samples within 1 assay 
Active assay: RNA (18821 features, 0 variable features)
 1 layer present: counts

$S93Hom3
An object of class Seurat 
18965 features across 7142 samples within 1 assay 
Active assay: RNA (18965 features, 0 variable features)
 1 layer present: counts

$S99Inf42
An object of class Seurat 
19207 features across 8226 samples within 1 assay 
Active assay: RNA (19207 features, 0 variable features)
 1 layer present: counts

# Remove the original sparse matrices
rm(h5_read)

We can merge our Seurat objects, which are now in a list, in order to view the QC metrics and compare across samples. If samples are fairly different, QC thresholds should be applied on a per sample basis.

lung <- merge(lung[[1]], y = lung[2:length(lung)], 
              add.cell.ids = names(lung), project = "Lung")

The cell IDs can overlap between samples, so here we append the sample ID as a prefix to each cell ID using add.cell.ids. If this isn’t done “and any cell names are duplicated, cell names will be appended with _X, where X is the numeric index of the object in c(x, y)” (See merge.Seurat()).

Warning

We could have created a merged object using h5_read, as a list can be provided for the argument counts (DO NOT RUN THIS STEP):

lung2 <- CreateSeuratObject(counts = h5_read, min.cells = 3,
                            min.features = 200)

In Seurat 5, passing a list to CreateSeuratObject() is better supported than in earlier versions. However, this approach still does not give us control over cell IDs, and it bypasses the ability to set per-sample project names, which are useful for tracking samples during QC. For these reasons, we prefer the mapply + merge() approach used above.

Use a for loop to load the data and create an object for each file (DO NOT RUN)

If you want a separate object for each object, you can use this for loop:

# Create a Seurat object for each sample
for (file in files){
        seurat_data <- Read10X_h5(paste0("./data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 200, 
                                         min.cells = 3,
                                         project = file)
        assign(file, seurat_obj)
}

See here for an explanation of the code.

Inspect and work with a Seurat Object

The Seurat Object is a data container for single cell RNA-Seq and related data. It is an S4 object, which is a type of data structure that stores complex information (e.g., scRNA-Seq count matrix, associated sample information, and data / results generated from downstream analyses) in one or more slots. It is extensible and can interact with many different methods, including those from other developers.

You can see the primary slots using glimpse().

glimpse(S89Inf42)

Formal class 'Seurat' [package "SeuratObject"] with 13 slots
  ..@ assays      :List of 1
  .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
  ..@ meta.data   :'data.frame':    5797 obs. of  3 variables:
  .. ..$ orig.ident  : Factor w/ 1 level "S89Inf42": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ nCount_RNA  : num [1:5797] 5341 3896 9012 6752 1339 ...
  .. ..$ nFeature_RNA: int [1:5797] 2214 1513 2617 2595 201 3162 2097 1152 3149 3705 ...
  ..@ active.assay: chr "RNA"
  ..@ active.ident: Factor w/ 1 level "S89Inf42": 1 1 1 1 1 1 1 1 1 1 ...
  .. ..- attr(*, "names")= chr [1:5797] "AAACCCAAGCAACAGC" "AAACCCACAACTAGAA" "AAACCCACATCGCTGG" "AAACCCACATGGAAGC" ...
  ..@ graphs      : list()
  ..@ neighbors   : list()
  ..@ reductions  : list()
  ..@ images      : list()
  ..@ project.name: chr "S89Inf42"
  ..@ misc        : list()
  ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
  .. ..$ : int [1:3] 5 4 0
  ..@ commands    : list()
  ..@ tools       : list()

S89Inf42 has 13 slots: assays, meta.data, active.assay, active.ident, graphs, neighbors, reductions, images, misc, version, commands, and tools. See a description of the slots here. For guidance on accessing data stored in the SeuratObject, see this vignette.

Unlike previous iterations of Seurat, Seurat v5 contains assays in layers to accommodate different assays and data types. “For example, these layers can store: raw counts (layer=‘counts’), normalized data (layer=‘data’), or z-scored/variance-stabilized data (layer=‘scale.data’).”

The diagrams below summarize a typical Seurat v5 object structure. The first shows the top-level slots in the object, and the second zooms in on the RNA assay to show how Seurat v5 stores expression matrices in layers.

flowchart TB
    A["Seurat v5 object"]

    A --> B["assays"]
    A --> C["meta.data"]
    A --> D["active.assay"]
    A --> E["active.ident"]
    A --> F["reductions"]
    A --> G["graphs"]
    A --> H["neighbors"]
    A --> I["images"]
    A --> J["misc"]
    A --> K["version"]
    A --> L["commands"]
    A --> M["tools"]

    classDef main fill:#143642,color:#ffffff,stroke:#0b1f26,stroke-width:2px;
    classDef slot fill:#d7efe9,color:#102a2e,stroke:#4f8a7b,stroke-width:1.5px;

    class A main;
    class B,C,D,E,F,G,H,I,J,K,L,M slot;

flowchart TB
    A["Seurat object"]
    A --> B["assays"]
    B --> C["RNA assay"]

    C --> D["layers"]
    D --> E["counts<br/>raw counts"]
    D --> F["data<br/>normalized expression"]
    D --> G["scale.data<br/>scaled expression"]

    E --> H["counts.S89Inf42"]
    E --> I["counts.S91Hom42"]
    E --> J["counts.S93Hom3"]
    E --> K["counts.S99Inf42"]

    C --> L["feature metadata"]
    C --> M["variable features"]

    classDef main fill:#143642,color:#ffffff,stroke:#0b1f26,stroke-width:2px;
    classDef slot fill:#d7efe9,color:#102a2e,stroke:#4f8a7b,stroke-width:1.5px;
    classDef assay fill:#f7d08a,color:#3b2c18,stroke:#b8871a,stroke-width:1.5px;
    classDef layer fill:#f5f1e8,color:#2f2a1f,stroke:#a08d62,stroke-width:1px;
    classDef note fill:#f3f7fb,color:#213547,stroke:#9db4c7,stroke-width:1px;

    class A main;
    class B slot;
    class C,D assay;
    class E,F,G,H,I,J,K layer;
    class L,M note;

In a newly created object, many slots are still empty. As you run preprocessing, clustering, dimensionality reduction, and annotation, Seurat progressively fills these slots with new results.

To return the names of the layers:

Layers(S89Inf42)

[1] "counts"

To access a specific count layer use:

S89Inf42[["RNA"]]$counts |> head()
# or
S89Inf42@assays$RNA$counts |> head()
# or
LayerData(S89Inf42, assay = "RNA", layer = "counts") |> head()

Because we have just created this object, most of the slots are empty. We will save results back to the object and fill the slots as we work.

You can always check the version of Seurat that was used to build the object with:

S89Inf42@version

[1] '5.4.0'

And check commands and parameters used to generate results with:

S89Inf42@commands

list()

We haven’t worked on the object, so there is nothing here. Compare with pbmc_small, a built-in dataset in the SeuratObject package.

head(pbmc_small@commands, 1)

$NormalizeData.RNA
Command: NormalizeData(object = pbmc_small)
Time: 2018-08-27 22:32:17.136699
assay : RNA 
normalization.method : LogNormalize 
scale.factor : 10000 
verbose : TRUE

This information is useful for documenting your analysis.

Examine metadata

The metadata in the Seurat object is located in lung@meta.data and contains the information associated with each cell.

We can access the metadata using:

head(lung@meta.data) # using head to return only the first 6 rows

                          orig.ident nCount_RNA nFeature_RNA
S89Inf42_AAACCCAAGCAACAGC   S89Inf42       5341         2214
S89Inf42_AAACCCACAACTAGAA   S89Inf42       3896         1513
S89Inf42_AAACCCACATCGCTGG   S89Inf42       9012         2617
S89Inf42_AAACCCACATGGAAGC   S89Inf42       6752         2595
S89Inf42_AAACCCAGTATGGAGC   S89Inf42       1339          201
S89Inf42_AAACCCAGTCATCACA   S89Inf42      14257         3162

head(lung[[]])

                          orig.ident nCount_RNA nFeature_RNA
S89Inf42_AAACCCAAGCAACAGC   S89Inf42       5341         2214
S89Inf42_AAACCCACAACTAGAA   S89Inf42       3896         1513
S89Inf42_AAACCCACATCGCTGG   S89Inf42       9012         2617
S89Inf42_AAACCCACATGGAAGC   S89Inf42       6752         2595
S89Inf42_AAACCCAGTATGGAGC   S89Inf42       1339          201
S89Inf42_AAACCCAGTCATCACA   S89Inf42      14257         3162

Accessing a column or columns:

# Access a single column 
head(lung$orig.ident)

S89Inf42_AAACCCAAGCAACAGC S89Inf42_AAACCCACAACTAGAA S89Inf42_AAACCCACATCGCTGG 
               "S89Inf42"                "S89Inf42"                "S89Inf42" 
S89Inf42_AAACCCACATGGAAGC S89Inf42_AAACCCAGTATGGAGC S89Inf42_AAACCCAGTCATCACA 
               "S89Inf42"                "S89Inf42"                "S89Inf42"

head(lung[["orig.ident"]])

                          orig.ident
S89Inf42_AAACCCAAGCAACAGC   S89Inf42
S89Inf42_AAACCCACAACTAGAA   S89Inf42
S89Inf42_AAACCCACATCGCTGG   S89Inf42
S89Inf42_AAACCCACATGGAAGC   S89Inf42
S89Inf42_AAACCCAGTATGGAGC   S89Inf42
S89Inf42_AAACCCAGTCATCACA   S89Inf42

# Access multiple columns and rows 
head(lung[[c("orig.ident", "nCount_RNA")]])[1:3,]

                          orig.ident nCount_RNA
S89Inf42_AAACCCAAGCAACAGC   S89Inf42       5341
S89Inf42_AAACCCACAACTAGAA   S89Inf42       3896
S89Inf42_AAACCCACATCGCTGG   S89Inf42       9012

As we can see, there are different ways to access and work with the metadata. In addition, we can see that we begin with built-in information such as the orig.ident, nCount_RNA, and nFeature_RNA.

orig.ident - defaults to the project labels
nCount_RNA - number of UMIs per cell
nFeature_RNA - number of genes / features per cell

This information, along with other metrics, will be used for quality filtering.

Seurat Idents?

Each cell in a Seurat object can be assigned an identity label, and Seurat stores one set of these labels as the active Ident (active.ident). These identity labels usually come from metadata categories such as sample ID, cluster membership, condition, or cell type annotation. Many Seurat functions use the active.ident automatically when grouping cells for visualization or analysis.See ?Idents() and this vignette for working with idents.

Subsetting the object

We can subset by features or cells by lung[1:10,] or lung[,1:10] respectively. To subset using metadata, use subset().

For example, if we want to subset to the infected samples only, we can use:

subset(x = lung, idents = c("S89Inf42", "S99Inf42"))

An object of class Seurat 
19736 features across 14023 samples within 1 assay 
Active assay: RNA (19736 features, 0 variable features)
 2 layers present: counts.S89Inf42, counts.S99Inf42

We will look at subsetting more when filtering cells based on quality.

Add to metadata

We can add information to our metadata by accessing and assigning to metadata columns or using ?AddMetaData().

Add treatment condition to the metadata (Tamoxifen only vs. Tamoxifen + Influenza):

lung$treatment <- ifelse(str_detect(lung@meta.data$orig.ident, 
                                    "S91Hom42|S93Hom3"),
                         "Tamoxifen", "Tamoxifen_Influenza")

Add time point information to the metadata:

lung$time_point <- case_when(
  str_detect(lung@meta.data$orig.ident, "S93Hom3") ~ "Homeostasis_Day3",
  str_detect(lung@meta.data$orig.ident, "S91Hom42") ~ "Homeostasis_Day42",
  .default = "Infection_Day42"
)

Add treatment + time point to the metadata:

lung$treatment_tp <- paste(lung$treatment, lung$time_point, sep="_")

Other useful information from the Seurat object

The Seurat v5 object doesn’t require all assays have the same cells. In this case, Cells() can be used to return the cell names of the default assay while colnames() can be used to return the cell names of the entire object.

head(Cells(lung, layer = "counts.S89Inf42"))

[1] "S89Inf42_AAACCCAAGCAACAGC" "S89Inf42_AAACCCACAACTAGAA"
[3] "S89Inf42_AAACCCACATCGCTGG" "S89Inf42_AAACCCACATGGAAGC"
[5] "S89Inf42_AAACCCAGTATGGAGC" "S89Inf42_AAACCCAGTCATCACA"

head(colnames(lung))

[1] "S89Inf42_AAACCCAAGCAACAGC" "S89Inf42_AAACCCACAACTAGAA"
[3] "S89Inf42_AAACCCACATCGCTGG" "S89Inf42_AAACCCACATGGAAGC"
[5] "S89Inf42_AAACCCAGTATGGAGC" "S89Inf42_AAACCCAGTCATCACA"

To return feature / gene names:

head(Features(lung))

[1] "Xkr4"    "Gm19938" "Rp1"     "Sox17"   "Gm37587" "Gm7357"

head(rownames(lung))

[1] "Xkr4"    "Gm19938" "Rp1"     "Sox17"   "Gm37587" "Gm7357"

For multimodal data, list the assays using:

Assays(lung)

[1] "RNA"

To return the number of cells or features use:

# Return number of cells across all layers
num_cells <- ncol(lung)
# Return number of features across all layers
num_features <- nrow(lung)
# Return row by column information
dim(lung)

[1] 20329 28494

Note

ncol(lung) will return the number of cells across all layers. If you want a specific layer, use ncol(lung[["RNA"]]$counts.S89Inf42). This information is also quickly accessible via the Global Environment when using RStudio or by using str().

We can also quickly plot the number of cells from the metadata:

# colors
sample_colors <- c(
  S89Inf42 = "#0072B2",  # blue
  S91Hom42 = "#E69F00",  # orange
  S93Hom3 = "#009E73",  # bluish green
  S99Inf42 = "#CC79A7"   # reddish purple
)
# Visualize the number of cell counts per sample
lung@meta.data %>% 
  ggplot(aes(x = orig.ident, fill = orig.ident)) + 
  geom_bar(color = "black") +
  stat_count(geom = "text", colour = "black", size = 3.5, 
             aes(label = after_stat(count)),
             position = position_stack(vjust = 0.5)) +
  scale_fill_manual(values=sample_colors)+
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  ggtitle("Number of Cells per Sample")

Save the object

At this point, our Seurat object may be fairly large. It is wise to save this for downstream applications using saveRDS().

saveRDS(lung, "./outputs/merged_Seurat_lung.rds")

References

The following resources were instrumental in designing this lesson:

--- title: "scRNA-seq Analysis Ecosystems: Tools, Resources, & Getting Started with Seurat" format: html: embed-resources: true code-tools: true code-overflow: wrap code-link: true execute: keep-md: true toc: true date-modified: last-modified --- ## Learning Objectives * Learn about options for analyzing your scRNA-Seq data. * Learn about resources for learning R / Python programming. * Learn about AI resources * Learn how to import your data for working with R. * Learn about Seurat and the Seurat object including how to create the object and access and manipulate stored data. --- ## Exploratory Data Analysis: scRNA-Seq This tutorial assumes that all pre-processing steps (read demultiplexing, FASTQ QC, reference based alignment, error correction) have been completed. At this stage of the analysis, a gene-by-cell count matrix has been generated for each sample. --- ### Analysis Ecosystems Following Cell Ranger or other preprocessing tools, you will typically obtain a gene-by-cell count matrix for each sample. The three dominant open-source ecosystems for downstream single-cell analysis are: --- #### 1. R — **Seurat** [https://satijalab.org/seurat/articles/get_started_v5_new](https://satijalab.org/seurat/articles/get_started_v5_new) Seurat (Satija Lab) is a unified, high-level framework for single-cell transcriptomics, multimodal data (e.g., RNA + ATAC), and spatial transcriptomics. It has a large user base and extensive documentation, making it an excellent starting point for new analysts. Seurat v5 introduced a **layered assay structure** and improved support for large-scale datasets, including disk-backed matrices and sketch-based workflows for million-cell analyses. The ecosystem also supports reference mapping (e.g., Azimuth) and integrates well with multimodal workflows. --- #### 2. R — **Bioconductor (SingleCellExperiment ecosystem)** [https://bioconductor.org/books/release/OSCA/](https://bioconductor.org/books/release/OSCA/) Bioconductor provides a modular, statistically rigorous ecosystem for analyzing biological data. Single-cell workflows are built around the `SingleCellExperiment` object and leverage highly interoperable packages (70+) including `scater`, `scran`, `scuttle`, and `batchelor`. Rather than a single monolithic framework, Bioconductor emphasizes composability and statistical transparency. Resources: * [*Orchestrating Single-Cell Analysis with Bioconductor*](https://bioconductor.org/books/release/OSCA/){target=_blank} - focuses on scRNA-Seq * [*Orchestrating Spatial Transcriptomics Analysis with Bioconductor*](https://bioconductor.org/books/release/OSTA/){target=_blank} - focuses on spatial transcriptomics --- #### 3. Python — **scverse** [https://scverse.org/learn/](https://scverse.org/learn/) The `scverse` ecosystem is a collection of interoperable Python packages centered around the [`AnnData`](https://anndata.readthedocs.io/en/latest/){target=_blank} object. Core tools include: * `scanpy` (core analysis workflows) * `scvi-tools` (deep generative modeling, batch correction, multimodal integration) * `muon` / `mudata` (multimodal analysis) * `scirpy` (immune repertoire) * `squidpy` (spatial transcriptomics) scverse is widely adopted in large atlas-scale projects and is particularly strong for machine learning–based workflows and scalable analysis. Resources: * [Single-cell best practices](https://www.sc-best-practices.org/preamble.html){target=_blank} - covers different analysis ecosystems and workflows. --- These ecosystems require familiarity with either R or Python. Your choice will likely depend on your background, your lab’s preferences, and your scientific goals. ::: {.callout-note} There are also tools outside of R and Python. ::: --- ::: {.callout-important} ### Interoperability Across Ecosystems In practice, you may benefit from using tools across multiple ecosystems. Seurat, Bioconductor, and scverse use different object classes, but they store many of the same core components: expression matrices, cell metadata, feature metadata, dimensional reductions, graphs, and other analysis results. | Concept | `Seurat` | `SingleCellExperiment` | `AnnData` | |---|---|---|---| | Main expression data | `assays`, such as the `RNA` assay | `assays`, such as `counts` or `logcounts` | `X` and `layers` | | Cell metadata | `meta.data` | `colData` | `obs` | | Gene / feature metadata | stored within assays | `rowData` | `var` | | Dimensional reductions | `reductions` | `reducedDims` | `obsm` | | Graphs / cell-cell relationships | `graphs`, `neighbors` | `colPairs` | `obsp` | | Multimodal support | Multiple assays, e.g., RNA, ATAC, ADT | `altExp()` for alternative feature sets; `MultiAssayExperiment` for broader multi-omics | Usually one modality per `AnnData`; `MuData` object for multimodal | | Miscellaneous results | `misc`, `tools`, `commands` | `metadata` | `uns` | One important difference is matrix orientation. In `Seurat` and `SingleCellExperiment`, expression matrices are typically stored as **features x cells**. In `AnnData`, the primary matrix is stored as **observations x variables**, which for scRNA-seq usually means **cells x genes**. Conversion tools handle this transpose, but it is a common source of confusion when manually inspecting objects. A useful mental map is: * `Seurat@meta.data`, `colData(sce)`, and `adata.obs` all describe cells. * Seurat assay-level metadata describes genes or features, although some user- or tool-generated feature information may also appear in `misc`; the closest equivalents are `rowData(sce)` and `adata.var`. * Seurat assay `layers`, SCE `assays`, and AnnData `layers` can store different versions of the expression matrix, such as raw counts, normalized data, or scaled data. * PCA, UMAP, and t-SNE coordinates live in Seurat `reductions`, SCE `reducedDims`, and AnnData `obsm`. [`AnnData` (`.h5ad`)](https://anndata.readthedocs.io/en/latest/){target=_blank} remains the most useful cross-platform exchange format for moving single-cell data between Python, Seurat, and Bioconductor workflows. For R-based conversion, [`anndataR`](https://bioconductor.org/packages/release/bioc/html/anndataR.html){target=_blank} is now our preferred starting point because it can read and write `.h5ad` files and convert to/from both `Seurat` and `SingleCellExperiment` objects. [`zellkonverter`](https://bioconductor.org/packages/release/bioc/html/zellkonverter.html){target=_blank} remains useful for `SingleCellExperiment` <-> `AnnData` workflows, especially in Bioconductor-centered analyses. Check out the [`anndataR` documentation](https://anndatar.scverse.org/articles/usage_seurat.html){target=_blank} for helpful diagrams showing how `AnnData` maps to `Seurat` and [`SingleCellExperiment`](https://anndatar.scverse.org/articles/usage_singlecellexperiment.html){target=_blank}. You may still see older tutorials recommending `SeuratDisk` for `.h5Seurat` <-> `.h5ad` conversion, but it does not appear to be actively maintained for current Seurat workflows and is not a reliable choice for Seurat v5 objects. Even with better tooling, conversion can introduce edge cases. After conversion, verify that cell and gene identifiers, count layers, metadata, embeddings, and dimensional reductions are where you expect them. Older or non-standard `.h5ad` files can also cause problems; in those cases, it may be easiest to open the file in a Python `anndata` / `scanpy` environment and write out an updated `.h5ad` before converting again. ::: ::: {.callout-tip collapse=true} ## Point-and-click alternatives If you lack programming skills and learning to code seems daunting, there are GUI based alternatives including, but not limited to the following: * Partek Flow ([License available to NCI researchers](https://bioinformatics.ccr.cancer.gov/btep/partek-flow-bulk-and-single-cell-rna-seq-data-analysis/){target=_blank}) - Relevant trainings: - [Introduction to Single Cell RNA Sequencing Analysis using Partek Flow](https://cbiit.webex.com/cbiit/ldr.php?RCID=e3447199f5d276671451460eb53b1c09){target=_blank} - [Partek Flow Integration of single cell RNA sequencing and single cell ATAC sequencing data](https://cbiit.webex.com/cbiit/ldr.php?RCID=df283b4fdd54da9169ac4976b4498dcc){target=_blank} - [Advanced single cell RNA sequencing analysis](https://cbiit.webex.com/cbiit/ldr.php?RCID=a9e4ea60bf8674f5298e640270d35757){target=_blank} * Qlucore Omics Explorer ([License available to NCI-CCR researchers](https://bioinformatics.ccr.cancer.gov/btep/qlucore-bioinformatics-software-for-next-gen-seq-analysis/){target=_blank}). ::: --- ## Python Learning Resources * [Resources for Beginners from python.org](https://www.python.org/about/gettingstarted/){target=_blank} * [Python Introductory Education Series](https://bioinformatics.ccr.cancer.gov/btep/courses/python-introductory-education-series-2025){target=_blank} * [Glittr Resources](https://glittr.org/?per_page=25&sort_by=stargazers&sort_direction=desc&search=python&tags=59){target=_blank} ## R Learning Resources * [R Cheat Sheets](https://posit.co/resources/cheatsheets/){target=_blank}. * [R Crash Course](./RCrashCourse.html.md){target=_blank} * Free tutorials. There are at least 230 Git repositories that focus on R training related to bioinformatics. Check out [Glittr](https://glittr.org/?per_page=25&sort_by=stargazers&sort_direction=desc&tags=61){target=_blank}. ### BTEP lessons / courses * [R Introductory Series](https://bioinformatics.ccr.cancer.gov/docs/rintro/index.html){target=_blank} * [Data Wrangling with R](https://bioinformatics.ccr.cancer.gov/docs/data-wrangle-with-r/){target=_blank} * [Data Visualization with R](https://bioinformatics.ccr.cancer.gov/docs/data-visualization-with-r/index.html){target=_blank} * [A Beginner's Guide to Troubleshooting R Code](https://bioinformatics.ccr.cancer.gov/docs/btep-coding-club/CC2023/Troubleshooting_Rprog/){target=_blank} ## Python Vs R The argument isn't necessarily about which language is better, but which one you are most comfortable with and can get the job done. <iframe src="r_vs_python_bioinformatics.html" width="100%" height="550px" frameborder="0"></iframe> **For the remainder of this tutorial, we will focus on the Seurat ecosystem.** ## What is Seurat? > Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. Seurat aims to enable users to identify and interpret sources of heterogeneity from single-cell transcriptomic measurements, and to integrate diverse types of single-cell data. ---[https://satijalab.org/seurat/](https://satijalab.org/seurat/){target=_blank} Advantages: * beginning-to-end workflows * Nice vignettes * Regularly updated * Somewhat extensible * Scalable Disadvantages: * Updates break some functionality * Interoperability is lacking * Maintained by a single group * Many functions are poorly documented #### Updates with Seurat v5 You can read about major changes between Seurat v5 and v4 [here](https://satijalab.org/seurat/articles/announcements.html){target=_blank}. Seurat v5 introduced the following new features: * Integrative multi-modal analysis with bridge integration * ‘Sketch’-based analysis of large data sets * methods for spatial transcriptomics * assay layers ### How much R programming do I need to know to use Seurat? The answer to this question will largely depend on the user. While some will be able to learn R on the go, others will need to know some amount of R programming prior to beginning their analysis. If needed, check out this quick [R Refresher](./RCrashCourse.html). There are also several [R courses available through BTEP](#btep-lessons-courses) with self-contained, asynchronous content that can be used to learn independently of live sessions. **The most important thing to know is how to find help!** ### How to find help? ALWAYS, ALWAYS, ALWAYS read the documentation. Use the help pane in the lower right of RStudio or the functions, `help()` and `help.search()` or `?` and `??`. Check out package vignettes (`vignette()`). Check the Github site if one is available for help with a specific package. There may also be known issues listed under the Github Issues tab ([Seurat issues](https://github.com/satijalab/seurat/issues){target=_blank}). #### Google for help! Use Google or other internet browser to troubleshoot error message or find help performing a specific task. But, make sure you are precise and informative in your search. Stack Overflow is a particularly great resource. A good Google search includes 3 elements: 1. The specific action (e.g., how to rename a column). 2. The programming language (e.g., R programming). 3. The specific style/technique for coding (e.g., dplyr or tidyverse package) Example: "How to rename a column in R with dplyr/tidyverse”. --- [https://crd230.github.io/tips.html](https://crd230.github.io/tips.html){target=_blank}. If this fails, feel free to email us at [BTEP](mailto:ncibtep@nih.gov){target=_blank}. #### Use Generative AI Use Generative AI for help. For a list of generative AI resources available to NIH researchers, see our [BTEP GenAI Resource Guide](https://bioinformatics.ccr.cancer.gov/docs/resources-for-bioinformatics/generative_ai/){target=_blank}. ## GenAI and Single Cell Generative AI is beginning to support single-cell analysis by helping with workflow guidance, cluster annotation, marker interpretation, and summarizing results. * [**SCassist**](https://pubmed.ncbi.nlm.nih.gov/40650988/){target=_blank} is one example in the Seurat ecosystem; it uses large language models to help guide parameter choices and interpret scRNA-seq results. * [**CellWhisperer**](https://www.nature.com/articles/s41587-025-02857-9){target=_blank} is a newer natural-language interface for exploring single-cell datasets, making it easier to query and interpret results conversationally. * [**scGPT**](https://www.nature.com/articles/s41592-024-02201-0){target=_blank} and [**Geneformer**](https://www.nature.com/articles/s41586-023-06139-9){target=_blank} are examples of single-cell foundation models trained on large transcriptomic datasets. These are primarily research tools and may support tasks such as cell annotation, representation learning, and prediction. * Although these tools are promising, AI-generated suggestions should be treated as decision support rather than ground truth and should always be checked against QC metrics, marker genes, metadata, and biological context. ## Accessing R / RStudio ### Using R locally Single cell analysis can be computationally demanding. Running the analysis locally will likely only be possible with smaller sample sizes ( <= 8 samples). ::: {.callout-note title="R Installation" collapse="true"} Mac or Windows (local installation): Follow instructions here: <https://posit.co/download/rstudio-desktop/>. NIH-managed machines: Follow your institute’s IT installation policies or submit a service request if administrative privileges are required. ::: ::: {.callout-note title="Scalability with Seurat" collapse="true"} ### BPCells The package `BPCells` can be used to manage SeuratObjects and make your analysis more computationally efficient. BPCells is an R package that stores single-cell count matrices on disk in a compressed format so very large datasets can be analyzed with much lower memory use. However, if your Seurat object fits comfortably in memory, don’t use BPCells just because it exists. Once you are in the 100k+ cell range, merging many samples, or starting to hit RAM limits, BPCells becomes worthwhile. For atlas-scale or million-cell work, it is the recommended direction, but expect some compatibility tradeoffs with plotting and third-party tools. See [this vignette](https://satijalab.org/seurat/articles/seurat5_bpcells_interaction_vignette){target=_blank} for more information. ### Sketch-Based Assays with Seurat v5 A sketch-based assay in **Seurat v5** is a smaller, in-memory assay built from a representative subset of cells from a much larger dataset, typically using leverage-score sampling so that both common and rare cell states are better preserved than with simple random subsampling. Consider using a sketch when your dataset is too large for convenient in-memory analysis, such as large multi-sample or million-cell studies, but remember that the sketch is still an approximation: important populations should be checked for stability across sketch sizes and validated by projecting results back to the full dataset. Useful references: [Seurat v5 sketch analysis vignette](https://satijalab.org/seurat/articles/seurat5_sketch_analysis), [SketchData reference](https://satijalab.org/seurat/reference/sketchdata), [LeverageScore reference](https://satijalab.org/seurat/reference/leveragescore), and [BPCells with Seurat Objects](https://satijalab.org/seurat/articles/seurat5_bpcells_interaction_vignette.html). ::: ### HPC Open OnDemand Due to limits on computational resources, you may be interested in running your analysis on an HPC. [Biowulf](https://hpc.nih.gov/){target=_blank} is the NIH high performance compute cluster. It has greater than 90k processors, and can easily perform large numbers of simultaneous jobs. Biowulf also includes greater than 600 pre-installed scientific software and databases. To use the NIH HPC Systems, you must have an [NIH HPC account](https://hpc.nih.gov/docs/accounts.html){target=_blank}. When using R interactively, the easiest way to access R on Biowulf is via an integrated development environment (e.g. RStudio, VSCode, etc.). These interactive applications are available on Biowulf through [NIH HPC Open OnDemand](https://hpc.nih.gov/ondemand/index.html){target=_blank}. #### To Connect to R with HPC Open OnDemand 1. Connect to the NIH network and use the following link to access Open OnDemand: <https://hpcondemand.nih.gov>{target=_blank}. 2. Login with your PIV Card. 3. Select the interactive app of interest (e.g., RStudio). 4. Adjust computational resources for your session (CPUs, memory, local scratch, etc.) and Launch. * A good starting point for scRNA-Seq: * Number of Hours - 8 * Number of CPUs - 8 * Allocated Memory (GB) - 100 * Allocated Local Scratch (GB) - 50 5. Start the session. ::: {.callout-note collapse=true} ## Need a Biowulf Account? If you do not already have a Biowulf account, you can obtain one by following the instructions [here](https://hpc.nih.gov/docs/accounts.html){target=_blank}. NIH HPC accounts are available to all NIH employees and contractors listed in the NIH Enterprise Directory. Obtaining an account requires PI approval and a nominal fee of $40 per month. Accounts are renewed annually contingent upon PI approval. When you apply for a Biowulf account you will be issued two primary storage spaces: `/home/$USER` (16 GB) `/data/$USER` (100 GB) *You may request more space in `/data/$USER` by filing an [online storage request](https://hpcnihapps.cit.nih.gov/auth/dashboard/storage_request.php){target=_blank}.* ::: #### NIH HPC Documentation If you need help with Biowulf, the NIH HPC systems are well-documented at [hpc.nih.gov](https://hpc.nih.gov/){target=_blank}. The [User guides](https://hpc.nih.gov/docs/user_guides.html){target=_blank}, [Training documentation](https://hpc.nih.gov/training/){target=_blank}, and [How To](https://hpc.nih.gov/docs/how_to.html){target=_blank} docs are also fantastic resources for getting help with most HPC tasks. For all other questions, consider emailing the HPC team directly at [staff@hpc.nih.gov.](mailto:staff@hpc.nih.gov){target=_blank} ## Getting Started with Seurat ### The Data In this tutorial, we are using a subset of data from [Niethamer et al. 2025, *Longitudinal single-cell profiles of lung regeneration after viral infection reveal persistent injury-associated cell states*, published in *Cell Stem Cell*](https://doi.org/10.1016/j.stem.2024.12.002){target="_blank"}. Count matrices are available in GEO [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE262927){target="_blank"}, and the raw fastq files are available in the SRA. A project directory containing these data and associated files can be downloaded [here](scRNASeq2026_Example.zip). This study examined lung regeneration following influenza viral infection using single-cell RNA sequencing to track cell states longitudinally. The full study profiled multiple genotypes; however, the four samples used in this tutorial were all generated from mice carrying the Ki67-CreERT2; ROSA26-LSL-tdTomato reporter system, which enables lineage tracing of proliferating cells following tamoxifen induction. Our subset includes lung tissue collected at two homeostasis time points (day 3 and day 42) and two samples from the post-infection condition at day 42. The two post-infection samples (GSM8181589 and GSM8181599) serve as biological replicates of the infected condition at day 42, enabling comparison of injury-associated lung cell states against homeostatic controls. | Sample | Species | Genotype | Treatment | Time Point | Status | |-------------------------|---------|-----------------------------------|-----------------------|------------------|-------------------------| | GSM8181589 (`S89Inf42`) | Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen + Influenza | Day 42 post-infection | Lung (post-infection) | | GSM8181591 (`S91Hom42`) | Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen | Homeostasis Day 42 | Lung (homeostasis) | | GSM8181593 (`S93Hom3`) | Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen | Homeostasis Day 3 | Lung (homeostasis) | | GSM8181599 (`S99Inf42`) | Mouse | Ki67-CreERT2; ROSA26-LSL-tdTomato | Tamoxifen + Influenza | Day 42 post-infection | Lung (post-infection) | ### Installation Seurat can be installed directly from CRAN. ```{r} #| eval: false install.packages("Seurat") ``` If you would like to install the development version or previous versions, see the installation instructions available [here](https://satijalab.org/seurat/articles/install){target=_blank}. ::: {.callout-note} The version of R on Biowulf used for this tutorial is 4.5.2 (`module -r avail '^R$'`), and the version of Seurat that complements this installation is v5.4.0. Seurat does not need to be installed if using R on Biowulf. This tutorial does not use `renv` for package management. Instead, we rely on the HPC's centrally managed R library, which ensures that all packages are compiled against the correct system libraries. If you are running this tutorial on a personal machine or a different computing environment, you may need to install the required packages manually and may encounter system dependency issues not covered here. ::: Load the package from your R library. We will also go ahead and load packages from the tidyverse. ```{r} #| message: false #| warning: false library(Seurat) library(tidyverse) ``` ### Seurat Pre-processing Overview The steps below outline the standard Seurat preprocessing workflow, from raw count matrices to clustered cells. Before diving into the code, it is helpful to understand both the overall sequence of steps and the key decision you will make early on: **Your analysis strategy (See @nte-AnalysisStrategy)**. #### Core preprocessing steps Regardless of strategy, the major steps are the same: 1. **Import data** — read count matrices from Cell Ranger / STARsolo output or other (h5, mtx, or similar formats). 2. **Quality control** — filter low-quality cells based on the number of detected genes (`nFeature_RNA`), UMI counts (`nCount_RNA`), the percentage of mitochondrial reads, and other factors. 3. **Normalization** — scale gene expression values to account for differences in sequencing depth across cells (e.g., log-normalization) (`NormalizeData()`). 5. **Feature selection** — identify highly variable genes that drive biological differences, reducing noise from lowly or uniformly expressed genes (`FindVariableFeatures()`). *Handled automatically by `SCTransform()`.* 4. **Scaling** — center and scale each gene's expression across cells so that highly expressed genes do not dominate downstream analysis (`ScaleData()`). This also allows regression of unwanted sources of variation (e.g. cell cycle, mitochondrial content). > **Note:** `SCTransform()` replaces both normalization and scaling (steps 3–4) in a single step using regularized negative binomial regression. It simultaneously corrects for sequencing depth and scales the data, and also performs feature selection (step 5). 6. **Dimensionality reduction** — use PCA to compress the data, followed by UMAP or t-SNE for visualization. 7. **Clustering** — build a shared nearest-neighbor graph and apply a community detection algorithm to group cells with similar expression profiles. 8. **Downstream analysis** — identify marker genes, annotate cells, and compare conditions. :::: {#nte-AnalysisStrategy .callout-note title="Choosing an analysis strategy" collapse="true"} A key early decision is whether to work with a **single merged object** or **individual per-sample objects**. Both approaches support per-sample QC — the difference is primarily one of workflow style and how interactively you want to inspect and process each sample. #### Strategy A: Merge early, use layers (this tutorial) In Seurat v5, you can merge multiple samples into a single object while retaining per-sample count data in separate **layers** (e.g., `counts.GSM8181589`, `counts.GSM8181591`). QC metrics can be visualized across all samples simultaneously, and thresholds can still be applied on a per-sample basis by filtering on `orig.ident`. This approach works well for interactive, exploratory analysis where you want to compare samples side by side before deciding on cutoffs. **Best for**: interactive workflows where you want to inspect all samples together, compare QC distributions across samples, and make informed filtering decisions before proceeding. **Considerations**: Requires more hands-on decision-making at the QC stage, as thresholds are applied manually after inspecting the merged object. #### Strategy B: Per-sample objects, process and merge later Alternatively, you can create a separate Seurat object for each sample, run QC and normalization independently, and then merge or integrate the filtered objects. This approach lends itself to scripted, automated pipelines where each sample is processed with fixed thresholds from QC through to clustering without manual intervention between steps. **Best for**: automated or reproducible pipelines where the same workflow is applied to each sample in sequence, or when samples need to be processed independently before any comparison. **Considerations**: More objects to manage, but the workflow can be more easily scripted end-to-end. Samples can later be merged with `merge()`. ::: {.callout-tip collapse=true} ## Working with per-sample objects If you want to process each sample as an individual object, you can create them using a `for` loop (see the collapsible tip in the **Creating a Seurat Object** section below), run QC and normalization per object, and merge the filtered objects before proceeding. Whether you ultimately merge samples into a single object or keep them separate, an integration step is likely necessary when comparing across conditions or batches. ::: :::: ### Importing Data Now that we have loaded the package, we can import our data. Generally, in R programming, functions that involve data import begin with "read / Read". Seurat includes a number of read functions for different data / platform types. For 10X scRNA-Seq data, the following functions will be most relevant: * `Read10X()` - primary argument is a directory from CellRanger containing the matrix.mtx, genes.tsv (or features.tsv), and barcodes.tsv files. * `Read10X_h5()` - reads the hdf5 file from 10X CellRanger (scATAC-Seq or scRNA-Seq). * `ReadMtx()` - read from local or remote (arguments for the .mtx file, cells/barcodes file, and features/genes file). **This option is not just for 10X data, but is useful for most platforms and pipelines**. The function you choose will depend on the format of your data. ::: {.callout-note} There are a number of other read and load functions available for different platforms and data types (e.g., Akoya CODEX, Nanostring SMI, SlideSeq, 10X Visium, 10X Xenium, Vizgen MERFISH). You can find relevant functions [here](https://satijalab.org/seurat/reference/){target=_blank}. ::: ::: {.callout-note} ## Importing data in a .tsv or .csv If you are trying to import publicly available data, and the available count data was saved as a `.txt`, `.tsv`, or `.csv` file, you can import these data using regular base R import functions (e.g., `read.table()`, `read.csv()`). The data does not need to be in a sparse format to create a Seurat object. However, if you would like to save memory upon import by loading as a sparse data matrix, see [`readSparseCounts()`](https://rdrr.io/bioc/scuttle/man/readSparseCounts.html){target=_blank} from the Bioconductor package [`scuttle`](https://www.bioconductor.org/packages/release/bioc/html/scuttle.html){target=_blank}. ::: #### Should I use the filtered for unfiltered data from Cell Ranger? The raw / unfiltered matrix includes all valid cell barcodes from GEMs (Gel Bead-In EMulsions). This includes [GEMs with free-floating mRNA from lysed or dead cells](https://kb.10xgenomics.com/hc/en-us/articles/360001892491-What-is-the-difference-between-the-filtered-and-raw-gene-barcode-matrix). Cell Ranger uses a barcode vs UMI count rank plot (knee plot) to eliminate such background noise. While using the raw files will give you greater control over filtering, these files are also much larger. You can visualize the web summary output from Cell Ranger to see what was filtered. If you do not see a clear inflection point, you may want to use the raw files and apply other techniques for filtering. ::: {.callout-note title="STARsolo"} If you use STARsolo, an alternative to 10X CellRanger for gene quantification output, you can use `ReadSTARsolo()` to import the data with Seurat. ::: #### Loading hdf5 files To load a single file, we use ```{r} S89Inf42 <- Read10X_h5("./data/GSM8181589_EEM-scRNA-126.h5") ``` The only required argument is the path to the 10X h5 file. To read multiple files, you can use a for loop or a function(s) from the apply family. For example, ```{r} #| cache: true #| cache-lazy: false # List files files <- list.files(path = "./data/", pattern = "*.h5") # Create a list of count matrices #lapply returns a list h5_read <- lapply(paste0("./data/", files), Read10X_h5) # Automated way to assign names using the GSM accession number #sapply returns a vector # names(h5_read) <- sapply(files, # function(x) { str_split_1(x, "_")[1] }, # USE.NAMES = FALSE) # Or assign names manually # Useful for new names names(h5_read) <- c("S89Inf42", "S91Hom42", "S93Hom3", "S99Inf42") ``` Let's take a look at `S89Inf42`. ```{r} S89Inf42[1:5,1:5] ``` The data is stored as a sparse matrix to save CPU, time, and memory. Each `.` is actually a 0. ### Creating a Seurat Object Once we have read in the matrices, the next step is to create a Seurat object. The Seurat object will be used to store the raw count matrices, sample information, and processed data (normalized counts, reductions, etc.). `CreateSeuratObject()` is used to create the object. This requires the matrix of counts for the first argument (`counts = S89Inf42`). We can also include a project name (e.g., the sample name). Other useful arguments include `min.cells` and `min.features`, which allow some initial filtering. For example, `min.cells = 10` will filter genes / features that are not present across a minimum of 10 cells, while `min.features = 200` will filter cells that do not contain a minimum of 200 genes / features. ```{r} # Single sample S89Inf42 <- CreateSeuratObject(counts = S89Inf42, project = "S89Inf42", min.cells = 10, min.features = 200) # View S89Inf42 S89Inf42 ``` For all samples, we can use `mapply` to loop through the `h5_read` list and the list names (`names(h5_read)`). ```{r} #| cache: true #| cache-lazy: false #| warning: false # All samples lung <- mapply(CreateSeuratObject, counts = h5_read, project = names(h5_read), MoreArgs = list(min.cells = 10, min.features = 200)) # View lung lung # Remove the original sparse matrices rm(h5_read) ``` We can merge our Seurat objects, which are now in a list, in order to view the QC metrics and compare across samples. If samples are fairly different, **QC thresholds should be applied on a per sample basis**. ```{r} lung <- merge(lung[[1]], y = lung[2:length(lung)], add.cell.ids = names(lung), project = "Lung") ``` The cell IDs can overlap between samples, so here we append the sample ID as a prefix to each cell ID using `add.cell.ids`. If this isn't done "and any cell names are duplicated, cell names will be appended with _X, where X is the numeric index of the object in c(x, y)" (See `merge.Seurat()`). :::{.callout-warning} We could have created a merged object using `h5_read`, as a list can be provided for the argument `counts` (DO NOT RUN THIS STEP): ```{r} #| eval: false lung2 <- CreateSeuratObject(counts = h5_read, min.cells = 3, min.features = 200) ``` In Seurat 5, passing a list to `CreateSeuratObject()` is better supported than in earlier versions. However, this approach still does not give us control over cell IDs, and it bypasses the ability to set per-sample project names, which are useful for tracking samples during QC. For these reasons, we prefer the `mapply` + `merge()` approach used above. ::: ::: {.callout-tip collapse=true} ## Use a for loop to load the data and create an object for each file (DO NOT RUN) If you want a separate object for each object, you can use this for loop: ```{r} #| eval: false # Create a Seurat object for each sample for (file in files){ seurat_data <- Read10X_h5(paste0("./data/", file)) seurat_obj <- CreateSeuratObject(counts = seurat_data, min.features = 200, min.cells = 3, project = file) assign(file, seurat_obj) } ``` See [here](https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/03_SC_quality_control-setup.md){target=_blank} for an explanation of the code. ::: ### Inspect and work with a Seurat Object The Seurat Object is a data container for single cell RNA-Seq and related data. It is an S4 object, which is a type of data structure that stores complex information (e.g., scRNA-Seq count matrix, associated sample information, and data / results generated from downstream analyses) in one or more slots. It is extensible and can interact with many different methods, including those from other developers. You can see the primary slots using `glimpse()`. ```{r} glimpse(S89Inf42) ``` `S89Inf42` has 13 slots: assays, meta.data, active.assay, active.ident, graphs, neighbors, reductions, images, misc, version, commands, and tools. See a description of the slots [here.](https://satijalab.github.io/seurat-object/reference/Seurat-class.html#see-also-1){target=_blank} **For guidance on accessing data stored in the SeuratObject, see [this vignette](https://satijalab.org/seurat/articles/seurat5_essential_commands){target=_blank}.** Unlike previous iterations of Seurat, Seurat v5 contains assays in layers to accommodate different assays and data types. ["For example, these layers can store: raw counts (layer='counts'), normalized data (layer='data'), or z-scored/variance-stabilized data (layer='scale.data')."](https://satijalab.org/seurat/articles/announcements.html){target=_blank} The diagrams below summarize a typical Seurat v5 object structure. The first shows the top-level slots in the object, and the second zooms in on the `RNA` assay to show how Seurat v5 stores expression matrices in **layers**. ```{mermaid} flowchart TB A["Seurat v5 object"] A --> B["assays"] A --> C["meta.data"] A --> D["active.assay"] A --> E["active.ident"] A --> F["reductions"] A --> G["graphs"] A --> H["neighbors"] A --> I["images"] A --> J["misc"] A --> K["version"] A --> L["commands"] A --> M["tools"] classDef main fill:#143642,color:#ffffff,stroke:#0b1f26,stroke-width:2px; classDef slot fill:#d7efe9,color:#102a2e,stroke:#4f8a7b,stroke-width:1.5px; class A main; class B,C,D,E,F,G,H,I,J,K,L,M slot; ``` ```{mermaid} flowchart TB A["Seurat object"] A --> B["assays"] B --> C["RNA assay"] C --> D["layers"] D --> E["counts<br/>raw counts"] D --> F["data<br/>normalized expression"] D --> G["scale.data<br/>scaled expression"] E --> H["counts.S89Inf42"] E --> I["counts.S91Hom42"] E --> J["counts.S93Hom3"] E --> K["counts.S99Inf42"] C --> L["feature metadata"] C --> M["variable features"] classDef main fill:#143642,color:#ffffff,stroke:#0b1f26,stroke-width:2px; classDef slot fill:#d7efe9,color:#102a2e,stroke:#4f8a7b,stroke-width:1.5px; classDef assay fill:#f7d08a,color:#3b2c18,stroke:#b8871a,stroke-width:1.5px; classDef layer fill:#f5f1e8,color:#2f2a1f,stroke:#a08d62,stroke-width:1px; classDef note fill:#f3f7fb,color:#213547,stroke:#9db4c7,stroke-width:1px; class A main; class B slot; class C,D assay; class E,F,G,H,I,J,K layer; class L,M note; ``` In a newly created object, many slots are still empty. As you run preprocessing, clustering, dimensionality reduction, and annotation, Seurat progressively fills these slots with new results. To return the names of the layers: ```{r} Layers(S89Inf42) ``` To access a specific count layer use: ```{r} #| eval: false S89Inf42[["RNA"]]$counts |> head() # or S89Inf42@assays$RNA$counts |> head() # or LayerData(S89Inf42, assay = "RNA", layer = "counts") |> head() ``` Because we have just created this object, most of the slots are empty. We will save results back to the object and fill the slots as we work. You can always check the version of Seurat that was used to build the object with: ```{r} S89Inf42@version ``` And check commands and parameters used to generate results with: ```{r} S89Inf42@commands ``` We haven't worked on the object, so there is nothing here. Compare with `pbmc_small`, a built-in dataset in the `SeuratObject` package. ```{r} head(pbmc_small@commands, 1) ``` **This information is useful for documenting your analysis.** #### Examine metadata The metadata in the Seurat object is located in `lung@meta.data` and contains the information associated with each cell. We can access the metadata using: ```{r} head(lung@meta.data) # using head to return only the first 6 rows ``` or ```{r} head(lung[[]]) ``` Accessing a column or columns: ```{r} # Access a single column head(lung$orig.ident) head(lung[["orig.ident"]]) # Access multiple columns and rows head(lung[[c("orig.ident", "nCount_RNA")]])[1:3,] ``` As we can see, there are different ways to access and work with the metadata. In addition, we can see that we begin with built-in information such as the `orig.ident`, `nCount_RNA`, and `nFeature_RNA`. `orig.ident` - defaults to the project labels `nCount_RNA` - number of UMIs per cell `nFeature_RNA` - number of genes / features per cell This information, along with other metrics, will be used for quality filtering. ::: {.callout-note} ## Seurat Idents? Each cell in a Seurat object can be assigned an identity label, and Seurat stores one set of these labels as the active Ident (`active.ident`). These identity labels usually come from metadata categories such as sample ID, cluster membership, condition, or cell type annotation. Many Seurat functions use the active.ident automatically when grouping cells for visualization or analysis.See `?Idents()` and [this vignette](https://satijalab.org/seurat/articles/essential_commands#identity-class-labels){target=_blank} for working with idents. ::: ::: {.callout-note collapse=true} ## Subsetting the object We can subset by features or cells by `lung[1:10,]` or `lung[,1:10]` respectively. To subset using metadata, use `subset()`. For example, if we want to subset to the infected samples only, we can use: ```{r} subset(x = lung, idents = c("S89Inf42", "S99Inf42")) ``` We will look at subsetting more when filtering cells based on quality. ::: #### Add to metadata We can add information to our metadata by accessing and assigning to metadata columns or using `?AddMetaData()`. Add treatment condition to the metadata (Tamoxifen only vs. Tamoxifen + Influenza): ```{r} lung$treatment <- ifelse(str_detect(lung@meta.data$orig.ident, "S91Hom42|S93Hom3"), "Tamoxifen", "Tamoxifen_Influenza") ``` Add time point information to the metadata: ```{r} lung$time_point <- case_when( str_detect(lung@meta.data$orig.ident, "S93Hom3") ~ "Homeostasis_Day3", str_detect(lung@meta.data$orig.ident, "S91Hom42") ~ "Homeostasis_Day42", .default = "Infection_Day42" ) ``` Add treatment + time point to the metadata: ```{r} lung$treatment_tp <- paste(lung$treatment, lung$time_point, sep="_") ``` #### Other useful information from the Seurat object The Seurat v5 object doesn't require all assays have the same cells. In this case, `Cells()` can be used to return the cell names of the default assay while `colnames()` can be used to return the cell names of the entire object. ```{r} head(Cells(lung, layer = "counts.S89Inf42")) ``` ```{r} head(colnames(lung)) ``` To return feature / gene names: ```{r} head(Features(lung)) head(rownames(lung)) ``` For multimodal data, list the assays using: ```{r} Assays(lung) ``` To return the number of cells or features use: ```{r} # Return number of cells across all layers num_cells <- ncol(lung) # Return number of features across all layers num_features <- nrow(lung) # Return row by column information dim(lung) ``` :::{.callout-note} `ncol(lung)` will return the number of cells across all layers. If you want a specific layer, use `ncol(lung[["RNA"]]$counts.S89Inf42)`. This information is also quickly accessible via the Global Environment when using RStudio or by using `str()`. We can also quickly plot the number of cells from the metadata: ```{r} # colors sample_colors <- c( S89Inf42 = "#0072B2", # blue S91Hom42 = "#E69F00", # orange S93Hom3 = "#009E73", # bluish green S99Inf42 = "#CC79A7" # reddish purple ) # Visualize the number of cell counts per sample lung@meta.data %>% ggplot(aes(x = orig.ident, fill = orig.ident)) + geom_bar(color = "black") + stat_count(geom = "text", colour = "black", size = 3.5, aes(label = after_stat(count)), position = position_stack(vjust = 0.5)) + scale_fill_manual(values=sample_colors)+ theme_classic() + theme(plot.title = element_text(hjust = 0.5, face = "bold")) + ggtitle("Number of Cells per Sample") ``` ::: ### Save the object At this point, our Seurat object may be fairly large. It is wise to save this for downstream applications using `saveRDS()`. ```{r} #| eval: false saveRDS(lung, "./outputs/merged_Seurat_lung.rds") ``` ## Other Tutorials of Interest * [Seurat Objects Explained](https://biostatsquid.com/seurat-objects-explained/){target=_blank} ## References The following resources were instrumental in designing this lesson: * [https://www.sc-best-practices.org/introduction/analysis_tools.html#single-cell-analysis-frameworks-and-consortia](https://www.sc-best-practices.org/introduction/analysis_tools.html#single-cell-analysis-frameworks-and-consortia){target=_blank} * [https://satijalab.org/seurat/articles/essential_commands](https://satijalab.org/seurat/articles/essential_commands){target=_blank} * [https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/03_SC_quality_control-setup.md](https://github.com/hbctraining/scRNA-seq_online/blob/master/lessons/03_SC_quality_control-setup.md){target=_blank}