--- title: "Single-Cell Multi-omics — R / Seurat + Signac walkthrough" subtitle: "Companion to *From One Cell to Multiple Molecular Views* (BTEP)" author: "Stefan Cordes, MD, PhD — NHLBI, NIH" date: "`r format(Sys.Date())`" output: html_document: toc: true toc_float: true code_folding: show theme: flatly --- ```{r setup, include=TRUE} # Code is shown but NOT executed by default, so this document knits to HTML # without the heavy single-cell stack or the multi-GB datasets installed. # To actually run a section: install the packages, download the data, and set # eval = TRUE (globally here, or per-chunk). knitr::opts_chunk$set(eval = FALSE, message = FALSE, warning = FALSE, fig.width = 6, fig.height = 4) ``` # About this notebook This is the R companion to the Python / scverse notebook (`multiomics_scverse.ipynb`). It walks through the two core **paired** multi-omics workflows on the **same real 10x Genomics PBMC datasets**, using the R ecosystem so you can compare directly: | Part | Example | Modalities | Engine | |------|---------|------------|--------| | 1 | CITE-seq | RNA + surface protein (ADT) | **Seurat**, Weighted Nearest Neighbor (WNN) | | 2 | 10x Multiome | RNA + ATAC | **Seurat + Signac**, WNN + regulatory analysis | > **Teaching reference, not a turnkey pipeline.** The chunks call the *real* APIs on > *real* downloaded data. They need the full stack installed and the datasets in your > working directory, and some steps (SCTransform, chromVAR) are slow. Treat this as the > canonical *shape* of each workflow. ## Install (run once) ```{r install} install.packages(c("Seurat", "Signac", "remotes", "ggplot2", "dplyr", "patchwork")) # Bioconductor genome / annotation / motif packages used in Part 2 if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c( "EnsDb.Hsapiens.v86", "BSgenome.Hsapiens.UCSC.hg38", "JASPAR2020", "TFBSTools", "chromVAR", "motifmatchr" )) # (optional) SeuratData provides the ready-made CITE-seq teaching dataset 'bmcite' remotes::install_github("satijalab/seurat-data") ``` --- # Part 1 · CITE-seq (RNA + surface protein) **Dataset:** *5k PBMCs from a healthy donor with TotalSeq-B antibodies (v3)* — 10x Genomics (the same file the Python notebook uses). **Goal:** compare RNA-only, protein-only, and the **WNN** joint representation, and surface RNA/protein *discordance*. ### 1.1 Download ```{r cite-download} cite_h5 <- "5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5" if (!file.exists(cite_h5)) { download.file( paste0("https://cf.10xgenomics.com/samples/cell-exp/3.1.0/", "5k_pbmc_protein_v3/", cite_h5), destfile = cite_h5, mode = "wb") } ``` ### 1.2 Build a Seurat object with two assays `Read10X_h5` returns a list with `Gene Expression` and `Antibody Capture`. We put RNA in the default assay and the ADTs in a second `ADT` assay on the same cells. ```{r cite-build} library(Seurat) library(patchwork) data <- Read10X_h5(cite_h5) pbmc <- CreateSeuratObject(counts = data[["Gene Expression"]]) pbmc[["ADT"]] <- CreateAssayObject(counts = data[["Antibody Capture"]]) # basic RNA QC pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-") pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 4500 & percent.mt < 20) Assays(pbmc) # "RNA" "ADT" ``` ### 1.3 Normalize each modality on its own terms RNA gets log-normalization; ADT gets **CLR** (centered log-ratio, `margin = 2` = across cells). This is the single most important "different noise models" point from the talk. ```{r cite-normalize} # --- RNA --- DefaultAssay(pbmc) <- "RNA" pbmc <- NormalizeData(pbmc) |> FindVariableFeatures() |> ScaleData() |> RunPCA() # --- ADT (protein) --- DefaultAssay(pbmc) <- "ADT" VariableFeatures(pbmc) <- rownames(pbmc[["ADT"]]) # use the whole (small) panel pbmc <- NormalizeData(pbmc, normalization.method = "CLR", margin = 2) |> ScaleData() |> RunPCA(reduction.name = "apca") # a separate PCA for protein ``` ### 1.4 Fuse with Weighted Nearest Neighbor (WNN) `FindMultiModalNeighbors` learns a **per-cell weight** that decides how much each cell should trust RNA vs protein, then builds one integrated graph. ```{r cite-wnn} pbmc <- FindMultiModalNeighbors( pbmc, reduction.list = list("pca", "apca"), dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight") pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_") pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, resolution = 1.0) ``` ### 1.5 Compare the three views side by side ```{r cite-compare} # single-modality UMAPs for comparison pbmc <- RunUMAP(pbmc, reduction = "pca", dims = 1:30, reduction.name = "rna.umap", reduction.key = "rnaUMAP_") pbmc <- RunUMAP(pbmc, reduction = "apca", dims = 1:18, reduction.name = "adt.umap", reduction.key = "adtUMAP_") p1 <- DimPlot(pbmc, reduction = "rna.umap", label = TRUE) + ggtitle("RNA only") p2 <- DimPlot(pbmc, reduction = "adt.umap", label = TRUE) + ggtitle("Protein (ADT) only") p3 <- DimPlot(pbmc, reduction = "wnn.umap", label = TRUE) + ggtitle("WNN (joint)") p1 + p2 + p3 ``` ### 1.6 RNA vs protein discordance `CD4` is the classic example: surface protein is clearly present while the transcript frequently drops out. Put the transcript and the ADT next to each other. ```{r cite-discordance} # ADT feature names vary by panel/version — check them first, then pick the CD4 tag: rownames(pbmc[["ADT"]]) cd4_adt <- grep("CD4", rownames(pbmc[["ADT"]]), value = TRUE)[1] # e.g. "CD4" or "CD4.1" DefaultAssay(pbmc) <- "RNA" fp_rna <- FeaturePlot(pbmc, "CD4", reduction = "wnn.umap") + ggtitle("CD4 mRNA") DefaultAssay(pbmc) <- "ADT" fp_prot <- FeaturePlot(pbmc, cd4_adt, reduction = "wnn.umap") + ggtitle("CD4 protein") fp_rna + fp_prot # Per-cell modality weight: where does protein 'win'? VlnPlot(pbmc, "RNA.weight", group.by = "seurat_clusters", pt.size = 0) + NoLegend() ``` > **No-download alternative.** The official Seurat WNN vignette uses the bone-marrow CITE-seq > dataset `bmcite` from *SeuratData*, which needs no manual download: > > ```r > library(SeuratData); InstallData("bmcite"); bm <- LoadData("bmcite") > # then the same Normalize -> PCA/apca -> FindMultiModalNeighbors recipe as above > ``` **What to emphasize:** coarse cell types agree across modalities; protein sharpens the fuzzy boundaries (CD4 vs CD8 T); and CD4 discordance is *biology* (protein half-life, mRNA dropout), not noise to correct away. --- # Part 2 · 10x Multiome (RNA + ATAC) with Signac **Dataset:** *PBMC, granulocytes sorted out (10k)* — the 10x Multiome dataset from the [Signac vignette](https://stuartlab.org/signac/articles/pbmc_multiomic). **Goal:** fuse expression and accessibility from the same nucleus, then read out the regulatory layer (gene activity, peak–gene links, TF motifs). ### 2.1 Download (filtered matrix + ATAC fragments + index) ```{r multi-download} base <- paste0("https://cf.10xgenomics.com/samples/cell-arc/1.0.0/", "pbmc_granulocyte_sorted_10k/") files <- c("pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5", "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz", "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi") for (f in files) if (!file.exists(f)) download.file(paste0(base, f), f, mode = "wb") ``` ### 2.2 Build RNA assay + ChromatinAssay The ATAC modality becomes a `ChromatinAssay` that knows about the fragment file and gene annotation — that link is what powers gene activity, coverage plots, and peak–gene linkage. ```{r multi-build} library(Signac) library(Seurat) library(EnsDb.Hsapiens.v86) library(BSgenome.Hsapiens.UCSC.hg38) counts <- Read10X_h5(files[1]) fragpath <- files[2] # gene annotation (hg38 / UCSC style) annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86) seqlevelsStyle(annotation) <- "UCSC" genome(annotation) <- "hg38" pbmc <- CreateSeuratObject(counts = counts$`Gene Expression`, assay = "RNA") pbmc[["ATAC"]] <- CreateChromatinAssay( counts = counts$Peaks, sep = c(":", "-"), fragments = fragpath, annotation = annotation) ``` ### 2.3 ATAC-specific QC ATAC needs its own quality metrics — **TSS enrichment**, **nucleosome signal**, fragment counts (FRiP if you have per-cell totals), and mitochondrial reads on the RNA side. ```{r multi-qc} DefaultAssay(pbmc) <- "ATAC" pbmc <- NucleosomeSignal(pbmc) pbmc <- TSSEnrichment(pbmc) pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, assay = "RNA", pattern = "^MT-") pbmc <- subset(pbmc, subset = nCount_ATAC < 1e5 & nCount_ATAC > 1000 & nCount_RNA < 25000 & nCount_RNA > 1000 & nucleosome_signal < 2 & TSS.enrichment > 1 & percent.mt < 20) ``` ### 2.4 Reduce each modality — PCA for RNA, LSI for ATAC RNA uses SCTransform + PCA. ATAC uses **TF-IDF + SVD (LSI)** because the peak matrix is near-binary and sparse — PCA on raw counts is the wrong tool. ```{r multi-reduce} # RNA DefaultAssay(pbmc) <- "RNA" pbmc <- SCTransform(pbmc) |> RunPCA() # ATAC DefaultAssay(pbmc) <- "ATAC" pbmc <- FindTopFeatures(pbmc, min.cutoff = 5) pbmc <- RunTFIDF(pbmc) pbmc <- RunSVD(pbmc) # LSI lives in the 'lsi' reduction ``` ### 2.5 Fuse with WNN ```{r multi-wnn} pbmc <- FindMultiModalNeighbors( pbmc, reduction.list = list("pca", "lsi"), dims.list = list(1:50, 2:50)) # LSI dim 1 ~ sequencing depth, so drop it pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn", reduction.name = "wnn.umap", reduction.key = "wnnUMAP_") pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, resolution = 1.0) DimPlot(pbmc, reduction = "wnn.umap", label = TRUE) + ggtitle("RNA + ATAC (WNN)") ``` ### 2.6 The regulatory layer — three different readouts ```{r multi-geneactivity} # (a) GENE ACTIVITY: accessibility near a gene as a coarse expression proxy gene.activities <- GeneActivity(pbmc) pbmc[["ACTIVITY"]] <- CreateAssayObject(counts = gene.activities) pbmc <- NormalizeData(pbmc, assay = "ACTIVITY") ``` ```{r multi-linkpeaks} # (b) PEAK-GENE LINKS: correlate distal peaks with gene expression DefaultAssay(pbmc) <- "ATAC" pbmc <- RegionStats(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38) pbmc <- LinkPeaks(pbmc, peak.assay = "ATAC", expression.assay = "SCT", genes.use = c("CD8A", "MS4A1", "LYZ")) CoveragePlot(pbmc, region = "MS4A1", features = "MS4A1", expression.assay = "SCT", extend.upstream = 5000, extend.downstream = 5000) ``` ```{r multi-motifs} # (c) TF MOTIF ACTIVITY: chromVAR over JASPAR motifs library(JASPAR2020); library(TFBSTools); library(chromVAR); library(motifmatchr) pfm <- getMatrixSet(JASPAR2020, opts = list(collection = "CORE", tax_group = "vertebrates")) pbmc <- AddMotifs(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38, pfm = pfm) pbmc <- RunChromVAR(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38) DefaultAssay(pbmc) <- "chromvar" FeaturePlot(pbmc, features = "MA0080.5", reduction = "wnn.umap") + ggtitle("SPI1 (PU.1) motif activity") ``` > **Keep the three apart:** gene activity is a *proxy* for expression, motif activity is a *TF > program*, peak–gene links are *correlations*. None of them is proof of regulation — > accessibility is regulatory **potential**. --- # Caveats & failure modes - RNA and protein do **not** always agree — discordance is signal. - ATAC accessibility is **not** regulatory proof. - Integration can **erase** real biology; batch correction can **hallucinate** similarity. - Sparse ATAC is hard to annotate; ADT panels add background and missing-marker problems. - Same-cell multi-omics reduces ambiguity — it does **not** remove experimental artifacts. # References - Seurat WNN multimodal vignette — - Signac "Joint RNA + ATAC" vignette — - Seurat v5 bridge integration (unpaired) — - Data: 10x Genomics 5k PBMC protein v3 (CITE-seq) and pbmc_granulocyte_sorted_10k (Multiome). ```{r session} sessionInfo() ```