---
title: "Single-Cell Multi-omics — R / Seurat + Signac walkthrough"
subtitle: "Companion to *From One Cell to Multiple Molecular Views* (BTEP)"
author: "Stefan Cordes, MD, PhD — NHLBI, NIH"
date: "`r format(Sys.Date())`"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: show
    theme: flatly
---

```{r setup, include=TRUE}
# Code is shown but NOT executed by default, so this document knits to HTML
# without the heavy single-cell stack or the multi-GB datasets installed.
# To actually run a section: install the packages, download the data, and set
# eval = TRUE (globally here, or per-chunk).
knitr::opts_chunk$set(eval = FALSE, message = FALSE, warning = FALSE,
                      fig.width = 6, fig.height = 4)
```

# About this notebook

This is the R companion to the Python / scverse notebook (`multiomics_scverse.ipynb`).
It walks through the two core **paired** multi-omics workflows on the **same real 10x
Genomics PBMC datasets**, using the R ecosystem so you can compare directly:

| Part | Example | Modalities | Engine |
|------|---------|------------|--------|
| 1 | CITE-seq | RNA + surface protein (ADT) | **Seurat**, Weighted Nearest Neighbor (WNN) |
| 2 | 10x Multiome | RNA + ATAC | **Seurat + Signac**, WNN + regulatory analysis |

> **Teaching reference, not a turnkey pipeline.** The chunks call the *real* APIs on
> *real* downloaded data. They need the full stack installed and the datasets in your
> working directory, and some steps (SCTransform, chromVAR) are slow. Treat this as the
> canonical *shape* of each workflow.

## Install (run once)

```{r install}
install.packages(c("Seurat", "Signac", "remotes", "ggplot2", "dplyr", "patchwork"))

# Bioconductor genome / annotation / motif packages used in Part 2
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c(
  "EnsDb.Hsapiens.v86", "BSgenome.Hsapiens.UCSC.hg38",
  "JASPAR2020", "TFBSTools", "chromVAR", "motifmatchr"
))

# (optional) SeuratData provides the ready-made CITE-seq teaching dataset 'bmcite'
remotes::install_github("satijalab/seurat-data")
```

---

# Part 1 · CITE-seq (RNA + surface protein)

**Dataset:** *5k PBMCs from a healthy donor with TotalSeq-B antibodies (v3)* — 10x Genomics
(the same file the Python notebook uses). **Goal:** compare RNA-only, protein-only, and the
**WNN** joint representation, and surface RNA/protein *discordance*.

### 1.1 Download

```{r cite-download}
cite_h5 <- "5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5"
if (!file.exists(cite_h5)) {
  download.file(
    paste0("https://cf.10xgenomics.com/samples/cell-exp/3.1.0/",
           "5k_pbmc_protein_v3/", cite_h5),
    destfile = cite_h5, mode = "wb")
}
```

### 1.2 Build a Seurat object with two assays

`Read10X_h5` returns a list with `Gene Expression` and `Antibody Capture`. We put RNA in the
default assay and the ADTs in a second `ADT` assay on the same cells.

```{r cite-build}
library(Seurat)
library(patchwork)

data <- Read10X_h5(cite_h5)
pbmc <- CreateSeuratObject(counts = data[["Gene Expression"]])
pbmc[["ADT"]] <- CreateAssayObject(counts = data[["Antibody Capture"]])

# basic RNA QC
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 4500 & percent.mt < 20)
Assays(pbmc)        # "RNA"  "ADT"
```

### 1.3 Normalize each modality on its own terms

RNA gets log-normalization; ADT gets **CLR** (centered log-ratio, `margin = 2` = across cells).
This is the single most important "different noise models" point from the talk.

```{r cite-normalize}
# --- RNA ---
DefaultAssay(pbmc) <- "RNA"
pbmc <- NormalizeData(pbmc) |>
        FindVariableFeatures() |>
        ScaleData() |>
        RunPCA()

# --- ADT (protein) ---
DefaultAssay(pbmc) <- "ADT"
VariableFeatures(pbmc) <- rownames(pbmc[["ADT"]])     # use the whole (small) panel
pbmc <- NormalizeData(pbmc, normalization.method = "CLR", margin = 2) |>
        ScaleData() |>
        RunPCA(reduction.name = "apca")               # a separate PCA for protein
```

### 1.4 Fuse with Weighted Nearest Neighbor (WNN)

`FindMultiModalNeighbors` learns a **per-cell weight** that decides how much each cell should
trust RNA vs protein, then builds one integrated graph.

```{r cite-wnn}
pbmc <- FindMultiModalNeighbors(
  pbmc, reduction.list = list("pca", "apca"),
  dims.list = list(1:30, 1:18), modality.weight.name = "RNA.weight")

pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn",
                reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, resolution = 1.0)
```

### 1.5 Compare the three views side by side

```{r cite-compare}
# single-modality UMAPs for comparison
pbmc <- RunUMAP(pbmc, reduction = "pca",  dims = 1:30,
                reduction.name = "rna.umap", reduction.key = "rnaUMAP_")
pbmc <- RunUMAP(pbmc, reduction = "apca", dims = 1:18,
                reduction.name = "adt.umap", reduction.key = "adtUMAP_")

p1 <- DimPlot(pbmc, reduction = "rna.umap", label = TRUE) + ggtitle("RNA only")
p2 <- DimPlot(pbmc, reduction = "adt.umap", label = TRUE) + ggtitle("Protein (ADT) only")
p3 <- DimPlot(pbmc, reduction = "wnn.umap", label = TRUE) + ggtitle("WNN (joint)")
p1 + p2 + p3
```

### 1.6 RNA vs protein discordance

`CD4` is the classic example: surface protein is clearly present while the transcript
frequently drops out. Put the transcript and the ADT next to each other.

```{r cite-discordance}
# ADT feature names vary by panel/version — check them first, then pick the CD4 tag:
rownames(pbmc[["ADT"]])
cd4_adt <- grep("CD4", rownames(pbmc[["ADT"]]), value = TRUE)[1]   # e.g. "CD4" or "CD4.1"

DefaultAssay(pbmc) <- "RNA"
fp_rna  <- FeaturePlot(pbmc, "CD4", reduction = "wnn.umap") + ggtitle("CD4 mRNA")
DefaultAssay(pbmc) <- "ADT"
fp_prot <- FeaturePlot(pbmc, cd4_adt, reduction = "wnn.umap") + ggtitle("CD4 protein")
fp_rna + fp_prot

# Per-cell modality weight: where does protein 'win'?
VlnPlot(pbmc, "RNA.weight", group.by = "seurat_clusters", pt.size = 0) + NoLegend()
```

> **No-download alternative.** The official Seurat WNN vignette uses the bone-marrow CITE-seq
> dataset `bmcite` from *SeuratData*, which needs no manual download:
>
> ```r
> library(SeuratData); InstallData("bmcite"); bm <- LoadData("bmcite")
> # then the same Normalize -> PCA/apca -> FindMultiModalNeighbors recipe as above
> ```

**What to emphasize:** coarse cell types agree across modalities; protein sharpens the fuzzy
boundaries (CD4 vs CD8 T); and CD4 discordance is *biology* (protein half-life, mRNA dropout),
not noise to correct away.

---

# Part 2 · 10x Multiome (RNA + ATAC) with Signac

**Dataset:** *PBMC, granulocytes sorted out (10k)* — the 10x Multiome dataset from the
[Signac vignette](https://stuartlab.org/signac/articles/pbmc_multiomic). **Goal:** fuse
expression and accessibility from the same nucleus, then read out the regulatory layer
(gene activity, peak–gene links, TF motifs).

### 2.1 Download (filtered matrix + ATAC fragments + index)

```{r multi-download}
base <- paste0("https://cf.10xgenomics.com/samples/cell-arc/1.0.0/",
               "pbmc_granulocyte_sorted_10k/")
files <- c("pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5",
           "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz",
           "pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi")
for (f in files) if (!file.exists(f)) download.file(paste0(base, f), f, mode = "wb")
```

### 2.2 Build RNA assay + ChromatinAssay

The ATAC modality becomes a `ChromatinAssay` that knows about the fragment file and gene
annotation — that link is what powers gene activity, coverage plots, and peak–gene linkage.

```{r multi-build}
library(Signac)
library(Seurat)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38)

counts   <- Read10X_h5(files[1])
fragpath <- files[2]

# gene annotation (hg38 / UCSC style)
annotation <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
seqlevelsStyle(annotation) <- "UCSC"
genome(annotation) <- "hg38"

pbmc <- CreateSeuratObject(counts = counts$`Gene Expression`, assay = "RNA")
pbmc[["ATAC"]] <- CreateChromatinAssay(
  counts = counts$Peaks, sep = c(":", "-"),
  fragments = fragpath, annotation = annotation)
```

### 2.3 ATAC-specific QC

ATAC needs its own quality metrics — **TSS enrichment**, **nucleosome signal**, fragment
counts (FRiP if you have per-cell totals), and mitochondrial reads on the RNA side.

```{r multi-qc}
DefaultAssay(pbmc) <- "ATAC"
pbmc <- NucleosomeSignal(pbmc)
pbmc <- TSSEnrichment(pbmc)
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, assay = "RNA", pattern = "^MT-")

pbmc <- subset(pbmc, subset =
  nCount_ATAC < 1e5 & nCount_ATAC > 1000 &
  nCount_RNA  < 25000 & nCount_RNA  > 1000 &
  nucleosome_signal < 2 & TSS.enrichment > 1 & percent.mt < 20)
```

### 2.4 Reduce each modality — PCA for RNA, LSI for ATAC

RNA uses SCTransform + PCA. ATAC uses **TF-IDF + SVD (LSI)** because the peak matrix is
near-binary and sparse — PCA on raw counts is the wrong tool.

```{r multi-reduce}
# RNA
DefaultAssay(pbmc) <- "RNA"
pbmc <- SCTransform(pbmc) |> RunPCA()

# ATAC
DefaultAssay(pbmc) <- "ATAC"
pbmc <- FindTopFeatures(pbmc, min.cutoff = 5)
pbmc <- RunTFIDF(pbmc)
pbmc <- RunSVD(pbmc)                 # LSI lives in the 'lsi' reduction
```

### 2.5 Fuse with WNN

```{r multi-wnn}
pbmc <- FindMultiModalNeighbors(
  pbmc, reduction.list = list("pca", "lsi"),
  dims.list = list(1:50, 2:50))      # LSI dim 1 ~ sequencing depth, so drop it

pbmc <- RunUMAP(pbmc, nn.name = "weighted.nn",
                reduction.name = "wnn.umap", reduction.key = "wnnUMAP_")
pbmc <- FindClusters(pbmc, graph.name = "wsnn", algorithm = 3, resolution = 1.0)
DimPlot(pbmc, reduction = "wnn.umap", label = TRUE) + ggtitle("RNA + ATAC (WNN)")
```

### 2.6 The regulatory layer — three different readouts

```{r multi-geneactivity}
# (a) GENE ACTIVITY: accessibility near a gene as a coarse expression proxy
gene.activities <- GeneActivity(pbmc)
pbmc[["ACTIVITY"]] <- CreateAssayObject(counts = gene.activities)
pbmc <- NormalizeData(pbmc, assay = "ACTIVITY")
```

```{r multi-linkpeaks}
# (b) PEAK-GENE LINKS: correlate distal peaks with gene expression
DefaultAssay(pbmc) <- "ATAC"
pbmc <- RegionStats(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38)
pbmc <- LinkPeaks(pbmc, peak.assay = "ATAC", expression.assay = "SCT",
                  genes.use = c("CD8A", "MS4A1", "LYZ"))
CoveragePlot(pbmc, region = "MS4A1", features = "MS4A1",
             expression.assay = "SCT", extend.upstream = 5000, extend.downstream = 5000)
```

```{r multi-motifs}
# (c) TF MOTIF ACTIVITY: chromVAR over JASPAR motifs
library(JASPAR2020); library(TFBSTools); library(chromVAR); library(motifmatchr)
pfm <- getMatrixSet(JASPAR2020, opts = list(collection = "CORE", tax_group = "vertebrates"))
pbmc <- AddMotifs(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38, pfm = pfm)
pbmc <- RunChromVAR(pbmc, genome = BSgenome.Hsapiens.UCSC.hg38)
DefaultAssay(pbmc) <- "chromvar"
FeaturePlot(pbmc, features = "MA0080.5", reduction = "wnn.umap") + ggtitle("SPI1 (PU.1) motif activity")
```

> **Keep the three apart:** gene activity is a *proxy* for expression, motif activity is a *TF
> program*, peak–gene links are *correlations*. None of them is proof of regulation —
> accessibility is regulatory **potential**.

---

# Caveats & failure modes

- RNA and protein do **not** always agree — discordance is signal.
- ATAC accessibility is **not** regulatory proof.
- Integration can **erase** real biology; batch correction can **hallucinate** similarity.
- Sparse ATAC is hard to annotate; ADT panels add background and missing-marker problems.
- Same-cell multi-omics reduces ambiguity — it does **not** remove experimental artifacts.

# References

- Seurat WNN multimodal vignette — <https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis>
- Signac "Joint RNA + ATAC" vignette — <https://stuartlab.org/signac/articles/pbmc_multiomic>
- Seurat v5 bridge integration (unpaired) — <https://satijalab.org/seurat/articles/seurat5_integration_bridge>
- Data: 10x Genomics 5k PBMC protein v3 (CITE-seq) and pbmc_granulocyte_sorted_10k (Multiome).

```{r session}
sessionInfo()
```