{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "96b9b868",
   "metadata": {},
   "source": [
    "# Single-Cell Multi-omics — Python / scverse walkthrough\n",
    "\n",
    "**Companion notebook for:** *From One Cell to Multiple Molecular Views* (BTEP).\n",
    "**Pairs with:** the R / Seurat + Signac notebook (`multiomics_seurat_signac.Rmd`) — same two examples, same datasets, so you can compare ecosystems.\n",
    "\n",
    "This notebook illustrates the two core **paired** multi-omics workflows on **real 10x Genomics PBMC data** using the [scverse](https://scverse.org) stack:\n",
    "\n",
    "| Part | Example | Modalities | Joint model |\n",
    "|------|---------|------------|-------------|\n",
    "| 1 | CITE-seq | RNA + surface protein (ADT) | `scvi-tools` **totalVI** |\n",
    "| 2 | 10x Multiome | RNA + ATAC | `scvi-tools` **MultiVI** |\n",
    "\n",
    "> **Read me first — this is a teaching reference, not a turnkey pipeline.**\n",
    "> - The cells call the *real* APIs on *real* downloaded data, so running them needs the full stack installed and the datasets downloaded (hundreds of MB each), and model training is much faster on a GPU.\n",
    "> - Treat it as the canonical shape of each workflow. Exact keyword arguments drift between `scvi-tools` releases — when in doubt, follow the linked official tutorial for your installed version.\n",
    "> - Each part is self-contained; run Part 1 or Part 2 independently.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cae34c51",
   "metadata": {},
   "source": [
    "## 0 · Environment\n",
    "\n",
    "Create a fresh environment and install the scverse multi-omics stack. `scvi-tools` pulls in PyTorch.\n",
    "\n",
    "```bash\n",
    "# conda/mamba recommended\n",
    "mamba create -n scverse python=3.11 -y && mamba activate scverse\n",
    "pip install \"scanpy[leiden]\" muon mudata scvi-tools scikit-misc\n",
    "# (optional, ATAC) pip install snapatac2\n",
    "```\n",
    "\n",
    "Versions this notebook was written against: `scanpy>=1.10`, `muon>=0.1.6`, `mudata>=0.3`, `scvi-tools>=1.1`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7984ff0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# If running in a fresh kernel, uncomment to install:\n",
    "# !pip install \"scanpy[leiden]\" muon mudata scvi-tools scikit-misc\n",
    "\n",
    "import numpy as np\n",
    "import scanpy as sc\n",
    "import muon as mu\n",
    "import mudata as md\n",
    "import scvi\n",
    "\n",
    "sc.settings.verbosity = 1\n",
    "scvi.settings.seed = 0\n",
    "print(\"scanpy\", sc.__version__, \"| muon\", mu.__version__,\n",
    "      \"| mudata\", md.__version__, \"| scvi-tools\", scvi.__version__)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b78ea243",
   "metadata": {},
   "source": [
    "---\n",
    "# Part 1 · CITE-seq (RNA + surface protein)\n",
    "\n",
    "**Dataset:** *5k Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor with a panel of TotalSeq-B antibodies (v3 chemistry)* — 10x Genomics.\n",
    "This is the dataset used by the [muon CITE-seq tutorial](https://muon-tutorials.readthedocs.io/en/latest/cite-seq/1-CITE-seq-PBMC-5k.html).\n",
    "\n",
    "**Teaching goal:** build cell states from RNA alone, then from the joint RNA+protein model (totalVI), and see where protein sharpens or *disagrees with* the transcriptome.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "481ba6aa",
   "metadata": {},
   "source": [
    "### 1.1 Download the filtered feature-barcode matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f57ef586",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, urllib.request\n",
    "\n",
    "CITE_H5 = \"5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5\"\n",
    "CITE_URL = (\"https://cf.10xgenomics.com/samples/cell-exp/3.1.0/\"\n",
    "            \"5k_pbmc_protein_v3/5k_pbmc_protein_v3_filtered_feature_bc_matrix.h5\")\n",
    "\n",
    "if not os.path.exists(CITE_H5):\n",
    "    print(\"downloading ~25 MB ...\")\n",
    "    urllib.request.urlretrieve(CITE_URL, CITE_H5)\n",
    "print(\"ready:\", CITE_H5)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "067b9f11",
   "metadata": {},
   "source": [
    "### 1.2 Load into a `MuData` object\n",
    "\n",
    "`mu.read_10x_h5` reads the multimodal h5 and splits the *Gene Expression* and *Antibody Capture* features into two AnnData modalities (`rna`, `prot`) sharing the same cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2435444a",
   "metadata": {},
   "outputs": [],
   "source": [
    "mdata = mu.read_10x_h5(CITE_H5)\n",
    "mdata.var_names_make_unique()\n",
    "print(mdata)\n",
    "\n",
    "rna  = mdata.mod[\"rna\"]\n",
    "prot = mdata.mod[\"prot\"]\n",
    "print(\"\\nRNA :\", rna.shape, \"| Protein (ADT):\", prot.shape)\n",
    "print(\"ADT panel:\", list(prot.var_names[:10]), \"...\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fa3f1ec",
   "metadata": {},
   "source": [
    "### 1.3 QC and basic preprocessing\n",
    "\n",
    "Standard scanpy QC on RNA. We keep a raw **counts** layer because totalVI models counts directly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51e70813",
   "metadata": {},
   "outputs": [],
   "source": [
    "# --- RNA QC ---\n",
    "rna.var[\"mt\"] = rna.var_names.str.startswith(\"MT-\")\n",
    "sc.pp.calculate_qc_metrics(rna, qc_vars=[\"mt\"], inplace=True, percent_top=None)\n",
    "sc.pp.filter_genes(rna, min_cells=3)\n",
    "\n",
    "# filter low-quality / dying cells (tune thresholds to your data)\n",
    "mu.pp.filter_obs(rna, \"n_genes_by_counts\", lambda x: (x >= 200) & (x < 4500))\n",
    "mu.pp.filter_obs(rna, \"pct_counts_mt\", lambda x: x < 20)\n",
    "\n",
    "# keep raw counts for the generative model, then log-normalize .X in place for plotting\n",
    "rna.layers[\"counts\"] = rna.X.copy()\n",
    "sc.pp.normalize_total(rna, target_sum=1e4)\n",
    "sc.pp.log1p(rna)\n",
    "rna.raw = rna\n",
    "\n",
    "# highly variable genes (totalVI is typically trained on HVGs)\n",
    "sc.pp.highly_variable_genes(rna, n_top_genes=4000, flavor=\"seurat_v3\",\n",
    "                            layer=\"counts\", subset=False)\n",
    "print(rna)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5ba1600",
   "metadata": {},
   "source": [
    "### 1.4 A note on the protein (ADT) modality\n",
    "\n",
    "ADT counts are *signal on top of an ambient antibody background*. Two common choices:\n",
    "- **CLR** (centered log-ratio): the classic normalization, also what Seurat uses (`NormalizeData(normalization.method = \"CLR\")`).\n",
    "- **Model the background** — totalVI does this for you, returning *denoised* protein estimates. We pass raw ADT counts to totalVI below; the CLR version here is only for quick exploratory plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8819804f",
   "metadata": {},
   "outputs": [],
   "source": [
    "prot.layers[\"counts\"] = prot.X.copy()\n",
    "# CLR across cells, just for exploratory visualization\n",
    "mu.prot.pp.clr(prot)              # writes CLR-normalized values into prot.X\n",
    "print(\"ADT after CLR:\", prot.X[:3, :3].toarray() if hasattr(prot.X, \"toarray\") else prot.X[:3, :3])\n",
    "# RNA QC removed cells from `rna` only: keep the cells present in BOTH modalities,\n",
    "# otherwise the joint model below sees mismatched cell sets\n",
    "mu.pp.intersect_obs(mdata)\n",
    "mdata.update()                    # keep MuData's shared obs in sync after filtering\n"
   ]
  },
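  {
   "cell_type": "markdown",
   "id": "b3d1c0a1",
   "metadata": {},
   "source": [
    "To make the CLR idea concrete, here is a minimal NumPy sketch of one common variant, `log1p(x / exp(mean(log1p(x))))`, computed per ADT across cells. It runs on a made-up matrix and is an illustration only: `toy_adt` and `clr_per_feature` are hypothetical names, and the exact formula and axis used by `mu.prot.pp.clr` may differ between versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3d1c0a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def clr_per_feature(counts):\n",
    "    # CLR variant: log1p(x / exp(mean(log1p(x)))), one ADT (column) at a time\n",
    "    counts = np.asarray(counts, dtype=float)\n",
    "    # log1p-based geometric mean of each ADT over all cells (rows)\n",
    "    geo_mean = np.exp(np.log1p(counts).mean(axis=0, keepdims=True))\n",
    "    return np.log1p(counts / geo_mean)\n",
    "\n",
    "# toy matrix: 4 cells x 2 ADTs (made-up counts, not from the dataset)\n",
    "toy_adt = np.array([[0, 10], [5, 200], [50, 20], [500, 15]])\n",
    "toy_clr = clr_per_feature(toy_adt)\n",
    "print(np.round(toy_clr, 2))\n"
   ]
  },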
  {
   "cell_type": "markdown",
   "id": "f47d81b8",
   "metadata": {},
   "source": [
    "### 1.5 RNA-only baseline\n",
    "\n",
    "The 'single-modality' answer we want to improve on: cluster from RNA alone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7da89e4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "rna_only = rna[:, rna.var.highly_variable].copy()\n",
    "sc.pp.scale(rna_only, max_value=10)\n",
    "sc.tl.pca(rna_only, n_comps=30)\n",
    "sc.pp.neighbors(rna_only, n_neighbors=15)\n",
    "sc.tl.umap(rna_only)\n",
    "sc.tl.leiden(rna_only, resolution=1.0, key_added=\"rna_clusters\")\n",
    "rna.obs[\"rna_clusters\"] = rna_only.obs[\"rna_clusters\"].values\n",
    "sc.pl.umap(rna_only, color=\"rna_clusters\", title=\"RNA-only clusters\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cdfa656",
   "metadata": {},
   "source": [
    "### 1.6 Joint RNA + protein with totalVI\n",
    "\n",
    "`totalVI` learns one latent representation from RNA counts **and** ADT counts, while modeling protein background. Reference: [scvi-tools totalVI tutorial](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/multimodal/totalVI.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37cc2cfc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# subset RNA to HVGs for training (totalVI convention)\n",
    "mdata.mod[\"rna\"] = rna[:, rna.var.highly_variable].copy()\n",
    "mdata.mod[\"rna\"].X = mdata.mod[\"rna\"].layers[\"counts\"].copy()   # totalVI wants counts\n",
    "mdata.mod[\"prot\"].X = mdata.mod[\"prot\"].layers[\"counts\"].copy()\n",
    "mdata.update()\n",
    "\n",
    "scvi.model.TOTALVI.setup_mudata(\n",
    "    mdata,\n",
    "    rna_layer=None,           # use .X of the rna modality (raw counts)\n",
    "    protein_layer=None,       # use .X of the prot modality (raw counts)\n",
    "    modalities={\"rna_layer\": \"rna\", \"protein_layer\": \"prot\"},\n",
    ")\n",
    "\n",
    "model = scvi.model.TOTALVI(mdata)\n",
    "model.train(max_epochs=200)    # use a GPU if available; CPU is slow\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "341f66d7",
   "metadata": {},
   "outputs": [],
   "source": [
    "# joint latent representation -> neighbors -> UMAP -> clusters\n",
    "TOTALVI_LATENT = \"X_totalVI\"\n",
    "mdata.obsm[TOTALVI_LATENT] = model.get_latent_representation()\n",
    "\n",
    "# attach to the rna modality for convenient scanpy plotting\n",
    "rna_hvg = mdata.mod[\"rna\"]\n",
    "rna_hvg.obsm[TOTALVI_LATENT] = mdata.obsm[TOTALVI_LATENT]\n",
    "sc.pp.neighbors(rna_hvg, use_rep=TOTALVI_LATENT)\n",
    "sc.tl.umap(rna_hvg)\n",
    "sc.tl.leiden(rna_hvg, resolution=1.0, key_added=\"joint_clusters\")\n",
    "sc.pl.umap(rna_hvg, color=\"joint_clusters\", title=\"totalVI joint clusters\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cab7ee92",
   "metadata": {},
   "source": [
    "### 1.7 RNA vs protein — agreement and discordance\n",
    "\n",
    "totalVI returns **denoised** protein. Compare a surface marker's denoised protein level against its transcript — `CD4` is the classic case where surface protein is clearly present while the mRNA frequently drops out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5b0eefb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# denoised RNA and protein from the model\n",
    "denoised_rna, denoised_prot = model.get_normalized_expression(\n",
    "    n_samples=25, return_mean=True\n",
    ")\n",
    "# denoised_prot is a DataFrame: cells x ADT panel\n",
    "print(\"ADT panel:\", list(denoised_prot.columns))\n",
    "\n",
    "# overlay a few canonical markers on the joint UMAP\n",
    "for adt in [c for c in [\"CD4_TotalSeqB\", \"CD8a_TotalSeqB\", \"CD19_TotalSeqB\"]\n",
    "            if c in denoised_prot.columns]:\n",
    "    rna_hvg.obs[f\"prot_{adt}\"] = denoised_prot[adt].values\n",
    "sc.pl.umap(rna_hvg, color=[c for c in rna_hvg.obs.columns if c.startswith(\"prot_\")],\n",
    "           ncols=3, title=None)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e3b0f60",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Direct RNA-vs-protein discordance for CD4 (transcript vs surface protein)\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def discordance(gene, adt):\n",
    "    if gene not in rna.var_names or adt not in denoised_prot.columns:\n",
    "        print(\"skip\", gene, adt); return\n",
    "    # align cells by name (no positional truncation); .X may be sparse or dense\n",
    "    x = rna[mdata.mod[\"rna\"].obs_names, gene].X\n",
    "    x = np.asarray(x.toarray() if hasattr(x, \"toarray\") else x).ravel()\n",
    "    y = denoised_prot[adt].values\n",
    "    plt.figure(figsize=(4, 4))\n",
    "    plt.scatter(x, y, s=3, alpha=0.2)\n",
    "    plt.xlabel(f\"{gene} mRNA (log-norm)\"); plt.ylabel(f\"{adt} protein (denoised)\")\n",
    "    plt.title(f\"{gene}: transcript vs surface protein\"); plt.show()\n",
    "\n",
    "discordance(\"CD4\", \"CD4_TotalSeqB\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95d954e0",
   "metadata": {},
   "source": [
    "**What to point out to the audience**\n",
    "\n",
    "- Coarse cell types (T / B / NK / myeloid) agree between the RNA-only and joint clusterings.\n",
    "- Protein **sharpens** boundaries RNA leaves fuzzy (e.g. CD4 vs CD8 T cells).\n",
    "- `CD4` shows the canonical **discordance**: abundant surface protein, sparse transcript. That reflects protein stability and limited mRNA capture, not an error to 'correct'.\n"
   ]
  },
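  {
   "cell_type": "markdown",
   "id": "b3d1c0b1",
   "metadata": {},
   "source": [
    "One hedged way to put numbers on the first two points is to cross-tabulate the two clusterings and compute an adjusted Rand index. The sketch below uses made-up labels (`toy_rna_labels` and `toy_joint_labels` are hypothetical); on the real data you would pass `rna_hvg.obs[\"rna_clusters\"]` and `rna_hvg.obs[\"joint_clusters\"]` instead, assuming both columns exist after running the cells above in order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3d1c0b2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.metrics import adjusted_rand_score\n",
    "\n",
    "# made-up labels for nine cells (hypothetical, not from the dataset)\n",
    "toy_rna_labels = pd.Series(list(\"AAABBBCCC\"), name=\"rna_clusters\")\n",
    "toy_joint_labels = pd.Series(list(\"AABBBBCCD\"), name=\"joint_clusters\")\n",
    "\n",
    "# rows: RNA-only clusters, columns: joint clusters; a row spread over\n",
    "# several columns is a cluster that the joint model splits\n",
    "toy_table = pd.crosstab(toy_rna_labels, toy_joint_labels)\n",
    "# adjusted Rand index: 1 = identical partitions, near 0 = chance-level agreement\n",
    "toy_ari = adjusted_rand_score(toy_rna_labels, toy_joint_labels)\n",
    "print(toy_table)\n",
    "print(\"ARI:\", round(toy_ari, 3))\n"
   ]
  },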
  {
   "cell_type": "markdown",
   "id": "8e4c3ad2",
   "metadata": {},
   "source": [
    "---\n",
    "# Part 2 · 10x Multiome (RNA + ATAC)\n",
    "\n",
    "**Dataset:** *PBMC from a healthy donor — granulocytes removed through cell sorting (10k)*, the 10x Genomics Multiome dataset used by the [Signac vignette](https://stuartlab.org/signac/articles/pbmc_multiomic).\n",
    "\n",
    "**Teaching goal:** one nucleus, two readouts — expressed state (RNA) and accessible regulatory elements (ATAC) — fused into a single representation with MultiVI.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af66086b",
   "metadata": {},
   "source": [
    "### 2.1 Download the multiome data (filtered matrix + ATAC fragments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "625dbc9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, urllib.request   # repeated here so Part 2 runs without Part 1\n",
    "\n",
    "MULTI_DIR = \".\"\n",
    "files = {\n",
    " \"pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5\":\n",
    "   \"https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5\",\n",
    " \"pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz\":\n",
    "   \"https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz\",\n",
    " \"pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi\":\n",
    "   \"https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_atac_fragments.tsv.gz.tbi\",\n",
    "}\n",
    "for fn, url in files.items():\n",
    "    if not os.path.exists(fn):\n",
    "        print(\"downloading\", fn, \"...\")\n",
    "        urllib.request.urlretrieve(url, fn)\n",
    "print(\"multiome files ready\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac5a214d",
   "metadata": {},
   "source": [
    "### 2.2 Load and split into RNA + ATAC modalities\n",
    "\n",
    "`mu.read_10x_h5` again returns a `MuData`; here the modalities are `rna` and `atac` (peaks)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dcca5930",
   "metadata": {},
   "outputs": [],
   "source": [
    "MULTI_H5 = \"pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5\"\n",
    "mdata2 = mu.read_10x_h5(MULTI_H5)\n",
    "mdata2.var_names_make_unique()\n",
    "print(mdata2)\n",
    "\n",
    "rna2  = mdata2.mod[\"rna\"]\n",
    "atac  = mdata2.mod[\"atac\"]\n",
    "print(\"\\nRNA :\", rna2.shape, \"| ATAC peaks:\", atac.shape)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c124fea9",
   "metadata": {},
   "source": [
    "### 2.3 Preprocess RNA (scanpy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2294185c",
   "metadata": {},
   "outputs": [],
   "source": [
    "rna2.var[\"mt\"] = rna2.var_names.str.startswith(\"MT-\")\n",
    "sc.pp.calculate_qc_metrics(rna2, qc_vars=[\"mt\"], inplace=True, percent_top=None)\n",
    "sc.pp.filter_genes(rna2, min_cells=3)\n",
    "mu.pp.filter_obs(rna2, \"n_genes_by_counts\", lambda x: (x >= 500) & (x < 6000))\n",
    "mu.pp.filter_obs(rna2, \"pct_counts_mt\", lambda x: x < 20)\n",
    "\n",
    "rna2.layers[\"counts\"] = rna2.X.copy()\n",
    "sc.pp.normalize_total(rna2, target_sum=1e4); sc.pp.log1p(rna2)\n",
    "sc.pp.highly_variable_genes(rna2, n_top_genes=4000, flavor=\"seurat_v3\",\n",
    "                            layer=\"counts\", subset=False)\n",
    "print(rna2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4699bae",
   "metadata": {},
   "source": [
    "### 2.4 Preprocess ATAC — TF-IDF + LSI\n",
    "\n",
    "ATAC is near-binary and extremely sparse, so we use **TF-IDF** weighting and **LSI** (latent semantic indexing = SVD on the TF-IDF matrix) rather than PCA on raw counts. muon's ATAC module provides both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df6d952c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import muon.atac as ac\n",
    "\n",
    "# basic peak QC\n",
    "sc.pp.calculate_qc_metrics(atac, inplace=True, percent_top=None)\n",
    "sc.pp.filter_genes(atac, min_cells=int(0.01 * atac.n_obs))   # peaks in >=1% of cells\n",
    "mu.pp.filter_obs(atac, \"total_counts\", lambda x: (x >= 1000) & (x < 60000))\n",
    "\n",
    "atac.layers[\"counts\"] = atac.X.copy()\n",
    "ac.pp.tfidf(atac, scale_factor=1e4)\n",
    "ac.tl.lsi(atac, n_comps=50)\n",
    "# LSI component 1 usually correlates with depth -> drop it\n",
    "atac.obsm[\"X_lsi\"]      = atac.obsm[\"X_lsi\"][:, 1:]\n",
    "atac.varm[\"LSI\"]        = atac.varm[\"LSI\"][:, 1:]\n",
    "atac.uns[\"lsi\"][\"stdev\"] = atac.uns[\"lsi\"][\"stdev\"][1:]\n",
    "print(atac)\n"
   ]
  },
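  {
   "cell_type": "markdown",
   "id": "b3d1c0c1",
   "metadata": {},
   "source": [
    "The TF-IDF + LSI idea can be sketched on a toy matrix with NumPy alone. This illustrates one common TF-IDF variant (`log1p(TF * IDF * scale_factor)`) followed by an SVD truncated to a few components; it is not muon's exact implementation. `toy_peaks` and `tfidf_lsi` are hypothetical names, and the scale factor simply mirrors the `1e4` used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3d1c0c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def tfidf_lsi(counts, n_comps=3, scale_factor=1e4):\n",
    "    counts = np.asarray(counts, dtype=float)\n",
    "    # term frequency: each cell's peak counts divided by that cell's total\n",
    "    tf = counts / counts.sum(axis=1, keepdims=True)\n",
    "    # inverse document frequency: number of cells / total counts per peak\n",
    "    idf = counts.shape[0] / counts.sum(axis=0, keepdims=True)\n",
    "    weighted = np.log1p(tf * idf * scale_factor)\n",
    "    # LSI: SVD of the TF-IDF matrix; U * S gives the cell embeddings\n",
    "    u, s, _ = np.linalg.svd(weighted, full_matrices=False)\n",
    "    return u[:, :n_comps] * s[:n_comps], s[:n_comps]\n",
    "\n",
    "toy_rng = np.random.default_rng(0)\n",
    "toy_peaks = toy_rng.binomial(1, 0.2, size=(20, 40))  # 20 cells x 40 peaks\n",
    "toy_peaks[:, 0] = 1   # every cell has at least one count\n",
    "toy_peaks[0, :] = 1   # every peak has at least one count\n",
    "toy_lsi, toy_sv = tfidf_lsi(toy_peaks)\n",
    "print(toy_lsi.shape, np.round(toy_sv, 2))\n"
   ]
  },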
  {
   "cell_type": "markdown",
   "id": "94c0d4b4",
   "metadata": {},
   "source": [
    "### 2.5 Joint RNA + ATAC with MultiVI\n",
    "\n",
    "MultiVI learns one latent space for paired RNA + ATAC (and gracefully handles cells missing a modality). Reference: [scvi-tools MultiVI tutorial](https://docs.scvi-tools.org/en/stable/tutorials/notebooks/multimodal/MultiVI_tutorial.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc9f5d99",
   "metadata": {},
   "outputs": [],
   "source": [
    "# align cells present in BOTH modalities and feed raw counts\n",
    "common = rna2.obs_names.intersection(atac.obs_names)\n",
    "mdata2.mod[\"rna\"]  = rna2[common, rna2.var.highly_variable].copy()\n",
    "mdata2.mod[\"atac\"] = atac[common].copy()\n",
    "mdata2.mod[\"rna\"].X  = mdata2.mod[\"rna\"].layers[\"counts\"].copy()\n",
    "mdata2.mod[\"atac\"].X = mdata2.mod[\"atac\"].layers[\"counts\"].copy()\n",
    "mdata2.update()\n",
    "\n",
    "# NOTE: MULTIVI.setup_mudata may be missing in older scvi-tools releases; if so,\n",
    "# follow the MultiVI tutorial for your installed version instead\n",
    "scvi.model.MULTIVI.setup_mudata(\n",
    "    mdata2,\n",
    "    modalities={\"rna_layer\": \"rna\", \"atac_layer\": \"atac\"},\n",
    ")\n",
    "mvi = scvi.model.MULTIVI(\n",
    "    mdata2,\n",
    "    n_genes=(mdata2.mod[\"rna\"].var.shape[0]),\n",
    "    n_regions=(mdata2.mod[\"atac\"].var.shape[0]),\n",
    ")\n",
    "mvi.train()      # GPU strongly recommended\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a0cdd6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "mdata2.obsm[\"X_multiVI\"] = mvi.get_latent_representation()\n",
    "joint = mdata2.mod[\"rna\"]\n",
    "joint.obsm[\"X_multiVI\"] = mdata2.obsm[\"X_multiVI\"]\n",
    "sc.pp.neighbors(joint, use_rep=\"X_multiVI\")\n",
    "sc.tl.umap(joint)\n",
    "sc.tl.leiden(joint, resolution=1.0, key_added=\"multiome_clusters\")\n",
    "sc.pl.umap(joint, color=\"multiome_clusters\", title=\"MultiVI joint clusters (RNA+ATAC)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24d15dc0",
   "metadata": {},
   "source": [
    "### 2.6 Interpreting the regulatory layer\n",
    "\n",
    "Three ATAC readouts — and they are **not** the same thing:\n",
    "\n",
    "| Readout | What it is | Tooling in Python |\n",
    "|---------|------------|-------------------|\n",
    "| **Gene activity** | accessibility summed near a gene, a coarse expression proxy | `snapatac2`, or compute from fragments |\n",
    "| **Motif / TF activity** | which TF motifs are enriched in open regions | `chromVAR`-style; `snapatac2.tl`, `pycisTopic` |\n",
    "| **Peak–gene links** | distal peak ↔ gene correlations | `scenicplus`, `muon`/custom correlation |\n",
    "\n",
    "Accessibility is regulatory **potential**, not proof of activity. For the deepest peak-to-gene / enhancer-driver analysis in Python, look at **SCENIC+**; for ATAC-centric workflows, **SnapATAC2**.\n"
   ]
  },
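  {
   "cell_type": "markdown",
   "id": "b3d1c0d1",
   "metadata": {},
   "source": [
    "As a rough illustration of the first row of the table, a gene-activity score can be sketched as the per-cell sum of counts over peaks that overlap a gene body plus an upstream window. Everything here is toy data: the 2 kb upstream extension is an assumption borrowed from common practice, strand handling is simplified to plus-strand genes, and `gene_activity`, `toy_peak_table` and `toy_gene_table` are hypothetical names rather than any library's API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3d1c0d2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def gene_activity(peak_counts, peaks, genes, upstream=2000):\n",
    "    # peak_counts: cells x peaks array; peaks / genes: tables with chrom, start, end\n",
    "    scores = {}\n",
    "    for gene, g in genes.iterrows():\n",
    "        lo, hi = g[\"start\"] - upstream, g[\"end\"]   # plus-strand genes only\n",
    "        hit = ((peaks[\"chrom\"] == g[\"chrom\"])\n",
    "               & (peaks[\"end\"] > lo) & (peaks[\"start\"] < hi)).to_numpy()\n",
    "        scores[gene] = peak_counts[:, hit].sum(axis=1)\n",
    "    return pd.DataFrame(scores)\n",
    "\n",
    "toy_peak_table = pd.DataFrame({\n",
    "    \"chrom\": [\"chr1\", \"chr1\", \"chr1\", \"chr2\"],\n",
    "    \"start\": [500, 3000, 90000, 1000],\n",
    "    \"end\":   [900, 3500, 90500, 1500],\n",
    "})\n",
    "toy_gene_table = pd.DataFrame(\n",
    "    {\"chrom\": [\"chr1\", \"chr2\"], \"start\": [2500, 1200], \"end\": [6000, 4000]},\n",
    "    index=[\"geneA\", \"geneB\"],\n",
    ")\n",
    "toy_counts = np.array([[1, 2, 0, 3],\n",
    "                       [0, 1, 5, 0]])   # 2 cells x 4 peaks\n",
    "toy_activity = gene_activity(toy_counts, toy_peak_table, toy_gene_table)\n",
    "print(toy_activity)\n"
   ]
  },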
  {
   "cell_type": "markdown",
   "id": "5d63bd96",
   "metadata": {},
   "source": [
    "**Takeaways (Part 2)**\n",
    "\n",
    "- Same-nucleus pairing removes cell-type composition as a confounder when you link accessibility to expression.\n",
    "- Cell *types* agree across RNA and ATAC; finer *states* (activation, priming) are where they diverge.\n",
    "- MultiVI gives you the joint embedding; the regulatory interpretation (gene activity, motifs, peak-gene links) is a separate, careful step.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1dff1ba6",
   "metadata": {},
   "source": [
    "---\n",
    "## Caveats & failure modes (keep these in mind)\n",
    "\n",
    "- RNA and protein do **not** always agree — discordance is signal.\n",
    "- ATAC accessibility is **not** regulatory proof.\n",
    "- Integration can **erase** real biology; batch correction can **hallucinate** similarity.\n",
    "- Sparse ATAC is hard to annotate; ADT panels add background and missing-marker problems.\n",
    "- Same-cell multi-omics reduces ambiguity — it does **not** remove experimental artifacts.\n",
    "\n",
    "## References\n",
    "- muon CITE-seq tutorial — https://muon-tutorials.readthedocs.io/en/latest/cite-seq/1-CITE-seq-PBMC-5k.html\n",
    "- scvi-tools totalVI — https://docs.scvi-tools.org/en/stable/tutorials/notebooks/multimodal/totalVI.html\n",
    "- scvi-tools MultiVI — https://docs.scvi-tools.org/en/stable/tutorials/notebooks/multimodal/MultiVI_tutorial.html\n",
    "- muon / MuData — https://muon.readthedocs.io  ·  Single-cell best practices — https://www.sc-best-practices.org\n",
    "- Data: 10x Genomics 5k PBMC protein v3 (CITE-seq) and pbmc_granulocyte_sorted_10k (Multiome).\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}