Bioinformatics Training and Education Program

BTEP Question Forum


When should we batch correct scRNA-Seq data? When should we avoid it?

When should we 'batch correct' scRNA-Seq data? When should we avoid it? How do we know when it is appropriate and when it isn't? Is there a difference between 'batch correction' as employed for bulk RNA-Seq and for scRNA-Seq? Should we distinguish between 'batch correction' and 'dataset alignment'? When is one appropriate over the other? What are the tell-tale signs of an 'overcorrected' batch correction? Do underlying count matrices ever get modified, or is the modification only for the reduced-dimension matrix and the resulting clustering and visualizations? Is downstream differential expression testing or things like trajectory modeling performed on the modified or the original matrix?

2 Answers:


The major sources of batch effects arise from samples with significantly different sequencing depth and saturation, varying sequencing instruments (e.g., MiSeq, NextSeq, and HiSeq) and technologies (e.g., Chromium and SMART-seq2). These sources of technical variation can mask the biological variation among the samples and typically require batch correction. A practical way to observe potential batch effects and the impact of batch correction is to visualize the cell groups on a t-SNE or UMAP plot by labelling cells in terms of their sample group (e.g., case/control) and batch number before and after the batch correction. If there are multiple groups of samples in the data, such as WT/KO or stimulated/control, this would typically create a separation between the sample groups (mainly driven by the biological signal) in your UMAP or t-SNE. After the batch correction, one can expect to see a greater between-group separation of cells (e.g., WT and KO separated at least in some regions of the UMAP) than the within-group separation (e.g., subpopulations among WT cells). These are overall trends that would be expected even though some rare subpopulations can still get clustered distantly from other cells in the same sample type.
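
To make this check concrete, here is a minimal Seurat sketch of the labelling strategy described above. It assumes a normalized Seurat object named seu with hypothetical metadata columns "batch" and "group" (your column names will differ):

```r
library(Seurat)

# 'seu' is assumed to be a normalized Seurat object with hypothetical
# metadata columns "batch" (sequencing run) and "group" (e.g., WT/KO).
seu <- FindVariableFeatures(seu)
seu <- ScaleData(seu)
seu <- RunPCA(seu)
seu <- RunUMAP(seu, dims = 1:30)

# Label the same embedding by batch and by biological group.
# Cells splitting by batch *within* a biological group suggests a batch
# effect; repeat the same plots after correction to assess its impact.
DimPlot(seu, reduction = "umap", group.by = "batch")
DimPlot(seu, reduction = "umap", group.by = "group")
```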

Some of the tell-tale signs of an ‘overcorrected’ batch correction are the following:

(1) A significant fraction of the cluster-specific markers are genes expressed highly across many cell types (e.g., ribosomal genes).
(2) There is significant overlap between the cluster-specific markers.
(3) Cluster-specific markers you would expect to observe are missing (e.g., canonical markers of a certain T-cell subtype known to be present in the data set).
(4) Differential expression hits associated with pathways you would expect, given the cell-type composition and experimental conditions of your samples, are absent.

These outcomes suggest that the separation between the clusters on the UMAP may be driven by very few genes. In such cases, one can reconsider using a simpler, less aggressive approach for batch correction, such as a linear method (e.g., ComBat) instead of a non-linear method (e.g., mutual nearest neighbors). One can also consider skipping batch correction altogether if strong batch effects aren't observed in the t-SNE or UMAP and the major sources of batch effects (e.g., samples with significant differences in sequencing depth and saturation) don't apply to your data set.
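
As a minimal sketch of the simpler linear option mentioned above, ComBat from the sva package can be applied to a log-normalized expression matrix. The names logcounts and batch are placeholders for your own matrix and per-cell batch labels:

```r
library(sva)

# logcounts: genes x cells log-normalized matrix (placeholder name)
# batch: factor of batch labels, one entry per cell (placeholder name)
corrected <- ComBat(dat = as.matrix(logcounts), batch = batch)
```

Note that ComBat operates on a dense matrix, which can be memory-intensive for large single-cell data sets.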

In the standard Seurat workflow, the alignment of multiple samples (sample integration aiming to address the batch effects) takes place in the reduced-dimensional space obtained from the scaled data through a dimensionality reduction approach (e.g., PCA, followed by t-SNE or UMAP). This approach also allows users to regress out variables such as cell cycle scores, UMI counts, and percent mitochondrial gene expression during the scaling step. Here, the underlying count matrices do not get modified. In contrast, differential expression and trajectory modeling are performed on the normalized count matrix, not in the reduced space.
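
For the regression step, Seurat exposes this through the vars.to.regress argument of ScaleData(); only the scaled data are adjusted. A minimal sketch, assuming the cell cycle scores were added beforehand with CellCycleScoring():

```r
# Regress out technical covariates while scaling; only the scaled data
# used for PCA/UMAP are affected, not the underlying count matrices.
seu <- ScaleData(
  seu,
  vars.to.regress = c("percent.mt", "nCount_RNA", "S.Score", "G2M.Score")
)
```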

There are also ways to address potential batch effects from different sequencing runs before generating the raw count matrices. CellRanger's sample aggregation offers count matrix generation with read subsampling in deeper samples for depth normalization, in addition to batch correction with mutual nearest neighbors. Compared to the Seurat alignment algorithm (each sample is first processed separately, then all samples are aligned/integrated), CellRanger's approach is more of a pure batch correction step. It is therefore helpful to distinguish between approaches that only perform batch correction and sample alignment/integration that takes place in the reduced-dimensional space.
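
For reference, a sketch of how this aggregation is typically invoked on the command line (sample IDs and paths are placeholders, and the exact CSV columns and available flags vary by CellRanger version):

```
# aggr.csv (placeholder paths):
# library_id,molecule_h5
# run1,/path/to/run1/outs/molecule_info.h5
# run2,/path/to/run2/outs/molecule_info.h5

# Subsample reads so all runs reach comparable depth:
cellranger aggr --id=combined --csv=aggr.csv --normalize=mapped
```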


Answered on July 23rd, 2020

I'll try to address the questions individually where I can:

When should we ‘batch correct’ scRNA-Seq data? When should we avoid it? How do we know when it is appropriate and when it isn’t?

Batch correction should be evaluated for each scRNA-Seq experiment. The general approach is to apply your existing knowledge of the experiment and determine whether the projection (e.g., from UMAP or t-SNE) of the batch-corrected or the uncorrected data is more representative of the experiment. In practice, this means weighing how similar the individual samples are expected to be against how distinct they appear in the projections. It is recommended to use batch correction only if necessary; if batch correction does not appear to have made a significant difference in the projection, consider using the uncorrected data. There are also experimental techniques to limit the need for batch correction, or to identify when it is necessary, such as cell hashing, which allows multiple samples to be processed in the same run, or cell spike-ins, which add a distinct cell type that should overlap between samples.
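
One practical way to run this comparison is to build both an uncorrected and a corrected version of the same data and inspect the two embeddings side by side. A minimal Seurat sketch, assuming seu.list is a placeholder list of per-sample Seurat objects:

```r
library(Seurat)

# Uncorrected: merge the samples and embed directly.
merged <- merge(seu.list[[1]], y = seu.list[-1])
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged)
merged <- ScaleData(merged)
merged <- RunPCA(merged)
merged <- RunUMAP(merged, dims = 1:30)

# Corrected: Seurat's anchor-based integration.
seu.list   <- lapply(seu.list, NormalizeData)
seu.list   <- lapply(seu.list, FindVariableFeatures)
anchors    <- FindIntegrationAnchors(object.list = seu.list)
integrated <- IntegrateData(anchorset = anchors)
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)
integrated <- RunUMAP(integrated, dims = 1:30)

# Compare DimPlot(merged, group.by = "orig.ident") against
# DimPlot(integrated, group.by = "orig.ident") to judge which
# embedding better reflects the expected biology.
```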

Is there a difference between ‘batch correction’ as employed for bulk RNA-Seq and those for scRNA-Seq?

Batch correction, in theory, exists for the same reason in bulk RNA-Seq and scRNA-Seq: a variation in experimental design, such as collection time, sequencing run time, or even a different box of pipette tips, may have left a residual effect on the samples that appears in the sequencing results. The role of batch correction is to identify these variations and mitigate their effects. The difference in ‘batch correction’ between the two sequencing methods is mostly algorithmic. Some of the techniques used in bulk RNA-Seq may be unable to correct batch effects in scRNA-Seq because of data size (10k cells vs. 10 total samples) or data sparsity; likewise, scRNA-Seq techniques may be overkill for the smaller experimental designs associated with bulk RNA-Seq.
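
For contrast, a common bulk RNA-Seq approach is limma's removeBatchEffect(), which fits and subtracts a linear batch term; such a linear model is straightforward with ~10 samples but may be too blunt for sparse single-cell data. A minimal sketch, with logCPM and batch as placeholder names:

```r
library(limma)

# logCPM: genes x samples log-CPM matrix (placeholder name)
# batch: factor of batch labels, one entry per sample (placeholder name)
corrected <- removeBatchEffect(logCPM, batch = batch)
```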

Should we distinguish between ‘batch correction’ and ‘dataset alignment’? When is one appropriate over the other?

I don't have a complete answer to this question, sorry! I do believe that Cihan answered it above, in the context of batch correction at the earlier CellRanger stage versus alignment/integration within the reduced-dimensional space.

What are the tell-tale signs of an ‘overcorrected’ batch correction?

The tell-tale sign of an overcorrected batch correction is a complete overlap of the samples in the projection. This tends to occur when the samples are very similar and the experimental design hinges on minor differences between them. See the attached image for an example of overcorrected data.

Do underlying count matrices ever get modified, or is the modification only for the reduced-dimension matrix and the resulting clustering and visualizations? Is downstream differential expression testing or things like trajectory modeling performed on the modified or the original matrix?

The batch correction technique in Seurat, which is based on canonical correlation analysis (CCA) and influenced by mutual nearest neighbors (MNN), creates a new data assay in the Seurat object, typically named “integrated.” This is a subset of the entire counts matrix based on a fixed number of “anchor” genes, which tend to be the most variable genes in the dataset. The underlying count matrices are unaffected; within the Seurat object they are named “RNA” and, after SCTransform normalization, “SCT.” The batch correction/integration is primarily used to improve clustering and cell definition. Under these batch-corrected cluster identities, downstream differential expression should only be performed using either the normalized counts or the raw counts. Conducting differential expression or other downstream analyses on the “integrated” assay would be based only on the subset of genes used to perform the batch correction, losing much of the subtler variance that would otherwise be captured.
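
In practice, this means switching assays before marker detection. A minimal sketch, assuming an object named integrated produced by a standard Seurat integration run:

```r
# Cluster on the batch-corrected "integrated" assay...
DefaultAssay(integrated) <- "integrated"
integrated <- ScaleData(integrated)
integrated <- RunPCA(integrated)
integrated <- FindNeighbors(integrated, dims = 1:30)
integrated <- FindClusters(integrated)

# ...then test for markers on the uncorrected, normalized "RNA" assay.
DefaultAssay(integrated) <- "RNA"
markers <- FindMarkers(integrated, ident.1 = 0, ident.2 = 1)
```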


Answered on July 23rd, 2020