Bioinformatics Training and Education Program

BTEP Question Forum

BTEP maintains several Question and Answer Forums of interest to the NCI/CCR community.

If you wish to ask a question, go to the Ask Question Page and submit it there.

Can you give some suggestions for cell identity annotation and recommend packages that are good at it? What is the most common practice: manual or automated annotation?

The question from Lirong

2 Answers:


This question overlaps with a set of questions previously submitted to the forum. Unless you have a single marker that reliably distinguishes a cell type from all others, automated cell type identification is preferred over manual approaches. If you want to compare the whole transcriptome of each cell against reference databases, SingleR is a widely used and versatile R package for cell type identification. It computes Spearman correlations between the transcriptome of each cell (the gene expression levels in your data) and the reference transcriptome of each cell type, drawn from databases such as ImmGen (for mouse) or the Human Primary Cell Atlas and the combined Blueprint+ENCODE data sets (for human). For each cell in your data set, SingleR assigns the highest-scoring cell type as the predicted label. SingleR also offers options to check the robustness of these predictions and to remove low-quality labels when multiple cell types score similarly for a cell.
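
To make the scoring idea concrete, here is a minimal Python sketch of correlation-based annotation. It is not SingleR and does not reproduce SingleR's internals; it only illustrates "correlate each cell with each reference profile, take the best match, and flag close calls". The function names (`ranks`, `spearman`, `annotate`), the `min_margin` cutoff, and all expression values are made up for illustration, and the margin rule is a simplified stand-in for SingleR's own label pruning.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no rank information
    return cov / (var_x * var_y) ** 0.5

def annotate(cell, references, min_margin=0.2):
    """Score one cell against every reference profile and pick the best label.

    The label becomes "ambiguous" when the best and second-best scores are
    closer than min_margin (a hypothetical cutoff chosen for this toy example).
    """
    scores = {label: spearman(cell, profile) for label, profile in references.items()}
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ordered[0]
    margin = best_score - ordered[1][1] if len(ordered) > 1 else float("inf")
    return (best_label if margin >= min_margin else "ambiguous"), scores

# Illustrative values only; gene order: CD3E, CD19, LYZ, NKG7, MS4A1
references = {
    "T cell":   [9.0, 0.5, 1.0, 4.0, 0.2],
    "B cell":   [0.3, 8.0, 1.2, 0.1, 9.5],
    "Monocyte": [0.2, 0.4, 10.0, 1.0, 0.6],
}
cell_a = [7.0, 0.0, 2.0, 3.0, 0.5]  # resembles the T cell profile
cell_b = [6.0, 5.0, 0.1, 1.0, 3.0]  # mixes T cell and B cell signal

print(annotate(cell_a, references)[0])  # T cell
print(annotate(cell_b, references)[0])  # ambiguous
```

In this toy data the second cell correlates about equally (and weakly) with the T cell and B cell profiles, so it is flagged rather than forced into a label, which is the kind of case where checking prediction robustness matters.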

For other reference data sets, such as Tabula Muris, one can use Seurat’s reference-based integration and label transfer approach, in which the reference data set guides the integration and its cell type labels are transferred to your cells. Seurat also provides another option for cell type identification through its AddModuleScore function. You supply gene sets characteristic of different cell types, and Seurat computes a score for each gene set in every cell; the highest-scoring cell type in each cell is then assigned as that cell’s type. Seurat’s cell cycle scoring relies on the very same AddModuleScore function, using canonical markers of the S and G2/M phases (cells that score low for both are called G1).
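
The gene-set scoring idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Seurat's AddModuleScore: Seurat subtracts the average expression of control genes drawn from matching expression bins, whereas this sketch simply uses all genes outside the set as background. The function names (`module_score`, `assign_cell_type`), the marker sets, and the expression values are assumptions made up for the example.

```python
from statistics import mean

def module_score(cell, gene_set):
    """Mean expression of the gene set minus mean expression of all other genes.

    `cell` maps gene name -> normalized expression for a single cell.
    Using every non-set gene as background is a simplification.
    """
    in_set = [value for gene, value in cell.items() if gene in gene_set]
    background = [value for gene, value in cell.items() if gene not in gene_set]
    if not in_set or not background:
        raise ValueError("need at least one gene inside and one outside the set")
    return mean(in_set) - mean(background)

def assign_cell_type(cell, marker_sets):
    """Score every marker set in one cell and return the top-scoring cell type."""
    scores = {cell_type: module_score(cell, genes)
              for cell_type, genes in marker_sets.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative marker sets and expression values only.
marker_sets = {
    "T cell":   {"CD3E", "CD3D"},
    "B cell":   {"CD19", "MS4A1"},
    "Monocyte": {"LYZ", "CD14"},
}
cell = {"CD3E": 4.0, "CD3D": 3.0, "CD19": 0.0,
        "MS4A1": 0.5, "LYZ": 1.0, "CD14": 0.5}

label, scores = assign_cell_type(cell, marker_sets)
print(label)  # T cell
```

Because the assignment is a plain maximum over scores, a cell always receives some label even when none of the gene sets fits well, so the scores themselves are worth inspecting before trusting the label.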

Answered on July 27th, 2020 by

While the automated tools seem to be getting better and better (in part because the reference datasets and annotations are improving), we always recommend that subject matter experts go in and confirm or validate the assigned labels. Many times folks will just end up doing the annotations manually.

Answered on July 29th, 2020 by