Skip to content

VLOOKUP in excel and the R programming equivalent

Introducing the BTEP Coding Club

The BTEP Coding club is a new initiative to provide more tailored bioinformatics training to the NCI community. Each month we will feature a 1-hour demo / tutorial of a bioinformatics tool, software, skill, or platform.

We welcome suggestions from the NCI community. Email us at ncibtep@nih.gov if there is a specific topic you would like to see featured.

Learning Objectives.

  1. Learn how to use the VLOOKUP function.
  2. Discuss advantages and disadvantages of using Excel.
  3. Demonstrate similar VLOOKUP functionality with R programming.
  4. Discuss advantages / disadvantages of using R and resources to continue R learning.

What is VLOOKUP?

VLOOKUP is an excel function that performs vertical look-ups, or look up by row across columns.

The two main purposes of VLOOKUP are to:

  1. find a result in your data by using a key in another column and
  2. to merge tables.

Info

In 2019, Excel introduced the function XLOOKUP to succeed VLOOKUP and HLOOKUP. XLOOKUP works similarly but gets around some of the annoying behaviors of the other functions. Learn more about XLOOKUP here.

Given VLOOKUP's popularity, we chose to focus on it, but if you are new to VLOOKUP consider checking out XLOOKUP.

Why use excel?

The use of excel can be advantageous. Excel provides a point-and-click (GUI) environment for users to interact with the program. This eliminates many of the hurdles that are often associated with more complicated command line tools, including R programming. For the most part, excel does not require you to learn complicated syntax to perform a task.

Why avoid excel?

For simple use cases, Excel is a great resource. For example, if you only have small subsets of data you are interested in looking up or merging, excel is likely perfect. However, the more complicated the data wrangling becomes, the less likely Excel will be useful. Documenting data wrangling steps is incredibly important for reproducibility and collaboration. Unlike scripting languages, Excel does not document the steps you take to reshape, transform, or summarize your data.

Warning

Excel will autocorrect imported text fields, which has led to embarrassingly incorrect gene names permeating the literature. Such autocorrection mistakes can be avoided by pre-formatting columns during excel import.

How to use VLOOKUP?

The formula:

VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

In its simplest form, the VLOOKUP function says:

=VLOOKUP(What you want to look up, where you want to look for it, the column number in the range containing the value to return, return an Approximate or Exact match – indicated as 1/TRUE, or 0/FALSE). --- support.microsoft.com

Based on this, we need the following four components:

  1. lookup_value = The value we want to look-up / match.
  2. table_array = The range of values containing the look-up value and the value we want returned. The look-up value must be in the FIRST COLUMN of our table_array.
  3. col_index_num = The column number that includes our return value. The first column in the table_array will have an index of 1, even if that column is not column "A". For example, if the lookup column is column "D", column "D" would have an index of 1.
  4. range_lookup = This is optional, but I recommend always using this option. Use FALSE for an exact match or TRUE for an approximate match. The default is TRUE. NOTE: 99% of use cases are going to use FALSE. This tutorial will focus on the option FALSE. Option TRUE tells excel we are looking for a value within a range of values, and the sort order can affect its behavior.

Example: Using VLOOKUP in Excel

A common use case for VLOOKUP may be merging some additional information with some RNA-seq differential expression results.

The data we will be examining:
1. RNASeq expression data
2. Gene symbols and gene product descriptions from the HUGO Gene Nomenclature Committee (HGNC).

Our differential expression results look like this:

Differential Expression Results

In the first column, we have our genes as Ensembl IDs followed by a column describing the reference genome used and then our differential expression results (logFC, logCPM, F, PValue, FDR).

We would like to combine this information with the official gene symbols and their descriptions from HGNC. That data looks like this:

Gene product descriptions

Let's add gene_symbol and Description to our differential expression results.

We have to add these columns one at a time, so we will begin with gene_symbol.

Using the VLOOKUP function (=VLOOKUP), we will enter the following 4 components:

  1. lookup_value: (A2:A15927)
  2. table_array: Here we are selecting a range of columns and rows from a different file ([gene_descriptions.txt]gene_descriptions!\$A:\$D)
  3. col_index_num: (3)
  4. range_lookup (FALSE)

VLOOKUP gene_symbol

Now, let's add our gene_description.

  1. lookup_value: (A2:A15927)
  2. table_array: Here we are selecting a range of columns and rows from a different file ([gene_descriptions.txt]gene_descriptions!\$A:\$D)
  3. col_index_num: (4)
  4. range_lookup (FALSE)

Notice the only thing that changed in our VLOOKUP formula was the col_index_num.

VLOOKUP gene_description

And our final result:

Merged File

This was pretty easy with excel and required minimal wrangling. We will see an example at the end of this lesson where the opposite is true. But first, let's see how we can get the same result with R.

VLOOKUP with R: combining gene expression results with gene symbols and descriptions

What is R?

R is a language and environment for statistical computing and graphics. ---r-project.org.

  • Open-source and widely used by scientists, not just bioinformaticians.
  • Extensible. R functionality is greatly expanded by using packages created by the R community.
  • Maintained by a network of collaborators - The R Core Team
  • A resource for and by scientists
  • R functionality makes it easy to develop and share packages on any topic.

Why use R?

R is a particularly great resource for statistical analysis, plotting, and report generating.

It's wide use translates to functions and packages covering a broad range of topics.

  • CRAN has 18,000 + packages.
  • Bioconductor (v 3.16) includes 2,183 software packages, 416 experiment data packages, 909 annotation packages, 28 workflows, and 8 books.

Help is easy to find with a quick web search.

R allows for reproducibility!

Why not use R?

The primary reason not to use R is that it has a rather steep learning curve. The best way to learn R is to use R, but to use R, you need to learn a bit of R.

Gene Expression Example.

We want to add the columns gene_symbol and Description from one data table to another data table containing our differential expression results.

First, let's load our data.

gene_info_fp<-file.path("..","data","S1_VLOOKUP","gene_expression",
              "gene_descriptions.txt") 

dexp_fp<-file.path("..","data","S1_VLOOKUP","gene_expression",
                   "diffexp_results_edger_airways.txt") 

gene_info<-read.delim(gene_info_fp)  

dexp<-read.delim(dexp_fp)  

There are several ways to do this in R.

Approach 1: Using base R merge()

Without installing any other packages we can use the base R function merge, which will "merge two data frames by common columns or row names, or do other versions of database join operations."

mdf<-merge(gene_info[c(1,3,4)],dexp,by.x="ensembl_gene_id",
           by.y="feature",all.y=TRUE)  

#Creating a tibble simply to limit the viewable output
tibble::as_tibble(mdf) 
# A tibble: 15,926 × 9
   ensembl_gene_id gene_symbol Description     ref_genome   logFC logCPM       F
   <chr>           <chr>       <chr>           <chr>        <dbl>  <dbl>   <dbl>
 1 ENSG00000000003 TSPAN6      tetraspanin 6   hg38       -0.390    5.06 32.8   
 2 ENSG00000000419 DPM1        dolichyl-phosp… hg38        0.198    4.61  6.90  
 3 ENSG00000000457 SCYL3       SCY1 like pseu… hg38        0.0292   3.48  0.0969
 4 ENSG00000000460 FIRRM       FIGNL1 interac… hg38       -0.124    1.47  0.377 
 5 ENSG00000000971 CFH         complement fac… hg38        0.417    8.09 29.3   
 6 ENSG00000001036 FUCA2       alpha-L-fucosi… hg38       -0.250    5.91 14.9   
 7 ENSG00000001084 GCLC        glutamate-cyst… hg38       -0.0581   4.84  0.167 
 8 ENSG00000001167 NFYA        nuclear transc… hg38       -0.509    4.13 44.9   
 9 ENSG00000001460 STPG1       sperm tail PG-… hg38       -0.136    3.12  1.04  
10 ENSG00000001461 NIPAL3      NIPA like doma… hg38       -0.0500   7.04  0.350 
# … with 15,916 more rows, and 2 more variables: PValue <dbl>, FDR <dbl>

Approach 2: using join functions from dplyr

In this case, we are going to use a left_join in order to retain all rows in our differential expression results. If we only wanted to include matching values, we could instead use an inner_join. See the documentation here.

A left_join keeps all observations in x (the first argument).

It's worth exploring the different types of joins in detail, as these can be extremely useful for controlling data frame merges. See mutating joins and filtering joins from dplyr.

library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
ljdf<-left_join(dexp, gene_info[c(1,3,4)], 
                by = c("feature" = "ensembl_gene_id"))

#Reordered for aesthetics only
tibble::as_tibble(select(ljdf,1,8:9,2:7))
# A tibble: 15,926 × 9
   feature     gene_symbol Description ref_genome   logFC logCPM       F  PValue
   <chr>       <chr>       <chr>       <chr>        <dbl>  <dbl>   <dbl>   <dbl>
 1 ENSG000000… TSPAN6      tetraspani… hg38       -0.390    5.06 32.8    3.12e-4
 2 ENSG000000… DPM1        dolichyl-p… hg38        0.198    4.61  6.90   2.81e-2
 3 ENSG000000… SCYL3       SCY1 like … hg38        0.0292   3.48  0.0969 7.63e-1
 4 ENSG000000… FIRRM       FIGNL1 int… hg38       -0.124    1.47  0.377  5.55e-1
 5 ENSG000000… CFH         complement… hg38        0.417    8.09 29.3    4.63e-4
 6 ENSG000000… FUCA2       alpha-L-fu… hg38       -0.250    5.91 14.9    4.05e-3
 7 ENSG000000… GCLC        glutamate-… hg38       -0.0581   4.84  0.167  6.92e-1
 8 ENSG000000… NFYA        nuclear tr… hg38       -0.509    4.13 44.9    1.00e-4
 9 ENSG000000… STPG1       sperm tail… hg38       -0.136    3.12  1.04   3.35e-1
10 ENSG000000… NIPAL3      NIPA like … hg38       -0.0500   7.04  0.350  5.69e-1
# … with 15,916 more rows, and 1 more variable: FDR <dbl>

Note

There are also packages with specific VLOOKUP functions, but many of these are using match and merge under the hood. For example, check out lookup.

Example 2: Understanding 10X TCR data using annotation output from the cellranger vdj pipeline.

Load the data:

library(readxl)

TCR_fp<-file.path("..","data","S1_VLOOKUP","TCR_example",
              "4261-R1_filtered_contig_annotations.xlsx") 


df<-read_excel(TCR_fp)

knTCR<-read_excel(TCR_fp,sheet=2)

Here, we want to use annotation data from a 10X TCR sequencing run and compare known TCR sequences. Specifically, we want to answer several questions using the contig annotation files from the cellranger vdj pipeline. For a detailed description of each column, see "Contig annotation CSV files", here.

Here's a summary of the cellranger vdj annotation file:

glimpse(df)
Rows: 8,028
Columns: 31
$ barcode               <chr> "AAACCTGAGGCGATAC-1", "AAACCTGAGGCGATAC-1", "AAA…
$ is_cell               <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ contig_id             <chr> "AAACCTGAGGCGATAC-1_contig_1", "AAACCTGAGGCGATAC…
$ high_confidence       <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ length                <dbl> 483, 519, 518, 483, 589, 507, 483, 568, 574, 490…
$ chain                 <chr> "TRB", "TRA", "TRA", "TRB", "TRA", "TRB", "TRB",…
$ v_gene                <chr> "TRBV4-1*01", "TRAV35*01", "TRAV35*01", "TRBV4-1…
$ d_gene                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ j_gene                <chr> "TRBJ2-1*01", "TRAJ28*01", "TRAJ28*01", "TRBJ2-1…
$ c_gene                <chr> "TRBC2*01", "TRAC*01", "TRAC*01", "TRBC2*01", "T…
$ full_length           <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ productive            <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ fwr1                  <chr> "DTEVTQTPKHLVMGMTNKKSLKCEQH", "GQQLNQSPQSMFIQEGE…
$ fwr1_nt               <chr> "GACACTGAAGTTACCCAGACACCAAAACACCTGGTCATGGGAATGAC…
$ cdr1                  <chr> "MGHRA", "SIFNT", "SIFNT", "MGHRA", "TSINN", "SG…
$ cdr1_nt               <chr> "ATGGGGCACAGGGCT", "AGCATATTTAACACC", "AGCATATTT…
$ fwr2                  <chr> "MYWYKQKAKKPPELMFV", "WLWYKQDPGEGPVLLIA", "WLWYK…
$ fwr2_nt               <chr> "ATGTATTGGTACAAGCAGAAAGCTAAGAAGCCACCGGAGCTCATGTT…
$ cdr2                  <chr> "YSYEKL", "LYKAGEL", "LYKAGEL", "YSYEKL", "IRSNE…
$ cdr2_nt               <chr> "TACAGCTATGAGAAACTC", "TTATATAAGGCTGGTGAATTG", "…
$ fwr3                  <chr> "SINESVPSRFSPECPNSSLLNLHLHALQPEDSALYL", "TSNGRLT…
$ fwr3_nt               <chr> "TCTATAAATGAAAGTGTGCCAAGTCGCTTCTCACCTGAATGCCCCAA…
$ cdr3                  <chr> "CASSQALAGPEQFF", "CAGQLISGAGSYQLTF", "CAGQLISGA…
$ cdr3_nt               <chr> "TGCGCCAGCAGCCAAGCACTAGCGGGTCCAGAGCAGTTCTTC", "T…
$ fwr4                  <chr> "GPGTRLTVL", "GKGTKLSVIP", "GKGTKLSVIP", "GPGTRL…
$ fwr4_nt               <chr> "GGGCCAGGGACACGGCTCACCGTGCTAG", "GGGAAGGGGACCAAA…
$ reads                 <dbl> 11975, 773, 1178, 9780, 2375, 16987, 886, 19290,…
$ umis                  <dbl> 78, 5, 8, 57, 18, 119, 5, 140, 64, 9, 40, 32, 44…
$ raw_clonotype_id      <chr> "clonotype1", "clonotype1", "clonotype1", "clono…
$ raw_consensus_id      <chr> "clonotype1_consensus_1", "clonotype1_consensus_…
$ exact_subclonotype_id <dbl> 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, …

Here is what our known TCR data frame looks like:

knTCR
# A tibble: 8 × 6
  Tumor Neoantigen CDR3b             TRBV        CDR3a            TRAV       
  <dbl> <chr>      <chr>             <chr>       <chr>            <chr>      
1  4261 ATL2.1     CASSLAPQGEAFF     TRBV7-6*01  CAGHGGATNKLIF    TRAV35*01  
2  4261 ATL2.2     CASSQERGLAGFNEQFF TRBV4-1*01  CASYTNAGKSTF     TRAV24*01  
3  4261 ATL2.3     CASSQALAGPEQFF    TRBV4-1*01  CAGQLISGAGSYQLTF TRAV35*01  
4  4261 GATA6.1    CASSFEADTQYF      TRBV12-4*01 CAMSSYSSASKIIF   TRAV12-3*01
5  4261 GATA6.2    CATSLEADTQYF      TRBV12-4*01 CAMSSYSSASKIIF   TRAV12-3*01
6  4261 GATA6.3    CASSLYGPLSSYNEQFF TRBV12-4*01 CILSGGGSNYKLTF   TRAV26-2*01
7  4261 GATA6.4    CASSLEATSVYNEQFF  TRBV5-4*01  CILRENNARLMF     TRAV26-2*01
8  4261 GATA6.5    CASSIDGAGGPPGELFF TRBV19*01   CASRTDSWGKLQF    TRAV38-1*01

We want to start by answering a few questions:

How often do the TCRs in knTCR show up in the cellranger output? (Counts)

How often do the CDR3a sequences match; how often do the CDR3b sequences match, and how often do we see that both match together?

In addition to counts, we want to actually be able to examine matched content.

Using Excel

These questions can be answered in excel. However, the data would need to be reshaped. To see what I mean, let's take a brief look at the data in Excel.

10X cellranger vdj annotation output

Notice that there are two or more rows represented by a single cell barcode, and each row is associated with TCR alpha chain or beta chain data. While most cells have a single TRA or TRB sequence, some have two TRAs or two TRBs. These will likely need to be assessed for potential error.

If we want to match CDR3a and CDR3b sequences simultaneously, ideally these should be in the same row.

Let's check out our standard VLOOKUP, without any data reshaping.

We will create two new columns in our data named "knTRA_match" and "knTRB_match".

Adding two new columns

Now, let's do a vlookup to match the cdr3 sequences.

VLOOKUP TCR data

This output is pretty rough without doing additional Excel wizardry for data reshaping. This data set isn't small (8,028 rows by 34 columns), so the probability of making a mistake while reformatting the data is high. Also, it isn't easy to trace steps in Excel, so if I plan to repeat this type of analysis in the future, I will likely spend a similar amount of time completing this task.

Tip

Using R will make this analysis less error prone and more reproducible. Once we generate a script, we can refer back to it and hopefully make minimal edits or no edits at all to apply the same steps to a new data set.

Return to R

Let's reshape the data. I will rely heavily on dplyr functions to perform these tasks.

First, I want to isolate the alpha chain and beta chain data.

#isolate alpha and beta
dfTRA<-df %>% filter(chain=="TRA")
colnames(dfTRA)<-paste(colnames(dfTRA),"TRA", sep="_")
dfTRB<-df %>% filter(chain=="TRB")
colnames(dfTRB)<-paste(colnames(dfTRB),"TRB", sep="_")

I'm also going to modify the column names in knTCR just to make my life easier for joining data sets later.

#modify column names in knTCR for easier matching
colnames(knTCR)
[1] "Tumor"      "Neoantigen" "CDR3b"      "TRBV"       "CDR3a"     
[6] "TRAV"      
colnames(knTCR)<-c("Tumor","Neoantigen","cdr3_TRB",
                   "v_gene_TRB","cdr3_TRA",
                   "v_gene_TRA")

Next, I'm going to join the data back together using a full_join. A full_join keeps all observations from both datasets.

#Join filtered data frames
df_full<-full_join(dfTRA,dfTRB,by = c("barcode_TRA" = "barcode_TRB")) %>%
  select(barcode_TRA,v_gene_TRA,cdr3_TRA,reads_TRA,
         raw_clonotype_id_TRA,v_gene_TRB,cdr3_TRB,
         reads_TRB,raw_clonotype_id_TRB)
#change name of first column to just barcode
colnames(df_full)[1]<-"barcode"

#Add a column with unique identifier
df_full$identifier<-1:nrow(df_full)

df_full
# A tibble: 4,649 × 10
   barcode    v_gene_TRA cdr3_TRA reads_TRA raw_clonotype_i… v_gene_TRB cdr3_TRB
   <chr>      <chr>      <chr>        <dbl> <chr>            <chr>      <chr>   
 1 AAACCTGAG… TRAV35*01  CAGQLIS…       773 clonotype1       TRBV4-1*01 CASSQAL…
 2 AAACCTGCA… TRAV35*01  CAGQLIS…      1178 clonotype1       TRBV4-1*01 CASSQAL…
 3 AAACCTGGT… TRAV17*01  CATARNS…      2375 clonotype415     TRBV12-3*… CASRTGL…
 4 AAACCTGTC… TRAV29/DV… CAAGVSD…      9616 clonotype2       TRBV30*01  CAWSAGV…
 5 AAACCTGTC… TRAV35*01  CAGQLVG…      1119 clonotype19      TRBV4-1*01 CASSFRM…
 6 AAACCTGTC… TRAV35*01  CAGQLVG…      1119 clonotype19      TRBV4-1*01 CASSQAL…
 7 AAACCTGTC… TRAV4*01   CLVGDPG…      2799 clonotype684     TRBV18*01  CASTGQG…
 8 AAACCTGTC… TRAV35*01  CAGQLIS…      1005 clonotype1       TRBV4-1*01 CASSQAL…
 9 AAACGGGAG… TRAV35*01  CAGQLIS…       681 clonotype1       TRBV4-1*01 CASSQAL…
10 AAACGGGAG… TRAV24*01  CASYTNA…      2117 clonotype3       TRBV4-2*01 CASSQER…
# … with 4,639 more rows, and 3 more variables: reads_TRB <dbl>,
#   raw_clonotype_id_TRB <chr>, identifier <int>

Now for matching and counting where our knowns are found in our unknowns.

Let's start with matching all combinations of cdr3_TRA and cdr3_TRB.

#create a data frame and inner join with knTCR
fullTRATCR<-knTCR %>% select(cdr3_TRA,cdr3_TRB) %>% 
  distinct(cdr3_TRA,cdr3_TRB) %>%
  inner_join(df_full) 
Joining, by = c("cdr3_TRA", "cdr3_TRB")

An inner_join() only keeps observations from x that have a matching key in y.

The most important property of an inner join is that unmatched rows in either input are not included in the result. This means that generally inner joins are not appropriate in most analyses, because it is too easy to lose observations.---dplyr.tidyverse.org

#Count where we have matches
fullTRATCRc<- fullTRATCR %>% 
  count(cdr3_TRA,cdr3_TRB,name="TRATRB_match_count")

Next, we can examine unique matches of cdr3_TRA only. However, we want to avoid double counting duplicate cdr3_TRAs as an artifact of data reshaping. To do this, we are going to create a data frame to use for filtering. We will do the same thing for TRB.

#create a table to avoid double counting

#we are only examining cells that had a match for both TRA and TRB 
#cdr3 sequences.
filttab_TRA<-df_full %>% filter(barcode %in% fullTRATCR$barcode) %>% 
  group_by(barcode) %>% 
  #count number of barcodes and number of distinct TRA or TRB 
  summarize(n_barcode=n(),n_distinct_TRA=n_distinct(cdr3_TRA),
            n_distinct_TRB=n_distinct(cdr3_TRB)) %>%
  #We want to be able to filter any instances of two barcodes where the 
  #TRA is the same
  filter(n_barcode==2 & n_distinct_TRA==1)

#repeat for TRB
filttab_TRB<-df_full %>% filter(barcode %in% fullTRATCR$barcode) %>%
  group_by(barcode) %>% 
  summarize(n_barcode=n(),n_distinct_TRA=n_distinct(cdr3_TRA),
            n_distinct_TRB=n_distinct(cdr3_TRB)) %>%
  filter(n_barcode==2 & n_distinct_TRB==1)

Get unique matches of cdr3_TRA.

#to look at unique matches, we will anti-join to drop 
#previously matched rows  
a<-df_full %>% anti_join(fullTRATCR) %>%
  #filter out the barcodes from our filter table above
  filter(!barcode %in% filttab_TRA$barcode) %>%
  group_by(barcode) %>% 
  #only count unique cdr3_TRA per barcode
  distinct(cdr3_TRA,.keep_all=TRUE) %>% ungroup() %>% 
  #perform an inner_join to match with knTCR
  inner_join(x=distinct(select(knTCR,cdr3_TRA)),y=.) 
Joining, by = c("barcode", "v_gene_TRA", "cdr3_TRA", "reads_TRA",
"raw_clonotype_id_TRA", "v_gene_TRB", "cdr3_TRB", "reads_TRB",
"raw_clonotype_id_TRB", "identifier")
Joining, by = "cdr3_TRA"
#retrieve match counts
ac<-a %>% count(cdr3_TRA,name="TRA_match_count")

ac
# A tibble: 4 × 2
  cdr3_TRA         TRA_match_count
  <chr>                      <int>
1 CAGQLISGAGSYQLTF               6
2 CAMSSYSSASKIIF                 5
3 CASYTNAGKSTF                  29
4 CILSGGGSNYKLTF                 2

Get unique matches of cdr3_TRB.

b<-df_full %>% anti_join(fullTRATCR) %>% 
  filter(!barcode %in% filttab_TRB$barcode) %>%
  group_by(barcode) %>%
  distinct(cdr3_TRB,.keep_all=TRUE) %>% ungroup() %>% 
  inner_join(x=distinct(select(knTCR,cdr3_TRB)),y=.)
Joining, by = c("barcode", "v_gene_TRA", "cdr3_TRA", "reads_TRA",
"raw_clonotype_id_TRA", "v_gene_TRB", "cdr3_TRB", "reads_TRB",
"raw_clonotype_id_TRB", "identifier")
Joining, by = "cdr3_TRB"
bc <- b %>% count(cdr3_TRB,name="TRB_match_count")

bc
# A tibble: 5 × 2
  cdr3_TRB          TRB_match_count
  <chr>                       <int>
1 CASSFEADTQYF                   27
2 CASSIDGAGGPPGELFF               1
3 CASSLYGPLSSYNEQFF              73
4 CASSQALAGPEQFF                 96
5 CASSQERGLAGFNEQFF              74

Now, let's create a table of match counts.

ctab<-knTCR %>% 
  left_join(ac) %>% 
  left_join(bc) %>% 
  left_join(fullTRATCRc) %>% 
  #This replaces NAs with 0.
  mutate(across(where(is.numeric),~ifelse(is.na(.),0,.)))
Joining, by = "cdr3_TRA"
Joining, by = "cdr3_TRB"
Joining, by = c("cdr3_TRB", "cdr3_TRA")
ctab
# A tibble: 8 × 9
  Tumor Neoantigen cdr3_TRB       v_gene_TRB cdr3_TRA v_gene_TRA TRA_match_count
  <dbl> <chr>      <chr>          <chr>      <chr>    <chr>                <dbl>
1  4261 ATL2.1     CASSLAPQGEAFF  TRBV7-6*01 CAGHGGA… TRAV35*01                0
2  4261 ATL2.2     CASSQERGLAGFN… TRBV4-1*01 CASYTNA… TRAV24*01               29
3  4261 ATL2.3     CASSQALAGPEQFF TRBV4-1*01 CAGQLIS… TRAV35*01                6
4  4261 GATA6.1    CASSFEADTQYF   TRBV12-4*… CAMSSYS… TRAV12-3*…               5
5  4261 GATA6.2    CATSLEADTQYF   TRBV12-4*… CAMSSYS… TRAV12-3*…               5
6  4261 GATA6.3    CASSLYGPLSSYN… TRBV12-4*… CILSGGG… TRAV26-2*…               2
7  4261 GATA6.4    CASSLEATSVYNE… TRBV5-4*01 CILRENN… TRAV26-2*…               0
8  4261 GATA6.5    CASSIDGAGGPPG… TRBV19*01  CASRTDS… TRAV38-1*…               0
# … with 2 more variables: TRB_match_count <dbl>, TRATRB_match_count <dbl>

Lastly, we may want to examine where we had matches.

Note

You will need to install the package data.table to use rbindlist.

tables<-list(a,b,fullTRATCR)
names(tables)<-c("TRA","TRB","TRA_TRB")
matches<-data.table::rbindlist(tables,use.names=TRUE,idcol="Type_of_Match")
df_full2<-df_full %>% left_join(matches)
Joining, by = c("barcode", "v_gene_TRA", "cdr3_TRA", "reads_TRA",
"raw_clonotype_id_TRA", "v_gene_TRB", "cdr3_TRB", "reads_TRB",
"raw_clonotype_id_TRB", "identifier")
df_full2
# A tibble: 4,649 × 11
   barcode    v_gene_TRA cdr3_TRA reads_TRA raw_clonotype_i… v_gene_TRB cdr3_TRB
   <chr>      <chr>      <chr>        <dbl> <chr>            <chr>      <chr>   
 1 AAACCTGAG… TRAV35*01  CAGQLIS…       773 clonotype1       TRBV4-1*01 CASSQAL…
 2 AAACCTGCA… TRAV35*01  CAGQLIS…      1178 clonotype1       TRBV4-1*01 CASSQAL…
 3 AAACCTGGT… TRAV17*01  CATARNS…      2375 clonotype415     TRBV12-3*… CASRTGL…
 4 AAACCTGTC… TRAV29/DV… CAAGVSD…      9616 clonotype2       TRBV30*01  CAWSAGV…
 5 AAACCTGTC… TRAV35*01  CAGQLVG…      1119 clonotype19      TRBV4-1*01 CASSFRM…
 6 AAACCTGTC… TRAV35*01  CAGQLVG…      1119 clonotype19      TRBV4-1*01 CASSQAL…
 7 AAACCTGTC… TRAV4*01   CLVGDPG…      2799 clonotype684     TRBV18*01  CASTGQG…
 8 AAACCTGTC… TRAV35*01  CAGQLIS…      1005 clonotype1       TRBV4-1*01 CASSQAL…
 9 AAACGGGAG… TRAV35*01  CAGQLIS…       681 clonotype1       TRBV4-1*01 CASSQAL…
10 AAACGGGAG… TRAV24*01  CASYTNA…      2117 clonotype3       TRBV4-2*01 CASSQER…
# … with 4,639 more rows, and 4 more variables: reads_TRB <dbl>,
#   raw_clonotype_id_TRB <chr>, identifier <int>, Type_of_Match <chr>

Now that we have this code, we can use it to examine additional cellranger vdj annotation output files. We would just need to make sure that the column names of any known TCR data we are interested in comparing are the same.

Interested in learning R

Check out our introductory courses:

  1. R Introductory Series
  2. Data Visualization with R
  3. Data Wrangling with R

References

  1. Bioinformatics Workbook: Data Wrangling

  2. Microsoft Documentation

Acknowledgements

The 10X TCR example data was generously provided by Paul Robbins. The differential expression results were derived from data extracted from the R package, airway, from Himes et al. 2014.

What's next for the BTEP Coding Club?

April 19th, 2023: Join us for a demo on Jupyter notebooks by Joe Wu.