Skip to content

Joining and Transforming Data with dplyr

Objectives

Today we will continue to wrangle data using the tidyverse package, dplyr. We will learn:

  1. how to join data frames using dplyr
  2. how to transform and create new variables using mutate()

To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.

Loading Tidyverse

In this lesson, we are continuing with the package dplyr. We do not need to load the dplyr package separately, as it is a core tidyverse package. Again, if you need to install and load only dplyr, use install.packages("dplyr") and library(dplyr).

Load the package:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data

Let's load in some data to work with. In this lesson, we will continue to use sample metadata, raw count data, and differential expression results derived from the airway RNA-Seq project.

Get the sample metadata:

#sample information  
smeta<-read_delim("./data/airway_sampleinfo.txt")
Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get the raw counts:

#raw count data
acount<-read_csv("./data/airway_rawcount.csv") %>%
  dplyr::rename("Feature" = "...1")
New names:
Rows: 64102 Columns: 9
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
SRR1039...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Get the differential expression results:

#differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")
Rows: 15926 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Joining data frames

Any given project will often include multiple sets of data from different sources. These related data are generally stored across multiple data frames. In such cases, while each data frame likely contains different types of data, an identifier column or key (e.g., "sampleID") can be used to unite or combine aspects of the data, which is useful depending on your analysis goal(s).

There are a series of functions from dplyr devoted to the purpose of joining data frames. There are two types of joins: mutating joins and filtering joins.

Mutating joins

Imagine we have two data frames x and y. A mutating join will keep all columns from x and y by adding columns from y to x.

left_join() - Output contains all rows from x

return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8)

Left Join

Animation from Tidyexplain, Garrick Aden-Buie

right_join() - Output contains all rows from y

return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8).

Right Join

Animation from Tidyexplain, Garrick Aden-Buie

inner_join() - Output contains matched rows from x

return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. --- R documentation, dplyr (version 0.7.8)

Unmatched values from x and unmatched values from y will be dropped. So use caution, as it is easy to lose observations with an inner join.

Inner Join

Animation from Tidyexplain, Garrick Aden-Buie

full_join() - Output contains all rows from x and y

return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. --- R documentation, dplyr (version 0.7.8).

Full Join

Animation from Tidyexplain, Garrick Aden-Buie

Note

The R documentation for dplyr was updated with dplyr v1.0.9. However, these descriptions still stand and are clearer (in my opinion) than the new documentation.

The most common type of join is the left_join(). Let's see this in action:

#reshape acount
acount_smeta<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
                    values_to= "Count") %>% left_join(smeta, by=c("Sample"="Run"))
acount_smeta
# A tibble: 512,816 × 11
   Feature        Sample Count SampleName cell  dex   albut avgLength Experiment
   <chr>          <chr>  <dbl> <chr>      <chr> <chr> <chr>     <dbl> <chr>     
 1 ENSG000000000… SRR10…   679 GSM1275862 N613… untrt untrt       126 SRX384345 
 2 ENSG000000000… SRR10…   448 GSM1275863 N613… trt   untrt       126 SRX384346 
 3 ENSG000000000… SRR10…   873 GSM1275866 N052… untrt untrt       126 SRX384349 
 4 ENSG000000000… SRR10…   408 GSM1275867 N052… trt   untrt        87 SRX384350 
 5 ENSG000000000… SRR10…  1138 GSM1275870 N080… untrt untrt       120 SRX384353 
 6 ENSG000000000… SRR10…  1047 GSM1275871 N080… trt   untrt       126 SRX384354 
 7 ENSG000000000… SRR10…   770 GSM1275874 N061… untrt untrt       101 SRX384357 
 8 ENSG000000000… SRR10…   572 GSM1275875 N061… trt   untrt        98 SRX384358 
 9 ENSG000000000… SRR10…     0 GSM1275862 N613… untrt untrt       126 SRX384345 
10 ENSG000000000… SRR10…     0 GSM1275863 N613… trt   untrt       126 SRX384346 
# ℹ 512,806 more rows
# ℹ 2 more variables: Sample.y <chr>, BioSample <chr>

Notice the use of by in left_join. The argument by requires the column or columns that we want to join by. If the column we want to join by has a different name, we can use the notation above, which says to match Sample from acount to Run from smeta.

Filtering joins

Filtering joins result in filtered x data based on matching or non-matching with y. These joins do not add columns from y to x.

semi_join()

return all rows from x where there are matching values in y, keeping just columns from x. --- R documentation, dplyr (version 0.7.8)

Semi Join

Animation from Tidyexplain, Garrick Aden-Buie

anti_join()

return all rows from x where there are not matching values in y, keeping just columns from x. --- R documentation, dplyr (version 0.7.8)

Anti Join

Animation from Tidyexplain, Garrick Aden-Buie

Let's see a brief example of semi-join:

#reshape acount
smeta_f<-smeta %>% filter(Run %in% c("SRR1039512","SRR1039508"))  

acount_L<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
                    values_to= "Count")

semi_join(acount_L,smeta_f, by=c("Sample"="Run"))
# A tibble: 128,204 × 3
   Feature         Sample     Count
   <chr>           <chr>      <dbl>
 1 ENSG00000000003 SRR1039508   679
 2 ENSG00000000003 SRR1039512   873
 3 ENSG00000000005 SRR1039508     0
 4 ENSG00000000005 SRR1039512     0
 5 ENSG00000000419 SRR1039508   467
 6 ENSG00000000419 SRR1039512   621
 7 ENSG00000000457 SRR1039508   260
 8 ENSG00000000457 SRR1039512   263
 9 ENSG00000000460 SRR1039508    60
10 ENSG00000000460 SRR1039512    40
# ℹ 128,194 more rows

In this case, we could have used filter. However, it is easier to use a filtering join if we know we want to save elements from another table. This saves us from having to determine the filtering criteria for use with filter().

Transforming variables

Data wrangling often involves transforming one variable to another. For example, we may be interested in log transforming a variable or adding two variables to create a third. In dplyr this can be done with mutate(). mutate() allows us to create a new variable from existing variables.

mutate()

mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL). --- dplyr.tidyverse.org

Let's create a column in our original differential expression data frame denoting significant transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater than or equal to 2).

dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)
head(dexp_sigtrnsc$Significant)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

This creates a column named Significant that contains TRUE values where the expression above was true (meaning significant in this case) and FALSE where the expression was FALSE.

.keep

You can control which columns from the data are included in your output using .keep.
- "all" - default - keep all columns.
- "used" - keeps the transformed columns and new columns.
- "unused" - keeps only unused column and new columns.
- "none" - keeps the new columns and grouping variables.

Recoding variables based on values

dplyr offers functions for recoding variables: if_else() and case_when().

if_else - uses two logical conditions

dexp_sigtrnsc2<- dexp %>% 
  mutate(Significant= if_else(FDR<0.05 & abs(logFC) >=2,
                              "Significant", "Not Significant"))

case_when - uses multiple logical conditions. Case_when uses a series of formulas (Syntax: logical_test ~ Value_if_True).

dexp_sigtrnsc3<- dexp %>% 
  mutate(Significant= 
           case_when(FDR<0.05 & logFC >=2 ~ "Up",
                     FDR<0.05 & logFC <=-2 ~ "Down",
                     .default = "Not Significant")
  )

Let's look at another example. This time let's log transform our FDR corrected p-values.

dexp %>% mutate(logFDR = log10(FDR), .keep="none") 
# A tibble: 15,926 × 1
    logFDR
     <dbl>
 1 -2.55  
 2 -1.11  
 3 -0.0735
 4 -0.166 
 5 -2.42  
 6 -1.73  
 7 -0.100 
 8 -2.90  
 9 -0.320 
10 -0.158 
# ℹ 15,916 more rows

Here, .keep="none" resulted in retaining only a single column ("logFDR").

Mutating several variables at once

What if we want to transform all of our counts spread across multiple columns in acount using scale(), which applies a z-score transformation? In this case we use across() within mutate(), which has replaced the scoped verbs (mutate_if,mutate_at, and mutate_all).

Let's see this in action.

acount %>% mutate(across(where(is.numeric),scale))
# A tibble: 64,102 × 9
   Feature         SRR1039508[,1] SRR1039509[,1] SRR1039512[,1] SRR1039513[,1]
   <chr>                    <dbl>          <dbl>          <dbl>          <dbl>
 1 ENSG00000000003         0.103          0.0527         0.0991         0.0643
 2 ENSG00000000005        -0.0929        -0.100         -0.0821        -0.0887
 3 ENSG00000000419         0.0418         0.0756         0.0468         0.0482
 4 ENSG00000000457        -0.0179        -0.0281        -0.0275        -0.0272
 5 ENSG00000000460        -0.0756        -0.0814        -0.0738        -0.0756
 6 ENSG00000000938        -0.0929        -0.100         -0.0817        -0.0887
 7 ENSG00000000971         0.845          1.16           1.20           1.51  
 8 ENSG00000001036         0.321          0.262          0.278          0.242 
 9 ENSG00000001084         0.0568         0.0295         0.0414         0.0962
10 ENSG00000001167         0.0208        -0.0196         0.0142        -0.0231
# ℹ 64,092 more rows
# ℹ 4 more variables: SRR1039516 <dbl[,1]>, SRR1039517 <dbl[,1]>,
#   SRR1039520 <dbl[,1]>, SRR1039521 <dbl[,1]>

For further information on across, check out this great tutorial here.

Coercing variables with mutate

Mutate can also be used to coerce variables. Again, we need to use across() and where().

#convert character vectors to factors
ex_coerce<-acount_smeta %>% mutate(across(where(is.character),as.factor)) 

Using rowwise() and mutate()

mutate() works across columns, and it is not as easy to apply operations across rows for some functions (e.g., mean).

What if we wanted a new column that stored the mean of each row in our data frame?

Let's create a small data frame, and use mutate() to get the mean(). What happens when we use mean as is?

df<-data.frame(A=c(1,2,3),B=c(4,5,6),C=c(7,8,9))
df
  A B C
1 1 4 7
2 2 5 8
3 3 6 9
df %>% mutate(D= mean(c(A,B,C)))
  A B C D
1 1 4 7 5
2 2 5 8 5
3 3 6 9 5
df %>% mutate(D = (A+B+C)/3)
  A B C D
1 1 4 7 4
2 2 5 8 5
3 3 6 9 6

The first example simply gives us the mean of A, B, and C (not row wise). The second example gave us what we wanted due to vectorization (Read more on vectorization in references listed here).

For the first example to work as expected, we can first group by row using rowwise() and then use mutate().

df %>% rowwise() %>% mutate(D= mean(c(A,B,C)))
# A tibble: 3 × 4
# Rowwise: 
      A     B     C     D
  <dbl> <dbl> <dbl> <dbl>
1     1     4     7     4
2     2     5     8     5
3     3     6     9     6

See more uses of rowwise() operations here.

What's next?

Now that you know the basics of working with R and the key operations to wrangle your data, it is time to learn how to visualize your data. Part 3 of this course will introduce data visualization with ggplot2. Stay tuned for upcoming course dates.

Acknowledgments

Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Chapter 13, Relational Data, from R for Data Science and Suzan Baert's dplyr tutorials.