Joining and Transforming Data with dplyr
Objectives
Today we will continue to wrangle data using the tidyverse package, dplyr
. We will learn:
- how to join data frames using
dplyr
- how to transform and create new variables using
mutate()
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Loading Tidyverse
In this lesson, we are continuing with the package dplyr
. We do not need to load the dplyr
package separately, as it is a core tidyverse
package. Again, if you need to install and load only dplyr
, use install.packages("dplyr")
and library(dplyr)
.
Load the package:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the data
Let's load in some data to work with. In this lesson, we will continue to use sample metadata, raw count data, and differential expression results derived from the airway RNA-Seq project.
Get the sample metadata:
#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")
Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Get the raw counts:
#raw count data
acount<-read_csv("./data/airway_rawcount.csv") %>%
dplyr::rename("Feature" = "...1")
New names:
Rows: 64102 Columns: 9
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
SRR1039...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Get the differential expression results:
#differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")
Rows: 15926 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining data frames
Any given project will often include multiple sets of data from different sources. These related data are generally stored across multiple data frames. In such cases, while each data frame likely contains different types of data, an identifier column or key (e.g., "sampleID") can be used to unite or combine aspects of the data, which is useful depending on your analysis goal(s).
There are a series of functions from dplyr
devoted to the purpose of joining data frames. There are two types of joins: mutating joins and filtering joins.
Mutating joins
Imagine we have two data frames x
and y
. A mutating join will keep all columns from x
and y
by adding columns from y
to x
.
left_join()
- Output contains all rows from x
return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8)
Animation from Tidyexplain, Garrick Aden-Buie
right_join()
- Output contains all rows from y
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8).
Animation from Tidyexplain, Garrick Aden-Buie
inner_join()
- Output contains matched rows from x
return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. --- R documentation, dplyr (version 0.7.8)
Unmatched values from x
and unmatched values from y
will be dropped. So use caution, as it is easy to lose observations with an inner join.
Animation from Tidyexplain, Garrick Aden-Buie
full_join()
- Output contains all rows from x
and y
return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. --- R documentation, dplyr (version 0.7.8).
Animation from Tidyexplain, Garrick Aden-Buie
Note
The R documentation for dplyr was updated with dplyr v1.0.9. However, these descriptions still stand and are clearer (in my opinion) than the new documentation.
The most common type of join is the left_join()
. Let's see this in action:
#reshape acount
acount_smeta<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
values_to= "Count") %>% left_join(smeta, by=c("Sample"="Run"))
acount_smeta
# A tibble: 512,816 × 11
Feature Sample Count SampleName cell dex albut avgLength Experiment
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 ENSG000000000… SRR10… 679 GSM1275862 N613… untrt untrt 126 SRX384345
2 ENSG000000000… SRR10… 448 GSM1275863 N613… trt untrt 126 SRX384346
3 ENSG000000000… SRR10… 873 GSM1275866 N052… untrt untrt 126 SRX384349
4 ENSG000000000… SRR10… 408 GSM1275867 N052… trt untrt 87 SRX384350
5 ENSG000000000… SRR10… 1138 GSM1275870 N080… untrt untrt 120 SRX384353
6 ENSG000000000… SRR10… 1047 GSM1275871 N080… trt untrt 126 SRX384354
7 ENSG000000000… SRR10… 770 GSM1275874 N061… untrt untrt 101 SRX384357
8 ENSG000000000… SRR10… 572 GSM1275875 N061… trt untrt 98 SRX384358
9 ENSG000000000… SRR10… 0 GSM1275862 N613… untrt untrt 126 SRX384345
10 ENSG000000000… SRR10… 0 GSM1275863 N613… trt untrt 126 SRX384346
# ℹ 512,806 more rows
# ℹ 2 more variables: Sample.y <chr>, BioSample <chr>
Notice the use of by
in left_join
. The argument by
requires the column or columns that we want to join by. If the column we want to join by has a different name, we can use the notation above, which says to match Sample
from acount
to Run
from smeta
.
Filtering joins
Filtering joins result in filtered x
data based on matching or non-matching with y
. These joins do not add columns from y
to x
.
semi_join()
return all rows from x where there are matching values in y, keeping just columns from x. --- R documentation, dplyr (version 0.7.8)
Animation from Tidyexplain, Garrick Aden-Buie
anti_join()
return all rows from x where there are not matching values in y, keeping just columns from x. --- R documentation, dplyr (version 0.7.8)
Animation from Tidyexplain, Garrick Aden-Buie
Let's see a brief example of semi-join:
#reshape acount
smeta_f<-smeta %>% filter(Run %in% c("SRR1039512","SRR1039508"))
acount_L<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
values_to= "Count")
semi_join(acount_L,smeta_f, by=c("Sample"="Run"))
# A tibble: 128,204 × 3
Feature Sample Count
<chr> <chr> <dbl>
1 ENSG00000000003 SRR1039508 679
2 ENSG00000000003 SRR1039512 873
3 ENSG00000000005 SRR1039508 0
4 ENSG00000000005 SRR1039512 0
5 ENSG00000000419 SRR1039508 467
6 ENSG00000000419 SRR1039512 621
7 ENSG00000000457 SRR1039508 260
8 ENSG00000000457 SRR1039512 263
9 ENSG00000000460 SRR1039508 60
10 ENSG00000000460 SRR1039512 40
# ℹ 128,194 more rows
In this case, we could have used filter. However, it is easier to use a filtering join if we know we want to save elements from another table. This saves us from having to determine the filtering criteria for use with filter()
.
Transforming variables
Data wrangling often involves transforming one variable to another. For example, we may be interested in log transforming a variable or adding two variables to create a third. In dplyr
this can be done with mutate()
. mutate()
allows us to create a new variable from existing variables.
mutate()
mutate() creates new columns that are functions of existing variables. It can also modify (if the name is the same as an existing column) and delete columns (by setting their value to NULL). --- dplyr.tidyverse.org
Let's create a column in our original differential expression data frame denoting significant transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater than or equal to 2).
dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)
head(dexp_sigtrnsc$Significant)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
This creates a column named Significant
that contains TRUE values where the expression above was true (meaning significant in this case) and FALSE where the expression was FALSE.
.keep
You can control which columns from the data are included in your output using .keep
.
- "all"
- default - keep all columns.
- "used"
- keeps the transformed columns and new columns.
- "unused"
- keeps only unused column and new columns.
- "none"
- keeps the new columns and grouping variables.
Recoding variables based on values
dplyr
offers functions for recoding variables: if_else()
and case_when()
.
if_else - uses two logical conditions
dexp_sigtrnsc2<- dexp %>%
mutate(Significant= if_else(FDR<0.05 & abs(logFC) >=2,
"Significant", "Not Significant"))
case_when - uses multiple logical conditions. Case_when uses a series of formulas (Syntax: logical_test ~ Value_if_True
).
dexp_sigtrnsc3<- dexp %>%
mutate(Significant=
case_when(FDR<0.05 & logFC >=2 ~ "Up",
FDR<0.05 & logFC <=-2 ~ "Down",
.default = "Not Significant")
)
Let's look at another example. This time let's log transform our FDR corrected p-values.
dexp %>% mutate(logFDR = log10(FDR), .keep="none")
# A tibble: 15,926 × 1
logFDR
<dbl>
1 -2.55
2 -1.11
3 -0.0735
4 -0.166
5 -2.42
6 -1.73
7 -0.100
8 -2.90
9 -0.320
10 -0.158
# ℹ 15,916 more rows
Here, .keep="none"
resulted in retaining only a single column ("logFDR").
Mutating several variables at once
What if we want to transform all of our counts spread across multiple columns in acount
using scale()
, which applies a z-score transformation? In this case we use across()
within mutate()
, which has replaced the scoped verbs (mutate_if
,mutate_at
, and mutate_all
).
Let's see this in action.
acount %>% mutate(across(where(is.numeric),scale))
# A tibble: 64,102 × 9
Feature SRR1039508[,1] SRR1039509[,1] SRR1039512[,1] SRR1039513[,1]
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 0.103 0.0527 0.0991 0.0643
2 ENSG00000000005 -0.0929 -0.100 -0.0821 -0.0887
3 ENSG00000000419 0.0418 0.0756 0.0468 0.0482
4 ENSG00000000457 -0.0179 -0.0281 -0.0275 -0.0272
5 ENSG00000000460 -0.0756 -0.0814 -0.0738 -0.0756
6 ENSG00000000938 -0.0929 -0.100 -0.0817 -0.0887
7 ENSG00000000971 0.845 1.16 1.20 1.51
8 ENSG00000001036 0.321 0.262 0.278 0.242
9 ENSG00000001084 0.0568 0.0295 0.0414 0.0962
10 ENSG00000001167 0.0208 -0.0196 0.0142 -0.0231
# ℹ 64,092 more rows
# ℹ 4 more variables: SRR1039516 <dbl[,1]>, SRR1039517 <dbl[,1]>,
# SRR1039520 <dbl[,1]>, SRR1039521 <dbl[,1]>
For further information on across
, check out this great tutorial here.
Coercing variables with mutate
Mutate can also be used to coerce variables. Again, we need to use across()
and where()
.
#convert character vectors to factors
ex_coerce<-acount_smeta %>% mutate(across(where(is.character),as.factor))
Using rowwise()
and mutate()
mutate()
works across columns, and it is not as easy to apply operations across rows for some functions (e.g., mean
).
What if we wanted a new column that stored the mean of each row in our data frame?
Let's create a small data frame, and use mutate()
to get the mean()
. What happens when we use mean
as is?
df<-data.frame(A=c(1,2,3),B=c(4,5,6),C=c(7,8,9))
df
A B C
1 1 4 7
2 2 5 8
3 3 6 9
df %>% mutate(D= mean(c(A,B,C)))
A B C D
1 1 4 7 5
2 2 5 8 5
3 3 6 9 5
df %>% mutate(D = (A+B+C)/3)
A B C D
1 1 4 7 4
2 2 5 8 5
3 3 6 9 6
The first example simply gives us the mean of A, B, and C (not row wise). The second example gave us what we wanted due to vectorization (Read more on vectorization in references listed here).
For the first example to work as expected, we can first group by row using rowwise()
and then use mutate()
.
df %>% rowwise() %>% mutate(D= mean(c(A,B,C)))
# A tibble: 3 × 4
# Rowwise:
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 4 7 4
2 2 5 8 5
3 3 6 9 6
See more uses of rowwise()
operations here.
What's next?
Now that you know the basics of working with R and the key operations to wrangle your data, it is time to learn how to visualize your data. Part 3 of this course will introduce data visualization with ggplot2
. Stay tuned for upcoming course dates.
Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Chapter 13, Relational Data, from R for Data Science and Suzan Baert's dplyr tutorials.