dplyr
: joining, tranforming, and summarizing data frames
Objectives
Today we will continue to wrangle data using the tidyverse package, dplyr
. We will learn:
- how to join data frames using
dplyr
- how to transform and create new variables using
mutate()
- how to summarize variables using
group_by()
andsummarize()
Loading dplyr
In this lesson, we are continuing with the package dplyr
. We do not need to load the dplyr
package separately, as it is a core tidyverse
package. Again, if you need to install and load only dplyr
, use install.packages("dplyr")
and library(dplyr)
.
Load the package:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Data
Let's load in some data to work with. In this lesson, we will continue to use sample metadata, raw count data, and differential expression results from the airway
RNA-Seq project.
Load the data:
#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smeta
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
#raw count data
acount<-read_csv("./data/airway_rawcount.csv") %>%
dplyr::rename("Feature" = "...1")
## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
acount
## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 679 448 873 408 1138 1047
## 2 ENSG000000… 0 0 0 0 0 0
## 3 ENSG000000… 467 515 621 365 587 799
## 4 ENSG000000… 260 211 263 164 245 331
## 5 ENSG000000… 60 55 40 35 78 63
## 6 ENSG000000… 0 0 2 0 1 0
## 7 ENSG000000… 3251 3679 6177 4252 6721 11027
## 8 ENSG000000… 1433 1062 1733 881 1424 1439
## 9 ENSG000000… 519 380 595 493 820 714
## 10 ENSG000000… 394 236 464 175 658 584
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
#differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")
## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dexp
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000… untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 3.12e-4
## 2 ENSG000… untrt DPM1 hg38 TRUE 0.198 4.61 6.90 2.81e-2
## 3 ENSG000… untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 7.63e-1
## 4 ENSG000… untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 5.55e-1
## 5 ENSG000… untrt CFH hg38 TRUE 0.417 8.09 29.3 4.63e-4
## 6 ENSG000… untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 4.05e-3
## 7 ENSG000… untrt GCLC hg38 TRUE -0.0581 4.84 0.167 6.92e-1
## 8 ENSG000… untrt NFYA hg38 TRUE -0.509 4.13 44.9 1.00e-4
## 9 ENSG000… untrt STPG1 hg38 TRUE -0.136 3.12 1.04 3.35e-1
## 10 ENSG000… untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 5.69e-1
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
Joining data frames
Often related data is stored across multiple data frames. In such cases, while each data frame likely contains different types of data, an identifier column or key (e.g., sampleID) can be used to unite or combine aspects of the data.
There are a series of functions from dplyr
devoted to the purpose of joining data frames. There are two types of joins: mutating joins and filtering joins.
Mutating joins
Imagine we have two data frames x
and y
. A mutating join will keep all columns from x
and y
by adding columns from y
to x
.
left_join()
- Output contains all rows from x
return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8)
right_join()
- Output contains all rows from y
return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned. --- R documentation, dplyr (version 0.7.8)
inner_join()
- Output contains matched rows from x
return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned. --- R documentation, dplyr (version 0.7.8)
Unmatched values from x
will be dropped.
full_join()
- Output contains all rows from x
and y
return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing. --- R documentation, dplyr (version 0.7.8)
Note
The R documentation for dplyr has been updated with dplyr v1.0.9. However, these descriptions still stand and are clearer (in my opinion) than the new documentation.
The most common type of join is the left_join()
. Let's see this in action:
#reshape acount
acount_smeta<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
values_to= "Count") %>% left_join(smeta, by=c("Sample"="Run"))
acount_smeta
## # A tibble: 512,816 × 11
## Feature Sample Count SampleName cell dex albut avgLength Experiment
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 ENSG000000000… SRR10… 679 GSM1275862 N613… untrt untrt 126 SRX384345
## 2 ENSG000000000… SRR10… 448 GSM1275863 N613… trt untrt 126 SRX384346
## 3 ENSG000000000… SRR10… 873 GSM1275866 N052… untrt untrt 126 SRX384349
## 4 ENSG000000000… SRR10… 408 GSM1275867 N052… trt untrt 87 SRX384350
## 5 ENSG000000000… SRR10… 1138 GSM1275870 N080… untrt untrt 120 SRX384353
## 6 ENSG000000000… SRR10… 1047 GSM1275871 N080… trt untrt 126 SRX384354
## 7 ENSG000000000… SRR10… 770 GSM1275874 N061… untrt untrt 101 SRX384357
## 8 ENSG000000000… SRR10… 572 GSM1275875 N061… trt untrt 98 SRX384358
## 9 ENSG000000000… SRR10… 0 GSM1275862 N613… untrt untrt 126 SRX384345
## 10 ENSG000000000… SRR10… 0 GSM1275863 N613… trt untrt 126 SRX384346
## # ℹ 512,806 more rows
## # ℹ 2 more variables: Sample.y <chr>, BioSample <chr>
Notice the use of by
in left_join
. The argument by
requires the column or columns that we want to join by. If the column we want to join by has a different name, we can use the notation above, which says to match Sample
from acount
to Run
from smeta
.
Filtering joins
Filtering joins result in filtered x
data based on matching or non-matching with y
. These joins do not add columns from y
to x
.
semi_join()
return all rows from x where there are matching values in y, keeping just columns from x.
A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x. --- R documentation, dplyr (version 0.7.8)
anti_join()
return all rows from x where there are not matching values in y, keeping just columns from x. --- R documentation, dplyr (version 0.7.8)
Let's see a brief example of semi-join:
#reshape acount
smeta_f<-smeta %>% filter(Run %in% c("SRR1039512","SRR1039508"))
acount_L<-acount %>% pivot_longer(where(is.numeric),names_to ="Sample",
values_to= "Count")
semi_join(acount_L,smeta_f, by=c("Sample"="Run"))
Note
This example does not use the %>%
. This was simply to demonstrate the different "strategies" that can be used to set up and run code.
## # A tibble: 128,204 × 3
## Feature Sample Count
## <chr> <chr> <dbl>
## 1 ENSG00000000003 SRR1039508 679
## 2 ENSG00000000003 SRR1039512 873
## 3 ENSG00000000005 SRR1039508 0
## 4 ENSG00000000005 SRR1039512 0
## 5 ENSG00000000419 SRR1039508 467
## 6 ENSG00000000419 SRR1039512 621
## 7 ENSG00000000457 SRR1039508 260
## 8 ENSG00000000457 SRR1039512 263
## 9 ENSG00000000460 SRR1039508 60
## 10 ENSG00000000460 SRR1039512 40
## # ℹ 128,194 more rows
semi-join
.
Transforming variables
Data wrangling often involves transforming one variable to another. For example, we may be interested in log transforming a variable or adding two variables to create a third. In dplyr
this can be done with mutate()
and transmute()
. These functions allow us to create a new variable from existing variables.
mutate()
mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name. --- dplyr.tidyverse.org
Let's create a column in our original differential expression data frame denoting significant transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater than or equal to 2).
dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)
head(dexp_sigtrnsc$Significant)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Significant
that contains TRUE values where the expression above was true (meaning significant in this case) and FALSE where the expression was FALSE.
Let's look at another example. This time let's log transform our FDR corrected p-values.
exmut<-dexp %>% mutate(logFDR = log10(FDR))
exmut["logFDR"]
## # A tibble: 15,926 × 1
## logFDR
## <dbl>
## 1 -2.55
## 2 -1.11
## 3 -0.0735
## 4 -0.166
## 5 -2.42
## 6 -1.73
## 7 -0.100
## 8 -2.90
## 9 -0.320
## 10 -0.158
## # ℹ 15,916 more rows
Mutating several variables at once
What if we want to transform all of our counts spread across multiple columns in acount
using scale()
, which applies a z-score transformation? In this case we use across()
within mutate()
, which has replaced the scoped verbs (mutate_if
,mutate_at
, and mutate_all
).
Z-score
A z score tells us the number of standard deviations from the mean of a given value. This can be achieved by scale(x, center = TRUE, scale = TRUE)
.
Let's see this in action.
acount %>% mutate(across(where(is.numeric),scale))
## # A tibble: 64,102 × 9
## Feature SRR1039508[,1] SRR1039509[,1] SRR1039512[,1] SRR1039513[,1]
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 0.103 0.0527 0.0991 0.0643
## 2 ENSG00000000005 -0.0929 -0.100 -0.0821 -0.0887
## 3 ENSG00000000419 0.0418 0.0756 0.0468 0.0482
## 4 ENSG00000000457 -0.0179 -0.0281 -0.0275 -0.0272
## 5 ENSG00000000460 -0.0756 -0.0814 -0.0738 -0.0756
## 6 ENSG00000000938 -0.0929 -0.100 -0.0817 -0.0887
## 7 ENSG00000000971 0.845 1.16 1.20 1.51
## 8 ENSG00000001036 0.321 0.262 0.278 0.242
## 9 ENSG00000001084 0.0568 0.0295 0.0414 0.0962
## 10 ENSG00000001167 0.0208 -0.0196 0.0142 -0.0231
## # ℹ 64,092 more rows
## # ℹ 4 more variables: SRR1039516 <dbl[,1]>, SRR1039517 <dbl[,1]>,
## # SRR1039520 <dbl[,1]>, SRR1039521 <dbl[,1]>
For further information on across
, check out this great tutorial here.
Coercing variables with mutate
Mutate can also be used to coerce variables. Again, we need to use across()
and where()
. Let's also check out the difference between mutate()
and transmute()
.
#compare transmute to mutate
ex_coerce<-acount_smeta %>% transmute(across(where(is.character),as.factor))
glimpse(ex_coerce)
## Rows: 512,816
## Columns: 9
## $ Feature <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG0000…
## $ Sample <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,…
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM1275870,…
## $ cell <fct> N61311, N61311, N052611, N052611, N080611, N080611, N061011…
## $ dex <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, trt,…
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SRX3…
## $ Sample.y <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SRS5…
## $ BioSample <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, SAM…
#mutate
ex_coerce<-acount_smeta %>% mutate(across(where(is.character),as.factor))
glimpse(ex_coerce)
## Rows: 512,816
## Columns: 11
## $ Feature <fct> ENSG00000000003, ENSG00000000003, ENSG00000000003, ENSG0000…
## $ Sample <fct> SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,…
## $ Count <dbl> 679, 448, 873, 408, 1138, 1047, 770, 572, 0, 0, 0, 0, 0, 0,…
## $ SampleName <fct> GSM1275862, GSM1275863, GSM1275866, GSM1275867, GSM1275870,…
## $ cell <fct> N61311, N61311, N052611, N052611, N080611, N080611, N061011…
## $ dex <fct> untrt, trt, untrt, trt, untrt, trt, untrt, trt, untrt, trt,…
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
## $ avgLength <dbl> 126, 126, 126, 87, 120, 126, 101, 98, 126, 126, 126, 87, 12…
## $ Experiment <fct> SRX384345, SRX384346, SRX384349, SRX384350, SRX384353, SRX3…
## $ Sample.y <fct> SRS508568, SRS508567, SRS508571, SRS508572, SRS508575, SRS5…
## $ BioSample <fct> SAMN02422669, SAMN02422675, SAMN02422678, SAMN02422670, SAM…
Notice that transmute
dropped all columns that were not mutated.
Using rowwise()
and mutate()
What if we wanted a new column that stored the mean of each row in our data frame?
Let's create a small data frame, and use mutate()
to get the mean()
. What happens when we use mean
as is?
df<-data.frame(A=c(1,2,3),B=c(4,5,6),C=c(7,8,9))
df
## A B C
## 1 1 4 7
## 2 2 5 8
## 3 3 6 9
df %>% mutate(D= mean(c(A,B,C)))
## A B C D
## 1 1 4 7 5
## 2 2 5 8 5
## 3 3 6 9 5
df %>% mutate(D = (A+B+C)/3)
## A B C D
## 1 1 4 7 4
## 2 2 5 8 5
## 3 3 6 9 6
rowwise()
and then use mutate()
.
df %>% rowwise() %>% mutate(D= mean(c(A,B,C)))
## # A tibble: 3 × 4
## # Rowwise:
## A B C D
## <dbl> <dbl> <dbl> <dbl>
## 1 1 4 7 4
## 2 2 5 8 5
## 3 3 6 9 6
See more uses of rowwise()
operations here.
Group_by and summarize
There is an approach to data analysis known as "split-apply-combine", in which the data is split into smaller components, some type of analysis is applied to each component, and the results are combined. The dplyr
functions including group_by()
and summarize()
are key players in this type of workflow.
group_by()
allows us to group a data frame by a categorical variable so that a given operation can be performed per group / category.
Let's get the top five genes with the greatest median raw counts by treatment.
#Call the data
acount_smeta %>%
# group_by dex and Feature (Feature nested within treatment)
group_by(dex,Feature) %>%
#for each group calculate the median value of raw counts
summarize(median_counts=median(Count)) %>%
#arrange in descending order
arrange(desc(median_counts),.by_group = TRUE) %>%
#return the top 5 values for each group
slice_head(n=5)
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
## # A tibble: 10 × 3
## # Groups: dex [2]
## dex Feature median_counts
## <chr> <chr> <dbl>
## 1 trt ENSG00000115414 322164
## 2 trt ENSG00000011465 263587
## 3 trt ENSG00000156508 239676.
## 4 trt ENSG00000198804 230992
## 5 trt ENSG00000116260 187288.
## 6 untrt ENSG00000011465 336076
## 7 untrt ENSG00000115414 302956.
## 8 untrt ENSG00000156508 294097
## 9 untrt ENSG00000164692 249846
## 10 untrt ENSG00000198804 249206
#can skip arrange and use slice_max
acount_smeta %>%
group_by(dex,Feature) %>%
summarize(median_counts=median(Count)) %>%
slice_max(n=5, order_by=median_counts) #notice use of slice_max
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
## # A tibble: 10 × 3
## # Groups: dex [2]
## dex Feature median_counts
## <chr> <chr> <dbl>
## 1 trt ENSG00000115414 322164
## 2 trt ENSG00000011465 263587
## 3 trt ENSG00000156508 239676.
## 4 trt ENSG00000198804 230992
## 5 trt ENSG00000116260 187288.
## 6 untrt ENSG00000011465 336076
## 7 untrt ENSG00000115414 302956.
## 8 untrt ENSG00000156508 294097
## 9 untrt ENSG00000164692 249846
## 10 untrt ENSG00000198804 249206
How many rows per sample are in the acount_smeta data frame? We can use count()
or summarize()
paired with other functions (e.g., n()
,tally()
).
acount_smeta %>%
count(dex, Sample)
## # A tibble: 8 × 3
## dex Sample n
## <chr> <chr> <int>
## 1 trt SRR1039509 64102
## 2 trt SRR1039513 64102
## 3 trt SRR1039517 64102
## 4 trt SRR1039521 64102
## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102
acount_smeta %>%
group_by(dex, Sample) %>%
summarize(n=n()) #there are multiple functions that can be used here
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 3
## # Groups: dex [2]
## dex Sample n
## <chr> <chr> <int>
## 1 trt SRR1039509 64102
## 2 trt SRR1039513 64102
## 3 trt SRR1039517 64102
## 4 trt SRR1039521 64102
## 5 untrt SRR1039508 64102
## 6 untrt SRR1039512 64102
## 7 untrt SRR1039516 64102
## 8 untrt SRR1039520 64102
This output makes sense, as there are 64,102 unique Ensembl ids (n_distinct(acount_smeta$Feature)
)
Missing Values
By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org
Let's see this in practice
#This is used to get the same result
#with a pseudorandom number generator like sample()
set.seed(138)
#make mock data frame
fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
counts=sample(1:500,12,TRUE)) %>%
#Assign NAs if the value is less than 100. This is arbitrary.
mutate(counts=replace(counts, counts<100, NA))
#let's view
fun_df
## genes counts
## 1 A NA
## 2 A 214
## 3 A NA
## 4 B 352
## 5 B 256
## 6 B NA
## 7 C 400
## 8 C 381
## 9 C 250
## 10 D 278
## 11 D NA
## 12 D 169
#Summarize mean, median, min, and max
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts),
median_count = median(counts),
min_count = min(counts),
max_count = max(counts))
## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <int> <int> <int>
## 1 A NA NA NA NA
## 2 B NA NA NA NA
## 3 C 344. 381 250 400
## 4 D NA NA NA NA
#use na.rm
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts, na.rm=TRUE),
median_count = median(counts, na.rm=TRUE),
min_count = min(counts, na.rm=TRUE),
max_count = max(counts, na.rm=TRUE))
## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <dbl> <int> <int>
## 1 A 214 214 214 214
## 2 B 304 304 256 352
## 3 C 344. 381 250 400
## 4 D 224. 224. 169 278
Similar to mutate, we can summarize across multiple columns at once using across()
. For example, let's get the mean of logFC
and logCPM
.
dexp %>%
summarize(across(starts_with("Log"), mean))
## # A tibble: 1 × 2
## logFC logCPM
## <dbl> <dbl>
## 1 -0.00859 3.71
Exporting files using the write
functions
We have learned how to import data using the read functions, but how can we export / write out data? We can use a series of write functions. Some examples from readr
include write_csv()
, write_delim()
, write_tsv()
. Some examples from base R include write.csv()
, write.table()
, writeLines()
.
Let's export a tab delimited file containing acount_smeta
.
write_tsv(acount_smeta, "countsANDmeta.txt", quote="none")
Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Chapter 13, Relational Data, from R for Data Science and Suzan Baert's dplyr tutorials.