Summarizing Data with dplyr
Objectives.
-
This lesson will introduce the "split-apply-combine" approach to data analysis and the key players in the
dplyr
package used to implement this type of workflow:group_by()
summarize()
-
We will also learn about other useful dplyr functions including
arrange()
distinct()
To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.
Load Tidyverse
In this lesson, we are continuing with the package dplyr
. We do not need to load the dplyr
package separately, as it is a core tidyverse
package. Again, if you need to install and load only dplyr
, use install.packages("dplyr")
and library(dplyr)
.
Load the package:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the data
Let's load in some data to work with. In this lesson, we will continue to use sample metadata, raw count data, and differential expression results derived from the airway RNA-Seq project.
Get the sample metadata:
#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")
Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Get the raw counts:
#raw count data
acount<-read_csv("./data/airway_rawcount.csv") %>%
dplyr::rename("Feature" = "...1")
New names:
Rows: 64102 Columns: 9
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
SRR1039...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Here we used read_csv
and rename
to load the raw count data. Remember, rename
allows us to rename any column without selection.
Get the differential expression results:
#differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")
Rows: 15926 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Group_by and summarize
There is an approach to data analysis known as "split-apply-combine", in which the data are split into smaller components, some type of analysis is applied to each component, and the results are combined. The dplyr
functions including group_by()
and summarize()
are key players in this type of workflow.
Before diving into this further, let's create some more interesting data to work with by merging our count matrix with our sample metadata.
acount_smeta <- acount %>%
pivot_longer(where(is.numeric), names_to ="Sample",
values_to= "Count") %>% #reshape the data
left_join(smeta, by=c("Sample"="Run")) #join with meta data
acount_smeta
# A tibble: 512,816 × 11
Feature Sample Count SampleName cell dex albut avgLength Experiment
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl> <chr>
1 ENSG000000000… SRR10… 679 GSM1275862 N613… untrt untrt 126 SRX384345
2 ENSG000000000… SRR10… 448 GSM1275863 N613… trt untrt 126 SRX384346
3 ENSG000000000… SRR10… 873 GSM1275866 N052… untrt untrt 126 SRX384349
4 ENSG000000000… SRR10… 408 GSM1275867 N052… trt untrt 87 SRX384350
5 ENSG000000000… SRR10… 1138 GSM1275870 N080… untrt untrt 120 SRX384353
6 ENSG000000000… SRR10… 1047 GSM1275871 N080… trt untrt 126 SRX384354
7 ENSG000000000… SRR10… 770 GSM1275874 N061… untrt untrt 101 SRX384357
8 ENSG000000000… SRR10… 572 GSM1275875 N061… trt untrt 98 SRX384358
9 ENSG000000000… SRR10… 0 GSM1275862 N613… untrt untrt 126 SRX384345
10 ENSG000000000… SRR10… 0 GSM1275863 N613… trt untrt 126 SRX384346
# ℹ 512,806 more rows
# ℹ 2 more variables: Sample.y <chr>, BioSample <chr>
left_join()
left_join()
is a mutating join function from dplyr
. We will learn more about this function in the next lesson. Don't dwell on the code too much here.
Key Functions
Here we are interested in functions that allow us to summarize our data. These include.
group_by()
- group a data frame by a categorical variable so that a given operation can be performed per group / category. The data frame will not be reorganized, but it will have a grouping attribute, which will impact how tidyverse functions interact with it.-
summarize()
- computes summary statistics (1 or more) in a data frame. This function creates a new data frame, returning one row for each combination of grouping variables. If there are no grouping variables, the output will have a single row summarizing all observations in the input. See?summarize
.The syntax:
summarize(new_column = operations_on_existing_columns)
where
new_column
is the name of the new column to appear in the resulting summary table, andoperations_on_existing_columns
is where we apply summary functions to an existing column to create what will go innew_column
. This should return a single value. To return more than one value per group, see?reframe()
.summarize
may include multiplenew_column = operations_on_existing_columns
statements, with each statement separated by,
. We will see a similar syntax withmutate
. -
count()
- computes groupwise counts. This does not requiregroup_by
. ungroup()
- removes the grouping criteria set bygroup_by()
. This is useful for performing additional operations that you do not want applied by group.
.by
summarize()
can provide results by group without group_by
using the .by
argument.
For example,
Let's compute the median raw counts for each gene by treatment.
#Call the data
medcount<- acount_smeta %>%
# group_by dex and Feature (Feature nested within treatment)
group_by(dex,Feature) %>%
#for each group calculate the median value of raw counts
summarize(median_counts=median(Count))
`summarise()` has grouped output by 'dex'. You can override using the `.groups`
argument.
medcount
# A tibble: 128,204 × 3
# Groups: dex [2]
dex Feature median_counts
<chr> <chr> <dbl>
1 trt ENSG00000000003 510
2 trt ENSG00000000005 0
3 trt ENSG00000000419 512.
4 trt ENSG00000000457 220
5 trt ENSG00000000460 57.5
6 trt ENSG00000000938 0
7 trt ENSG00000000971 6124.
8 trt ENSG00000001036 1086.
9 trt ENSG00000001084 598.
10 trt ENSG00000001167 252.
# ℹ 128,194 more rows
Using summarize()
, by default the output is grouped by every grouping column except the last (e.g., here, no longer grouped by "Feature"), which is helpful for performing additional operations at higher levels of grouping (e.g., "dex").
Now, let's obtain the top five genes with the greatest median raw counts by treatment using slice_max
. Remember, medcount
has grouped output by dex
. This grouping is maintained unless ungroup
was applied.
medcount %>%
slice_max(n=5, order_by=median_counts) #notice use of slice_max
# A tibble: 10 × 3
# Groups: dex [2]
dex Feature median_counts
<chr> <chr> <dbl>
1 trt ENSG00000115414 322164
2 trt ENSG00000011465 263587
3 trt ENSG00000156508 239676.
4 trt ENSG00000198804 230992
5 trt ENSG00000116260 187288.
6 untrt ENSG00000011465 336076
7 untrt ENSG00000115414 302956.
8 untrt ENSG00000156508 294097
9 untrt ENSG00000164692 249846
10 untrt ENSG00000198804 249206
Often we are interested in knowing more about sample sizes and including that information in summary output. For example, how many rows per sample are in the acount_smeta
data frame? We can use count()
or summarize()
paired with other functions (e.g., n()
,tally()
).
acount_smeta %>%
count(dex, Sample)
# A tibble: 8 × 3
dex Sample n
<chr> <chr> <int>
1 trt SRR1039509 64102
2 trt SRR1039513 64102
3 trt SRR1039517 64102
4 trt SRR1039521 64102
5 untrt SRR1039508 64102
6 untrt SRR1039512 64102
7 untrt SRR1039516 64102
8 untrt SRR1039520 64102
acount_smeta %>%
group_by(dex, Sample) %>%
summarize(n=n()) #there are multiple functions that can be used here
`summarise()` has grouped output by 'dex'. You can override using the `.groups`
argument.
# A tibble: 8 × 3
# Groups: dex [2]
dex Sample n
<chr> <chr> <int>
1 trt SRR1039509 64102
2 trt SRR1039513 64102
3 trt SRR1039517 64102
4 trt SRR1039521 64102
5 untrt SRR1039508 64102
6 untrt SRR1039512 64102
7 untrt SRR1039516 64102
8 untrt SRR1039520 64102
This output makes sense, as there are 64,102 unique Ensembl ids (See n_distinct(acount_smeta$Feature)
).
na.rm
By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org
Let's see this in practice
#This is used to get the same result
#with a pseudorandom number generator like sample()
set.seed(138)
#make mock data frame
fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
counts=sample(1:500,12,TRUE)) %>%
#Assign NAs if the value is less than 100. This is arbitrary.
mutate(counts=replace(counts, counts<100, NA))
#let's view
fun_df
genes counts
1 A NA
2 A 214
3 A NA
4 B 352
5 B 256
6 B NA
7 C 400
8 C 381
9 C 250
10 D 278
11 D NA
12 D 169
#Summarize mean, median, min, and max
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts),
median_count = median(counts),
min_count = min(counts),
max_count = max(counts))
# A tibble: 4 × 5
genes mean_count median_count min_count max_count
<chr> <dbl> <int> <int> <int>
1 A NA NA NA NA
2 B NA NA NA NA
3 C 344. 381 250 400
4 D NA NA NA NA
#use na.rm
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts, na.rm=TRUE),
median_count = median(counts, na.rm=TRUE),
min_count = min(counts, na.rm=TRUE),
max_count = max(counts, na.rm=TRUE))
# A tibble: 4 × 5
genes mean_count median_count min_count max_count
<chr> <dbl> <dbl> <int> <int>
1 A 214 214 214 214
2 B 304 304 256 352
3 C 344. 381 250 400
4 D 224. 224. 169 278
Lastly, similar to mutate, we can summarize across multiple columns at once using across()
. We will focus more heavily on across()
next lesson. Let's get the mean of logFC
and logCPM
.
dexp %>%
summarize(across(starts_with("Log"), mean))
# A tibble: 1 × 2
logFC logCPM
<dbl> <dbl>
1 -0.00859 3.71
Additional Examples
Let's use penguins
for additional practice.
The penguins data contains
Data on adult penguins covering three species found on three islands in the Palmer Archipelago, Antarctica, including their size (flipper length, body mass, bill dimensions), and sex. - penguins docs
Let's summarize these data by finding the mean penguin body mass by penguin species. Remember to include na.rm = TRUE
to exclude missing values.
penguins %>%
group_by(species) %>%
summarize(mean_mass = mean(body_mass, na.rm = TRUE))
# A tibble: 3 × 2
species mean_mass
<fct> <dbl>
1 Adelie 3701.
2 Chinstrap 3733.
3 Gentoo 5076.
What if we also want to include the standard deviation by species?
penguins %>%
group_by(species) %>%
summarize(mean_mass = mean(body_mass, na.rm = TRUE),
sd_mass = sd(body_mass, na.rm = TRUE))
# A tibble: 3 × 3
species mean_mass sd_mass
<fct> <dbl> <dbl>
1 Adelie 3701. 459.
2 Chinstrap 3733. 384.
3 Gentoo 5076. 504.
Looking for more functions to use with summarize? Here are some useful summary functions. However, the use of summarize()
is not limited to these suggestions.
Reordering rows with arrange()
In the tidyverse, reordering rows is largely done by arrange()
. Arrange will reorder a variable from smallest to largest, or in the case of characters, alphabetically, from a to z. This is in ascending order.
arrange()
will break ties using additionally supplied columns for ordering. It will also mostly ignore grouping. To order by group, use .by_group = TRUE
.
Let's arrange the genes in dexp
.
dexp %>% arrange(transcript)
# A tibble: 15,926 × 10
feature albut transcript ref_genome .abundant logFC logCPM F PValue
<chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
1 ENSG0000… untrt A1BG-AS1 hg38 TRUE 0.513 1.02 9.22 1.45e-2
2 ENSG0000… untrt A2M hg38 TRUE 0.528 10.1 3.57 9.24e-2
3 ENSG0000… untrt A2M-AS1 hg38 TRUE -0.337 0.308 2.76 1.32e-1
4 ENSG0000… untrt A4GALT hg38 TRUE 0.519 5.89 24.5 8.54e-4
5 ENSG0000… untrt AAAS hg38 TRUE -0.0254 5.12 0.134 7.23e-1
6 ENSG0000… untrt AACS hg38 TRUE -0.191 4.06 5.00 5.30e-2
7 ENSG0000… untrt AADAT hg38 TRUE -0.642 2.67 16.9 2.76e-3
8 ENSG0000… untrt AAGAB hg38 TRUE -0.165 5.08 5.82 3.98e-2
9 ENSG0000… untrt AAK1 hg38 TRUE -0.188 3.82 2.29 1.66e-1
10 ENSG0000… untrt AAMDC hg38 TRUE 0.447 2.42 8.52 1.75e-2
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>
Let's arrange logFC
from smallest to largest.
dexp %>% arrange(logFC)
# A tibble: 15,926 × 10
feature albut transcript ref_genome .abundant logFC logCPM F PValue
<chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
1 ENSG000002… untrt LINC00906 hg38 TRUE -4.59 0.473 139. 1.13e-6
2 ENSG000001… untrt LRRTM2 hg38 TRUE -4.00 1.24 127. 1.64e-6
3 ENSG000001… untrt VASH2 hg38 TRUE -3.95 0.0171 152. 7.77e-7
4 ENSG000001… untrt VCAM1 hg38 TRUE -3.66 4.60 565. 2.87e-9
5 ENSG000001… untrt SLC14A1 hg38 TRUE -3.63 1.38 42.3 1.25e-4
6 ENSG000002… untrt FER1L6 hg38 TRUE -3.13 3.53 238. 1.18e-7
7 ENSG000001… untrt SMTNL2 hg38 TRUE -3.12 1.46 134. 1.29e-6
8 ENSG000001… untrt WNT2 hg38 TRUE -3.07 3.99 521. 4.09e-9
9 ENSG000001… untrt EGR2 hg38 TRUE -3.04 -0.141 96.1 5.11e-6
10 ENSG000001… untrt SLITRK6 hg38 TRUE -3.03 1.16 130. 1.46e-6
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>
What if we want to arrange from largest to smallest (in descending order)? We can use desc()
.
dexp %>% arrange(desc(logFC))
# A tibble: 15,926 × 10
feature albut transcript ref_genome .abundant logFC logCPM F PValue
<chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000… untrt ALOX15B hg38 TRUE 10.1 1.62 554. 5.92e- 7
2 ENSG00000… untrt ZBTB16 hg38 TRUE 7.15 4.15 1429. 5.11e-11
3 ENSG00000… untrt <NA> <NA> TRUE 6.17 1.35 380. 1.58e- 8
4 ENSG00000… untrt ANGPTL7 hg38 TRUE 5.68 3.51 483. 5.66e- 9
5 ENSG00000… untrt STEAP4 hg38 TRUE 5.22 3.66 445. 8.07e- 9
6 ENSG00000… untrt PRODH hg38 TRUE 4.85 1.29 253. 9.10e- 8
7 ENSG00000… untrt FAM107A hg38 TRUE 4.74 2.78 656. 1.51e- 9
8 ENSG00000… untrt LGI3 hg38 TRUE 4.68 -0.0503 106. 3.45e- 6
9 ENSG00000… untrt SPARCL1 hg38 TRUE 4.56 5.53 721. 1.00e- 9
10 ENSG00000… untrt KLF15 hg38 TRUE 4.48 4.69 479. 5.86e- 9
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>
Note
If you include more than one column to order by descending values, each column needs to be wrapped with desc()
.
Additional useful functions
distinct()
- return distinct combinations of values
acount_smeta %>% distinct(Sample)
# A tibble: 8 × 1
Sample
<chr>
1 SRR1039508
2 SRR1039509
3 SRR1039512
4 SRR1039513
5 SRR1039516
6 SRR1039517
7 SRR1039520
8 SRR1039521
n_distinct()
- "counts the number of unique/distinct combinations in a set of one or more vectors."
Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Suzan Baert's dplyr tutorials and Allison Horst's tutorial "Wrangling penguins: some basic data wrangling in R with dplyr".