Summarizing Data with dplyr

Objectives.

This lesson will introduce the "split-apply-combine" approach to data analysis and the key players in the dplyr package used to implement this type of workflow:
- group_by()
- summarize()
We will also learn about other useful dplyr functions including
- arrange()
- distinct()

To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.

Load Tidyverse

In this lesson, we are continuing with the package dplyr. We do not need to load the dplyr package separately, as it is a core tidyverse package. Again, if you need to install and load only dplyr, use install.packages("dplyr") and library(dplyr).

Load the package:

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the data

Let's load in some data to work with. In this lesson, we will continue to use sample metadata, raw count data, and differential expression results derived from the airway RNA-Seq project.

Get the sample metadata:

#sample information  
smeta<-read_delim("./data/airway_sampleinfo.txt")

Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Get the raw counts:

#raw count data
acount<-read_csv("./data/airway_rawcount.csv") %>%
  dplyr::rename("Feature" = "...1")

New names:
Rows: 64102 Columns: 9
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
SRR1039...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`

Here we used read_csv and rename to load the raw count data. Remember, rename allows us to rename any column without selection.

Get the differential expression results:

#differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")

Rows: 15926 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Group_by and summarize

There is an approach to data analysis known as "split-apply-combine", in which the data are split into smaller components, some type of analysis is applied to each component, and the results are combined. The dplyr functions including group_by() and summarize() are key players in this type of workflow.

Before diving into this further, let's create some more interesting data to work with by merging our count matrix with our sample metadata.

acount_smeta <- acount %>% 
  pivot_longer(where(is.numeric), names_to ="Sample", 
               values_to= "Count") %>% #reshape the data
  left_join(smeta, by=c("Sample"="Run")) #join with meta data

acount_smeta

# A tibble: 512,816 × 11
   Feature        Sample Count SampleName cell  dex   albut avgLength Experiment
   <chr>          <chr>  <dbl> <chr>      <chr> <chr> <chr>     <dbl> <chr>     
 1 ENSG000000000… SRR10…   679 GSM1275862 N613… untrt untrt       126 SRX384345 
 2 ENSG000000000… SRR10…   448 GSM1275863 N613… trt   untrt       126 SRX384346 
 3 ENSG000000000… SRR10…   873 GSM1275866 N052… untrt untrt       126 SRX384349 
 4 ENSG000000000… SRR10…   408 GSM1275867 N052… trt   untrt        87 SRX384350 
 5 ENSG000000000… SRR10…  1138 GSM1275870 N080… untrt untrt       120 SRX384353 
 6 ENSG000000000… SRR10…  1047 GSM1275871 N080… trt   untrt       126 SRX384354 
 7 ENSG000000000… SRR10…   770 GSM1275874 N061… untrt untrt       101 SRX384357 
 8 ENSG000000000… SRR10…   572 GSM1275875 N061… trt   untrt        98 SRX384358 
 9 ENSG000000000… SRR10…     0 GSM1275862 N613… untrt untrt       126 SRX384345 
10 ENSG000000000… SRR10…     0 GSM1275863 N613… trt   untrt       126 SRX384346 
# ℹ 512,806 more rows
# ℹ 2 more variables: Sample.y <chr>, BioSample <chr>

left_join()

left_join() is a mutating join function from dplyr. We will learn more about this function in the next lesson. Don't dwell on the code too much here.

Key Functions

Here we are interested in functions that allow us to summarize our data. These include.

group_by() - group a data frame by a categorical variable so that a given operation can be performed per group / category. The data frame will not be reorganized, but it will have a grouping attribute, which will impact how tidyverse functions interact with it.
summarize() - computes summary statistics (1 or more) in a data frame. This function creates a new data frame, returning one row for each combination of grouping variables. If there are no grouping variables, the output will have a single row summarizing all observations in the input. See ?summarize.

The syntax:
summarize(new_column = operations_on_existing_columns)

where new_column is the name of the new column to appear in the resulting summary table, and operations_on_existing_columns is where we apply summary functions to an existing column to create what will go in new_column. This should return a single value. To return more than one value per group, see ?reframe().

summarize may include multiple new_column = operations_on_existing_columns statements, with each statement separated by ,. We will see a similar syntax with mutate.
count() - computes groupwise counts. This does not require group_by.
ungroup() - removes the grouping criteria set by group_by(). This is useful for performing additional operations that you do not want applied by group.

.by

summarize() can provide results by group without group_by using the .by argument.

For example,

Let's compute the median raw counts for each gene by treatment.

#Call the data 
medcount<- acount_smeta %>% 
  # group_by dex and Feature (Feature nested within treatment)
  group_by(dex,Feature) %>%
  #for each group calculate the median value of raw counts
  summarize(median_counts=median(Count))

`summarise()` has grouped output by 'dex'. You can override using the `.groups`
argument.

medcount

# A tibble: 128,204 × 3
# Groups:   dex [2]
   dex   Feature         median_counts
   <chr> <chr>                   <dbl>
 1 trt   ENSG00000000003         510  
 2 trt   ENSG00000000005           0  
 3 trt   ENSG00000000419         512. 
 4 trt   ENSG00000000457         220  
 5 trt   ENSG00000000460          57.5
 6 trt   ENSG00000000938           0  
 7 trt   ENSG00000000971        6124. 
 8 trt   ENSG00000001036        1086. 
 9 trt   ENSG00000001084         598. 
10 trt   ENSG00000001167         252. 
# ℹ 128,194 more rows

Using summarize(), by default the output is grouped by every grouping column except the last (e.g., here, no longer grouped by "Feature"), which is helpful for performing additional operations at higher levels of grouping (e.g., "dex").

Now, let's obtain the top five genes with the greatest median raw counts by treatment using slice_max. Remember, medcount has grouped output by dex. This grouping is maintained unless ungroup was applied.

medcount  %>% 
  slice_max(n=5, order_by=median_counts) #notice use of slice_max

# A tibble: 10 × 3
# Groups:   dex [2]
   dex   Feature         median_counts
   <chr> <chr>                   <dbl>
 1 trt   ENSG00000115414       322164 
 2 trt   ENSG00000011465       263587 
 3 trt   ENSG00000156508       239676.
 4 trt   ENSG00000198804       230992 
 5 trt   ENSG00000116260       187288.
 6 untrt ENSG00000011465       336076 
 7 untrt ENSG00000115414       302956.
 8 untrt ENSG00000156508       294097 
 9 untrt ENSG00000164692       249846 
10 untrt ENSG00000198804       249206

Often we are interested in knowing more about sample sizes and including that information in summary output. For example, how many rows per sample are in the acount_smeta data frame? We can use count() or summarize() paired with other functions (e.g., n(),tally()).

acount_smeta  %>% 
  count(dex, Sample)

# A tibble: 8 × 3
  dex   Sample         n
  <chr> <chr>      <int>
1 trt   SRR1039509 64102
2 trt   SRR1039513 64102
3 trt   SRR1039517 64102
4 trt   SRR1039521 64102
5 untrt SRR1039508 64102
6 untrt SRR1039512 64102
7 untrt SRR1039516 64102
8 untrt SRR1039520 64102

acount_smeta  %>% 
  group_by(dex, Sample) %>% 
  summarize(n=n()) #there are multiple functions that can be used here

`summarise()` has grouped output by 'dex'. You can override using the `.groups`
argument.

# A tibble: 8 × 3
# Groups:   dex [2]
  dex   Sample         n
  <chr> <chr>      <int>
1 trt   SRR1039509 64102
2 trt   SRR1039513 64102
3 trt   SRR1039517 64102
4 trt   SRR1039521 64102
5 untrt SRR1039508 64102
6 untrt SRR1039512 64102
7 untrt SRR1039516 64102
8 untrt SRR1039520 64102

This output makes sense, as there are 64,102 unique Ensembl ids (See n_distinct(acount_smeta$Feature)).

na.rm

By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org

Let's see this in practice

#This is used to get the same result
#with a pseudorandom number generator like sample()
set.seed(138) 

#make mock data frame
fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
                   counts=sample(1:500,12,TRUE)) %>% 
  #Assign NAs if the value is less than 100. This is arbitrary. 
  mutate(counts=replace(counts, counts<100, NA))

#let's view
fun_df

   genes counts
1      A     NA
2      A    214
3      A     NA
4      B    352
5      B    256
6      B     NA
7      C    400
8      C    381
9      C    250
10     D    278
11     D     NA
12     D    169

#Summarize mean, median, min, and max
fun_df %>% 
  group_by(genes) %>%
  summarize(
    mean_count = mean(counts),
    median_count = median(counts),
    min_count = min(counts),
    max_count = max(counts))

# A tibble: 4 × 5
  genes mean_count median_count min_count max_count
  <chr>      <dbl>        <int>     <int>     <int>
1 A            NA            NA        NA        NA
2 B            NA            NA        NA        NA
3 C           344.          381       250       400
4 D            NA            NA        NA        NA

#use na.rm
fun_df %>% 
  group_by(genes) %>%
  summarize(
    mean_count = mean(counts, na.rm=TRUE),
    median_count = median(counts, na.rm=TRUE),
    min_count = min(counts, na.rm=TRUE),
    max_count = max(counts, na.rm=TRUE))

# A tibble: 4 × 5
  genes mean_count median_count min_count max_count
  <chr>      <dbl>        <dbl>     <int>     <int>
1 A           214          214        214       214
2 B           304          304        256       352
3 C           344.         381        250       400
4 D           224.         224.       169       278

Lastly, similar to mutate, we can summarize across multiple columns at once using across(). We will focus more heavily on across() next lesson. Let's get the mean of logFC and logCPM.

dexp %>% 
  summarize(across(starts_with("Log"), mean))

# A tibble: 1 × 2
     logFC logCPM
     <dbl>  <dbl>
1 -0.00859   3.71

Additional Examples

Let's use penguins for additional practice.

The penguins data contains

Data on adult penguins covering three species found on three islands in the Palmer Archipelago, Antarctica, including their size (flipper length, body mass, bill dimensions), and sex. - penguins docs

Let's summarize these data by finding the mean penguin body mass by penguin species. Remember to include na.rm = TRUE to exclude missing values.

penguins %>% 
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass, na.rm = TRUE))

# A tibble: 3 × 2
  species   mean_mass
  <fct>         <dbl>
1 Adelie        3701.
2 Chinstrap     3733.
3 Gentoo        5076.

What if we also want to include the standard deviation by species?

penguins %>%
  group_by(species) %>%
  summarize(mean_mass = mean(body_mass, na.rm = TRUE),
            sd_mass = sd(body_mass, na.rm = TRUE))

# A tibble: 3 × 3
  species   mean_mass sd_mass
  <fct>         <dbl>   <dbl>
1 Adelie        3701.    459.
2 Chinstrap     3733.    384.
3 Gentoo        5076.    504.

Looking for more functions to use with summarize? Here are some useful summary functions. However, the use of summarize() is not limited to these suggestions.

Reordering rows with arrange()

In the tidyverse, reordering rows is largely done by arrange(). Arrange will reorder a variable from smallest to largest, or in the case of characters, alphabetically, from a to z. This is in ascending order.

arrange() will break ties using additionally supplied columns for ordering. It will also mostly ignore grouping. To order by group, use .by_group = TRUE.

Let's arrange the genes in dexp.

dexp %>% arrange(transcript)

# A tibble: 15,926 × 10
   feature   albut transcript ref_genome .abundant   logFC logCPM      F  PValue
   <chr>     <chr> <chr>      <chr>      <lgl>       <dbl>  <dbl>  <dbl>   <dbl>
 1 ENSG0000… untrt A1BG-AS1   hg38       TRUE       0.513   1.02   9.22  1.45e-2
 2 ENSG0000… untrt A2M        hg38       TRUE       0.528  10.1    3.57  9.24e-2
 3 ENSG0000… untrt A2M-AS1    hg38       TRUE      -0.337   0.308  2.76  1.32e-1
 4 ENSG0000… untrt A4GALT     hg38       TRUE       0.519   5.89  24.5   8.54e-4
 5 ENSG0000… untrt AAAS       hg38       TRUE      -0.0254  5.12   0.134 7.23e-1
 6 ENSG0000… untrt AACS       hg38       TRUE      -0.191   4.06   5.00  5.30e-2
 7 ENSG0000… untrt AADAT      hg38       TRUE      -0.642   2.67  16.9   2.76e-3
 8 ENSG0000… untrt AAGAB      hg38       TRUE      -0.165   5.08   5.82  3.98e-2
 9 ENSG0000… untrt AAK1       hg38       TRUE      -0.188   3.82   2.29  1.66e-1
10 ENSG0000… untrt AAMDC      hg38       TRUE       0.447   2.42   8.52  1.75e-2
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>

Let's arrange logFC from smallest to largest.

dexp %>% arrange(logFC)

# A tibble: 15,926 × 10
   feature     albut transcript ref_genome .abundant logFC  logCPM     F  PValue
   <chr>       <chr> <chr>      <chr>      <lgl>     <dbl>   <dbl> <dbl>   <dbl>
 1 ENSG000002… untrt LINC00906  hg38       TRUE      -4.59  0.473  139.  1.13e-6
 2 ENSG000001… untrt LRRTM2     hg38       TRUE      -4.00  1.24   127.  1.64e-6
 3 ENSG000001… untrt VASH2      hg38       TRUE      -3.95  0.0171 152.  7.77e-7
 4 ENSG000001… untrt VCAM1      hg38       TRUE      -3.66  4.60   565.  2.87e-9
 5 ENSG000001… untrt SLC14A1    hg38       TRUE      -3.63  1.38    42.3 1.25e-4
 6 ENSG000002… untrt FER1L6     hg38       TRUE      -3.13  3.53   238.  1.18e-7
 7 ENSG000001… untrt SMTNL2     hg38       TRUE      -3.12  1.46   134.  1.29e-6
 8 ENSG000001… untrt WNT2       hg38       TRUE      -3.07  3.99   521.  4.09e-9
 9 ENSG000001… untrt EGR2       hg38       TRUE      -3.04 -0.141   96.1 5.11e-6
10 ENSG000001… untrt SLITRK6    hg38       TRUE      -3.03  1.16   130.  1.46e-6
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>

What if we want to arrange from largest to smallest (in descending order)? We can use desc().

dexp %>% arrange(desc(logFC))

# A tibble: 15,926 × 10
   feature    albut transcript ref_genome .abundant logFC  logCPM     F   PValue
   <chr>      <chr> <chr>      <chr>      <lgl>     <dbl>   <dbl> <dbl>    <dbl>
 1 ENSG00000… untrt ALOX15B    hg38       TRUE      10.1   1.62    554. 5.92e- 7
 2 ENSG00000… untrt ZBTB16     hg38       TRUE       7.15  4.15   1429. 5.11e-11
 3 ENSG00000… untrt <NA>       <NA>       TRUE       6.17  1.35    380. 1.58e- 8
 4 ENSG00000… untrt ANGPTL7    hg38       TRUE       5.68  3.51    483. 5.66e- 9
 5 ENSG00000… untrt STEAP4     hg38       TRUE       5.22  3.66    445. 8.07e- 9
 6 ENSG00000… untrt PRODH      hg38       TRUE       4.85  1.29    253. 9.10e- 8
 7 ENSG00000… untrt FAM107A    hg38       TRUE       4.74  2.78    656. 1.51e- 9
 8 ENSG00000… untrt LGI3       hg38       TRUE       4.68 -0.0503  106. 3.45e- 6
 9 ENSG00000… untrt SPARCL1    hg38       TRUE       4.56  5.53    721. 1.00e- 9
10 ENSG00000… untrt KLF15      hg38       TRUE       4.48  4.69    479. 5.86e- 9
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>

Note

If you include more than one column to order by descending values, each column needs to be wrapped with desc().

Additional useful functions

distinct() - return distinct combinations of values

acount_smeta %>% distinct(Sample)

# A tibble: 8 × 1
  Sample    
  <chr>     
1 SRR1039508
2 SRR1039509
3 SRR1039512
4 SRR1039513
5 SRR1039516
6 SRR1039517
7 SRR1039520
8 SRR1039521

n_distinct() - "counts the number of unique/distinct combinations in a set of one or more vectors."

Acknowledgments

Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Suzan Baert's dplyr tutorials and Allison Horst's tutorial "Wrangling penguins: some basic data wrangling in R with dplyr".