
Introduction to Data Wrangling with the Tidyverse

Objectives
Wrangle data using tidyverse functionality (e.g., dplyr). To this end, you should understand:
1. how to use common dplyr functions (e.g., select(), group_by(), arrange(), mutate(), summarize(), and filter())
2. how to employ the pipe (%>%) operator to link functions
3. how to perform more complicated wrangling using the split, apply, combine concept
4. how to tidy data using tidyr.

Reminder: Uploading files from RStudio Server

Any files created by you today will be erased at the end of the session. You can upload any files you downloaded from the last session using the Upload option in the Files pane.

Best Practices for organizing genomic data

Let's refer back to our best practices for organizing genomic data.

  1. "Keep raw data separate from analyzed data" -- datacarpentry.org

  2. "Keep spreadsheet data Tidy" -- datacarpentry.org

  3. "Trust but verify" -- datacarpentry.org

But what is tidy data?

Introducing tidy data

What is tidy data?

Tidy data is an approach (or philosophy) to data organization and management. There are three rules to tidy data: (1) each variable forms its own column, (2) each observation forms its own row, and (3) each value has its own cell. One advantage of following these rules is that the data structure stays consistent: you always know which tools work well with the underlying structure, and many tools in R are built specifically to interact with tidy data. Being equipped with the right tools makes data analysis more efficient. See the infographics below.
[Infographic: Tidy Data — image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration]
[Infographic: Working with Tidy Data — image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration]
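As a quick illustration of the three rules, consider this small tibble. The data here are made up for demonstration and are not part of our dataset:

```r
library(tibble)

# sample, gene, and count are each a variable (one column apiece);
# each row is one observation (one gene measured in one sample);
# each value sits in its own cell.
tidy_example <- tibble(
  sample = c("S1", "S1", "S2", "S2"),
  gene   = c("TP53", "BRCA1", "TP53", "BRCA1"),
  count  = c(120, 45, 98, 51)
)
tidy_example
```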

What is messy data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham.

Messy data sets tend to share five common problems:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables. --- Hadley Wickham, Tidy Data
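To make problem (1) concrete, here is a sketch using made-up data in which sample IDs sit in the column headers. tidyr's pivot_longer() gathers those headers back into a proper variable:

```r
library(tidyr)
library(tibble)

# Messy: the headers "S1" and "S2" are sample IDs -- values, not variable names
messy <- tibble(
  gene = c("TP53", "BRCA1"),
  S1   = c(120, 45),
  S2   = c(98, 51)
)

# Gather the sample columns into a sample/count pair of variables
tidy_counts <- pivot_longer(messy, cols = c(S1, S2),
                            names_to = "sample", values_to = "count")
tidy_counts
```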

Let's look at a few examples of untidy data presented in Tidyverse Skills for Data Science.

Remember our data frame scaled_counts? scaled_counts is a tidy data frame. Let's take a moment to envision an untidy data frame that contains the same data. Again, remember, there are infinite possibilities for messy data, but here is one example.

The original data frame:

#import the data and save to an object called scaled_counts
scaled_counts<-read.delim(
  "./data/filtlowabund_scaledcounts_airways.txt", as.is=TRUE)
head(scaled_counts)
##           feature sample counts SampleName   cell   dex albut        Run
## 1 ENSG00000000003    508    679 GSM1275862 N61311 untrt untrt SRR1039508
## 2 ENSG00000000419    508    467 GSM1275862 N61311 untrt untrt SRR1039508
## 3 ENSG00000000457    508    260 GSM1275862 N61311 untrt untrt SRR1039508
## 4 ENSG00000000460    508     60 GSM1275862 N61311 untrt untrt SRR1039508
## 5 ENSG00000000971    508   3251 GSM1275862 N61311 untrt untrt SRR1039508
## 6 ENSG00000001036    508   1433 GSM1275862 N61311 untrt untrt SRR1039508
##   avgLength Experiment    Sample    BioSample transcript ref_genome .abundant
## 1       126  SRX384345 SRS508568 SAMN02422669     TSPAN6       hg38      TRUE
## 2       126  SRX384345 SRS508568 SAMN02422669       DPM1       hg38      TRUE
## 3       126  SRX384345 SRS508568 SAMN02422669      SCYL3       hg38      TRUE
## 4       126  SRX384345 SRS508568 SAMN02422669   C1orf112       hg38      TRUE
## 5       126  SRX384345 SRS508568 SAMN02422669        CFH       hg38      TRUE
## 6       126  SRX384345 SRS508568 SAMN02422669      FUCA2       hg38      TRUE
##        TMM multiplier counts_scaled
## 1 1.055278   1.415149     960.88642
## 2 1.055278   1.415149     660.87475
## 3 1.055278   1.415149     367.93883
## 4 1.055278   1.415149      84.90896
## 5 1.055278   1.415149    4600.65058
## 6 1.055278   1.415149    2027.90904

An untidy version (subset) of scaled_counts.

c # view a snapshot of an untidy data frame
##                                                         508
## cell                                                 N61311
## dex                                                   untrt
## SampleName                                       GSM1275862
## Run / Experiment / Accession SRR1039508;SRX384345;SRS508568
## TSPAN6                                     960.886417275434
## DPM1                                       660.874752382368
## SCYL3                                      367.938834302817
## C1orf112                                   84.9089617621886
## CFH                                        4600.65057814792
##                                                         509
## cell                                                 N61311
## dex                                                     trt
## SampleName                                       GSM1275863
## Run / Experiment / Accession SRR1039509;SRX384346;SRS508567
## TSPAN6                                     716.779730254346
## DPM1                                       823.976698841491
## SCYL3                                      337.590453311757
## C1orf112                                   87.9975115267612
## CFH                                        5886.23354376281
##                                                         512
## cell                                                N052611
## dex                                                   untrt
## SampleName                                       GSM1275866
## Run / Experiment / Accession SRR1039512;SRX384349;SRS508571
## TSPAN6                                     1075.40953718585
## DPM1                                       764.982041915709
## SCYL3                                      323.977901809712
## C1orf112                                   49.2742055984354
## CFH                                        7609.16919953838
##                                                         513
## cell                                                N052611
## dex                                                     trt
## SampleName                                       GSM1275867
## Run / Experiment / Accession SRR1039513;SRX384350;SRS508572
## TSPAN6                                     871.667100344899
## DPM1                                       779.800224573256
## SCYL3                                      350.375991315107
## C1orf112                                   74.7753640001752
## CFH                                        9084.13850653557
##                                                         516
## cell                                                N080611
## dex                                                   untrt
## SampleName                                       GSM1275870
## Run / Experiment / Accession SRR1039516;SRX384353;SRS508575
## TSPAN6                                     1392.02747542151
## DPM1                                       718.031746988071
## SCYL3                                      299.689570719041
## C1orf112                                   95.4113735350417
## CFH                                        8221.28001960276
##                                                         517
## cell                                                N080611
## dex                                                     trt
## SampleName                                       GSM1275871
## Run / Experiment / Accession SRR1039517;SRX384354;SRS508576
## TSPAN6                                      1074.3148432507
## DPM1                                       819.844851726182
## SCYL3                                      339.635351591197
## C1orf112                                   64.6435865566326
## CFH                                        11314.6798247617
##                                                         520
## cell                                                N061011
## dex                                                   untrt
## SampleName                                       GSM1275874
## Run / Experiment / Accession SRR1039520;SRX384357;SRS508579
## TSPAN6                                     1212.77045557059
## DPM1                                       656.786077886933
## SCYL3                                      366.981189802531
## C1orf112                                   119.702018991383
## CFH                                        8152.33750393948
##                                                         521
## cell                                                N061011
## dex                                                     trt
## SampleName                                       GSM1275875
## Run / Experiment / Accession SRR1039521;SRX384358;SRS508580
## TSPAN6                                     868.672540272425
## DPM1                                       771.478409892293
## SCYL3                                      347.772747766408
## C1orf112                                   91.1194972313732
## CFH                                        12141.6730060805

Tools for working with tidy data

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ---tidyverse.org

The core packages within tidyverse include dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purrr, some of which we will use in this lesson and future lessons.

Load the tidyverse library

The tidyverse includes a collection of packages. To use these packages, we need to load the packages using the library() function.

library(tidyverse)

Data wrangling with tidyverse

While bracket notation is useful, it is not always the most readable or easy to employ, especially for beginners. This is where dplyr comes in. The dplyr package in the tidyverse world simplifies data wrangling with easy to employ and easy to understand functions specific for data manipulation in data frames.

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It was built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned. This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database that is over 100s of GB, conduct queries on it directly and pull back just what you need for analysis in R. --- datacarpentry.org

We do not need to load the dplyr package, as it is included in library(tidyverse), which we have already installed and loaded. However, if you needed to install and load it on your local machine, you would use the following:

install.packages("dplyr") ## install
library("dplyr")
Let's read in some more data and take a look

#let's use our differential expression results
dexp<-readRDS("./data/diffexp_results_edger_airways.rds")

#We've already learned str() 
#but there is a tidyverse equivalent, glimpse()
str(dexp, give.attr=FALSE)
## tibble [15,926 × 10] (S3: tbl_df/tbl/data.frame)
##  $ feature   : chr [1:15926] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
##  $ albut     : Factor w/ 1 level "untrt": 1 1 1 1 1 1 1 1 1 1 ...
##  $ transcript: chr [1:15926] "TSPAN6" "DPM1" "SCYL3" "C1orf112" ...
##  $ ref_genome: chr [1:15926] "hg38" "hg38" "hg38" "hg38" ...
##  $ .abundant : logi [1:15926] TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ logFC     : num [1:15926] -0.3901 0.1978 0.0292 -0.1244 0.4173 ...
##  $ logCPM    : num [1:15926] 5.06 4.61 3.48 1.47 8.09 ...
##  $ F         : num [1:15926] 32.8495 6.9035 0.0969 0.3772 29.339 ...
##  $ PValue    : num [1:15926] 0.000312 0.028062 0.762913 0.554696 0.000463 ...
##  $ FDR       : num [1:15926] 0.00283 0.07701 0.84425 0.68233 0.00376 ...
glimpse(dexp) 
## Rows: 15,926
## Columns: 10
## $ feature    <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E…
## $ albut      <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
## $ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL…
## $ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg…
## $ .abundant  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ logFC      <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417…
## $ logCPM     <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,…
## $ F          <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9…
## $ PValue     <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0…
## $ FDR        <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376…

#we can see that glimpse is a little more succinct and clean
#also str() will show attributes.
#These were ignored above using give.attr=FALSE to get around package 
#dependencies
Note: We will also be returning to our sscaled data, so let's load that here, as well as a data frame called interesting_trnsc, which contains only 4 transcripts we deemed interesting.

We can use import functions from the tidyverse, which are a bit more efficient and create a data-frame-like object known as a tibble.

#import sscaled
sscaled<-read_delim("data/sscaled.txt")
## Rows: 127408 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): cell, dex, transcript
## dbl (3): sample, counts, counts_scaled
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sscaled  
## # A tibble: 127,408 × 6
##    sample cell   dex       transcript counts counts_scaled
##     <dbl> <chr>  <chr>     <chr>       <dbl>         <dbl>
##  1    508 N61311 untreated TSPAN6        679         961. 
##  2    508 N61311 untreated DPM1          467         661. 
##  3    508 N61311 untreated SCYL3         260         368. 
##  4    508 N61311 untreated C1orf112       60          84.9
##  5    508 N61311 untreated CFH          3251        4601. 
##  6    508 N61311 untreated FUCA2        1433        2028. 
##  7    508 N61311 untreated GCLC          519         734. 
##  8    508 N61311 untreated NFYA          394         558. 
##  9    508 N61311 untreated STPG1         172         243. 
## 10    508 N61311 untreated NIPAL3       2112        2989. 
## # … with 127,398 more rows
#Create interesting_trnsc
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-sscaled[sscaled$transcript %in% keep_t,]
interesting_trnsc

## # A tibble: 32 × 6
##    sample cell    dex       transcript counts counts_scaled
##     <dbl> <chr>   <chr>     <chr>       <dbl>         <dbl>
##  1    508 N61311  untreated LASP1        6314         8935.
##  2    508 N61311  untreated CPD          4533         6415.
##  3    508 N61311  untreated MCL1         5216         7381.
##  4    508 N61311  untreated EXT1         6279         8886.
##  5    509 N61311  treated   LASP1        6364        10182.
##  6    509 N61311  treated   CPD          4578         7325.
##  7    509 N61311  treated   MCL1         6239         9982.
##  8    509 N61311  treated   EXT1         3550         5680.
##  9    512 N052611 untreated LASP1        7831         9647.
## 10    512 N052611 untreated CPD          8270        10187.
## # … with 22 more rows
Notice the tibble output. Tibbles easily show you the data types in each column and limit output to the screen (i.e., first 10 rows; only the columns that fit).

Subsetting with dplyr

How can we select only columns of interest and rows of interest? We can use dplyr's select() and filter().

Selecting columns

select() requires the data frame followed by the columns that we want to select or deselect as arguments.

#select the gene / transcript, logFC, and FDR corrected p-value
#first argument is the df followed by columns to select
dexp_s<-select(dexp, transcript, logFC, FDR) 

We can also select all columns, leaving out ones that do not interest us using - or !. This is helpful if the columns to keep far outweigh those to exclude.

df_exp<-select(dexp, -feature)

For readability, we should move the transcript column to the front.

#you can reorder columns and call a range of columns using select().
df_exp<-select(df_exp, transcript:FDR,albut)  
#Note: this also would have excluded the feature column 

We can also include helper functions such as starts_with() and ends_with(). See more helper functions with ?select().

select(df_exp, transcript, starts_with("log"), FDR)
## # A tibble: 15,926 × 4
##    transcript   logFC logCPM     FDR
##    <chr>        <dbl>  <dbl>   <dbl>
##  1 TSPAN6     -0.390    5.06 0.00283
##  2 DPM1        0.198    4.61 0.0770 
##  3 SCYL3       0.0292   3.48 0.844  
##  4 C1orf112   -0.124    1.47 0.682  
##  5 CFH         0.417    8.09 0.00376
##  6 FUCA2      -0.250    5.91 0.0186 
##  7 GCLC       -0.0581   4.84 0.794  
##  8 NFYA       -0.509    4.13 0.00126
##  9 STPG1      -0.136    3.12 0.478  
## 10 NIPAL3     -0.0500   7.04 0.695  
## # … with 15,916 more rows

Test your learning

  1. From the interesting_trnsc data frame select the following columns and save to an object: sample, dex, transcript, counts_scaled.

Possible Solution
interesting_trnsc_s<- select(interesting_trnsc, sample, dex, transcript, counts_scaled)

  2. From the `interesting_trnsc` data frame select all columns except cell and counts.

Possible Solution
interesting_trnsc_s2<-select(interesting_trnsc,!c(cell,counts))

Filtering by row

Now let's filter the rows based on a condition. Let's look at only the treated samples in sscaled using the function filter(). filter() requires the data frame as the first argument followed by the filtering conditions.

filter(sscaled, dex == "treated") #we've seen == notation before

We can also filter using %in%

#filter for two cell lines
f_sscale<-filter(sscaled,cell %in% c("N061011", "N052611"))
#let's check that this worked
levels(factor(f_sscale$cell)) 
## [1] "N052611" "N061011"
#let's filter by keep_t from above
filter(f_sscale,transcript %in% keep_t) 

## # A tibble: 16 × 6
##    sample cell    dex       transcript counts counts_scaled
##     <dbl> <chr>   <chr>     <chr>       <dbl>         <dbl>
##  1    512 N052611 untreated LASP1        7831         9647.
##  2    512 N052611 untreated CPD          8270        10187.
##  3    512 N052611 untreated MCL1         5170         6369.
##  4    512 N052611 untreated EXT1         8503        10474.
##  5    513 N052611 treated   LASP1        5809        12411.
##  6    513 N052611 treated   CPD          7638        16318.
##  7    513 N052611 treated   MCL1         5153        11009.
##  8    513 N052611 treated   EXT1         2317         4950.
##  9    520 N061011 untreated LASP1        5766         9082.
## 10    520 N061011 untreated CPD          7067        11131.
## 11    520 N061011 untreated MCL1         4410         6946.
## 12    520 N061011 untreated EXT1         6925        10907.
## 13    521 N061011 treated   LASP1        7825        11884.
## 14    521 N061011 treated   CPD         10091        15325.
## 15    521 N061011 treated   MCL1         7338        11144.
## 16    521 N061011 treated   EXT1         3242         4923.
And we can filter using numeric columns. There are lots of options for filtering, so explore the functionality a bit when you get a chance.

#get only rows with scaled counts greater than or equal to 20k
#use head to get only the first handful of rows
head(filter(f_sscale,counts_scaled >= 20000))  
## # A tibble: 6 × 6
##   sample cell    dex       transcript counts counts_scaled
##    <dbl> <chr>   <chr>     <chr>       <dbl>         <dbl>
## 1    512 N052611 untreated CSDE1       19863        24468.
## 2    512 N052611 untreated MRC2        23978        29537.
## 3    512 N052611 untreated DCN        422752       520769.
## 4    512 N052611 untreated VIM         37558        46266.
## 5    512 N052611 untreated CD44        25453        31354.
## 6    512 N052611 untreated VCL         17309        21322.
#use `|` operator 
#look at only results with named genes (not NAs) 
#and those with a log fold change greater than 2 
#and either a p-value or an FDR corrected p-value <= 0.01
#The comma acts as &
sig_annot_transcripts<-
  filter(df_exp, !is.na(transcript),
         abs(logFC) > 2, (PValue <= 0.01 | FDR <= 0.01))

Test your learning

Filter the interesting_trnsc data frame to only include the following genes: MCL1 and EXT1.

Possible Solution

interesting_trnsc_f<-filter(interesting_trnsc, transcript %in% c("MCL1","EXT1"))

interesting_trnsc_f<-filter(interesting_trnsc, transcript == "MCL1" | transcript =="EXT1")

Introducing the pipe

Often we will apply multiple functions to wrangle a data frame into the state that we need. For example, maybe you want to select and filter. What are our options? We could run one step after another, saving an intermediate object for each step, or we could nest a function within a function, but both approaches can hurt code readability and clutter our workspace, making it difficult to follow what we (or someone else) did.

For example,

#Run one step at a time with intermediate objects. 
#We've done this a few times above
#select gene, logFC, FDR
dexp_s<-select(dexp, transcript, logFC, FDR)

#Now filter for only the genes "TSPAN6" and "DPM1"
#Note: we could have used %in%
tspanDpm<- filter(dexp_s, transcript == "TSPAN6" | transcript=="DPM1") 

#Nested code example; processed from inside out
tspanDpm<- filter(select(dexp, c(transcript, logFC, FDR)), 
                  transcript == "TSPAN6" | transcript=="DPM1" )

Let's explore how piping streamlines this. Piping (using %>%) allows you to employ multiple functions consecutively, while improving readability. The output of one function is passed directly to another without storing the intermediate steps as objects. You can pipe from the beginning (reading in the data) all the way to plotting without storing the data or intermediate objects, if you want. Pipes in R come from the magrittr package, which is a dependency of dplyr.

To pipe, we have to first call the data and then pipe it into a function. The output of each step is then piped into the next step.

Let's see how this works

tspanDpm <- dexp %>% #call the data and pipe to select()
  select(transcript, logFC, FDR) %>% #select columns of interest 
  filter(transcript == "TSPAN6" | transcript=="DPM1" ) #filter 

Notice that the data argument has been dropped from select() and filter(). This is because the pipe passes the input from the left to the right. The %>% must be at the end of each line.

Piping from the beginning:

readRDS("./data/diffexp_results_edger_airways.rds") %>% #read data
  select(transcript, logFC, FDR) %>% #select columns of interest 
  filter(transcript == "TSPAN6" | transcript=="DPM1" ) %>% #filter 
  ggplot(aes(x=transcript,y=logFC,fill=FDR)) + #plot
  geom_bar(stat = "identity") +
  theme_classic() + 
  geom_hline(yintercept=0, linetype="dashed", color = "black")

The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames. ---datacarpentry.org

Test your learning

Using what you have learned about select() and filter(), create a subsetted data frame from scaled_counts that includes only the columns 'sample', 'cell', 'dex', 'transcript', and 'counts_scaled' and only rows with the treatment "untrt" and the transcripts "ACTN1" and "ANAPC4".

Possible Solution
scaled_counts %>% select(sample,cell,dex,transcript,counts_scaled) %>%
  filter(dex=="untrt", transcript %in% c("ACTN1","ANAPC4"))

Mutate and transmute

Other useful data manipulation functions from dplyr include mutate() and transmute(). These functions allow you to create a new variable from existing variables. Perhaps you want to know the ratio of two columns or convert the units of a variable. That can be done with mutate().

mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name. --- dplyr.tidyverse.org

Let's create a column in our original differential expression data frame denoting significant transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater than or equal to 2).

dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)

The conditional evaluates to a logical vector (contains TRUE or FALSE).
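To see the difference between the two functions, here is a sketch contrasting them on a small mock version of our differential expression columns (the values are illustrative only, not from dexp):

```r
library(dplyr)
library(tibble)

# Mock differential-expression results (made-up values)
dexp_demo <- tibble(
  transcript = c("TSPAN6", "DPM1"),
  logFC      = c(-2.4, 0.2),
  FDR        = c(0.003, 0.077)
)

# mutate() keeps the existing columns and appends Significant
with_sig <- mutate(dexp_demo, Significant = FDR < 0.05 & abs(logFC) >= 2)

# transmute() returns only the variables it creates
only_sig <- transmute(dexp_demo, Significant = FDR < 0.05 & abs(logFC) >= 2)
```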

We can also use mutate to coerce variables. Remember those one-liners we used in the factor section to coerce our character vectors to factors?

To mutate across multiple columns, we need to use the function across() with the select helper where(). (Note: across() has superseded the use of mutate_if(), mutate_at(), and mutate_all().) For more information on across() see this reference article.

#view sscaled
glimpse(sscaled)  
## Rows: 127,408
## Columns: 6
## $ sample        <dbl> 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 5…
## $ cell          <chr> "N61311", "N61311", "N61311", "N61311", "N61311", "N6131…
## $ dex           <chr> "untreated", "untreated", "untreated", "untreated", "unt…
## $ transcript    <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "…
## $ counts        <dbl> 679, 467, 260, 60, 3251, 1433, 519, 394, 172, 2112, 524,…
## $ counts_scaled <dbl> 960.88642, 660.87475, 367.93883, 84.90896, 4600.65058, 2…
#use mutate with across and select helpers
ex_coerce<-sscaled %>% mutate(across(where(is.character),as.factor)) 
glimpse(ex_coerce) 
## Rows: 127,408
## Columns: 6
## $ sample        <dbl> 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 5…
## $ cell          <fct> N61311, N61311, N61311, N61311, N61311, N61311, N61311, …
## $ dex           <fct> untreated, untreated, untreated, untreated, untreated, u…
## $ transcript    <fct> TSPAN6, DPM1, SCYL3, C1orf112, CFH, FUCA2, GCLC, NFYA, S…
## $ counts        <dbl> 679, 467, 260, 60, 3251, 1433, 519, 394, 172, 2112, 524,…
## $ counts_scaled <dbl> 960.88642, 660.87475, 367.93883, 84.90896, 4600.65058, 2…

Test your learning

Using mutate(), apply a base-10 logarithmic transformation to the counts_scaled column of sscaled. Save the resulting data frame to an object called log10counts. Hint: see the function log10().

Possible Solution

log10counts<-sscaled %>% mutate(logCounts=log10(counts_scaled))

Arrange, group_by, summarize

There is an approach to data analysis known as "split-apply-combine", in which the data are split into smaller components, some type of analysis is applied to each component, and the results are combined. The dplyr functions group_by() and summarize() are key players in this type of workflow. The function arrange() may also be handy.

group_by() allows us to group a data frame by a categorical variable so that a given operation can be performed per group.

Let's get the median counts_scaled by transcript within a treatment. When you reduce the size of a data set through a calculation, you need to use summarize().

scaled_counts %>% #Call the data 
  group_by(dex,transcript) %>% # group_by treatment and transcript 
  #(transcript nested within treatment)
  summarize(median_counts=median(counts_scaled))
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.

## # A tibble: 29,152 × 3
## # Groups:   dex [2]
##    dex   transcript median_counts
##    <chr> <chr>              <dbl>
##  1 trt   A1BG-AS1            80.2
##  2 trt   A2M              37821. 
##  3 trt   A2M-AS1             24.2
##  4 trt   A4GALT            2043. 
##  5 trt   AAAS              1086. 
##  6 trt   AACS               481. 
##  7 trt   AADAT              154. 
##  8 trt   AAGAB              984. 
##  9 trt   AAK1               399. 
## 10 trt   AAMDC              181. 
## # … with 29,142 more rows
The output is a grouped data frame.
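If the leftover grouping (and the `summarise()` message) is not wanted, the .groups argument of summarize() can drop it. A minimal sketch using a tiny made-up stand-in for scaled_counts:

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for scaled_counts
demo <- tibble(
  dex           = c("trt", "trt", "untrt", "untrt"),
  transcript    = c("A", "A", "A", "A"),
  counts_scaled = c(10, 20, 30, 50)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
out <- demo %>%
  group_by(dex, transcript) %>%
  summarize(median_counts = median(counts_scaled), .groups = "drop")
out
```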

Now, if we want the top five transcripts with the greatest median scaled counts by treatment, we need to organize our data frame and then return the top rows. We can use arrange() to sort our data frame by median_counts. If we want to arrange from highest to lowest value, we will additionally need to use desc(). The .by_group argument allows us to arrange by median counts within a grouping. By including slice_head(), we can return the top five values for each group.

a<-scaled_counts %>% #Call the data 
  group_by(dex,transcript) %>% # group_by treatment and transcript 
  #(transcript nested within treatment)
  summarize(median_counts=median(counts_scaled)) %>% #for each group 
  #calculate the median value of scaled counts
  arrange(desc(median_counts),.by_group = TRUE) %>% 
  #arrange in descending order
  slice_head(n=5) #return the top 5 values for each group
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
#can skip arrange and use slice_max
b<-scaled_counts %>% 
  group_by(dex,transcript) %>% 
  summarize(median_counts=median(counts_scaled))  %>% 
  slice_max(n=5, order_by=median_counts) #notice use of slice_max
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.

How many rows per sample are in the scaled_counts data frame?

scaled_counts  %>% 
  group_by(dex, sample) %>% 
  summarize(n=n()) #there are multiple functions that can be used here 
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 3
## # Groups:   dex [2]
##   dex   sample     n
##   <chr>  <int> <int>
## 1 trt      509 15926
## 2 trt      513 15926
## 3 trt      517 15926
## 4 trt      521 15926
## 5 untrt    508 15926
## 6 untrt    512 15926
## 7 untrt    516 15926
## 8 untrt    520 15926
#See count(); can also use tally()
scaled_counts %>% 
  count(dex,sample)
##     dex sample     n
## 1   trt    509 15926
## 2   trt    513 15926
## 3   trt    517 15926
## 4   trt    521 15926
## 5 untrt    508 15926
## 6 untrt    512 15926
## 7 untrt    516 15926
## 8 untrt    520 15926

Note: By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org

Let's see this in practice

set.seed(138) #This is used to get the same result
#with a pseudorandom number generator like sample()

#make mock data frame
fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
                   counts=sample(1:500,12,TRUE)) 

#Assign NAs if the value is less than 100. This is arbitrary.
fun_df<-fun_df %>% mutate(counts=ifelse(counts<100,NA,counts))

fun_df #view
##    genes counts
## 1      A     NA
## 2      A    214
## 3      A     NA
## 4      B    352
## 5      B    256
## 6      B     NA
## 7      C    400
## 8      C    381
## 9      C    250
## 10     D    278
## 11     D     NA
## 12     D    169
#We should get NAs returned for some of our genes
fun_df %>% 
  group_by(genes) %>%
  summarize(
    mean_count = mean(counts),
    median_count = median(counts),
    min_count = min(counts),
    max_count = max(counts))
## # A tibble: 4 × 5
##   genes mean_count median_count min_count max_count
##   <chr>      <dbl>        <int>     <int>     <int>
## 1 A            NA            NA        NA        NA
## 2 B            NA            NA        NA        NA
## 3 C           344.          381       250       400
## 4 D            NA            NA        NA        NA
#Now let's use na.rm
fun_df %>% 
  group_by(genes) %>%
  summarize(
    mean_count = mean(counts, na.rm=TRUE),
    median_count = median(counts, na.rm=TRUE),
    min_count = min(counts, na.rm=TRUE),
    max_count = max(counts, na.rm=TRUE))
## # A tibble: 4 × 5
##   genes mean_count median_count min_count max_count
##   <chr>      <dbl>        <dbl>     <int>     <int>
## 1 A           214          214        214       214
## 2 B           304          304        256       352
## 3 C           344.         381        250       400
## 4 D           224.         224.       169       278

Test your learning

Create a data frame summarizing the mean counts_scaled by sample from the scaled_counts data frame.

Possible Solution

scaled_counts %>% group_by(sample) %>%
  summarize(Mean_counts_scaled=mean(counts_scaled))

Data Reshaping

Tidy data implies one observation per row and one variable per column, which generally means data is in a long format. However, whether data is tidy depends on what we consider a variable versus an observation, so wide data sets may also be tidy. In a wide format, values of the same measurement are often spread across multiple columns, so the data can resemble a data matrix, even though it may not actually be stored as one.

Let's look at some grocery data.

grocery <- data.frame(fruit=c("apples", "oranges", "bananas", "pears"),
                  hyvee=c(50,25,75,30),publix=c(25,75,75,60),
                  kroger=c(35,80,25,15),safeway=c(45,45,35,55))

As formatted here, this data is in a wide format. We see an initial column with fruit (i.e., apples, oranges, bananas, pears) followed by four columns (hyvee, publix, kroger, and safeway) containing integer counts for each grocery store. These columns all hold data of the same type, collected in the same way. By our rules, this isn't really tidy data, but if we want to plot fruit counts at one grocery store against another, this is the most convenient format. (Note: it's not clear how much information you gain from this type of plot.) For example,

ggplot(grocery)+
  geom_point(aes(x=hyvee,y=publix,color=fruit))

This data can easily be converted to long / tidy format using pivot_longer. pivot_longer() transforms data to long format, while pivot_wider() converts data to wide format.

pivot_longer() takes four main arguments:
1. the data we want to transform
2. the columns we want to pivot longer
3. the column we want to create to store the column names - names_to
4. the column we want to create to store the values associated with the column names - values_to

(g2<-pivot_longer(grocery,cols=2:5,names_to="store",
                  values_to="quantity")) 
## # A tibble: 16 × 3
##    fruit   store   quantity
##    <chr>   <chr>      <dbl>
##  1 apples  hyvee         50
##  2 apples  publix        25
##  3 apples  kroger        35
##  4 apples  safeway       45
##  5 oranges hyvee         25
##  6 oranges publix        75
##  7 oranges kroger        80
##  8 oranges safeway       45
##  9 bananas hyvee         75
## 10 bananas publix        75
## 11 bananas kroger        25
## 12 bananas safeway       35
## 13 pears   hyvee         30
## 14 pears   publix        60
## 15 pears   kroger        15
## 16 pears   safeway       55

If we want this back in wide format, we use pivot_wider, which takes three main arguments:
1. the data we are reshaping
2. the column that includes the new column names - names_from
3. the column that includes the values that will fill our new columns - values_from

pivot_wider(g2,names_from=store,values_from=quantity)
## # A tibble: 4 × 5
##   fruit   hyvee publix kroger safeway
##   <chr>   <dbl>  <dbl>  <dbl>   <dbl>
## 1 apples     50     25     35      45
## 2 oranges    25     75     80      45
## 3 bananas    75     75     25      35
## 4 pears      30     60     15      55

In genomics, we often receive data in wide format. For example, you may be given RNAseq data from a colleague with the first column being sampleIDs and all additional columns being genes. The data frame itself holds count data.
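A minimal sketch of what such wide data might look like, and how to tidy it with pivot_longer(). The sample IDs, gene names, and counts here are made up for illustration, and the tidyverse is assumed to be loaded as earlier in the lesson:

```r
#Hypothetical wide RNAseq counts: one row per sample, one column per gene
rna_wide <- data.frame(sampleID = c("S1", "S2", "S3"),
                       GeneA = c(10, 0, 25),
                       GeneB = c(5, 12, 8),
                       GeneC = c(100, 87, 93))

#Pivot all gene columns (everything except sampleID) into long format:
#one row per sample-gene pair
rna_long <- rna_wide %>%
  pivot_longer(cols = -sampleID, names_to = "gene", values_to = "counts")
```

The result has 9 rows (3 samples x 3 genes) with columns sampleID, gene, and counts, matching our tidy data rules.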

At times we may need to work with a matrix, while at other times we may need a data frame. The following functions can be used to go from a data frame to a matrix and back again.

#take a look at fun_df again;
#we see genes in one column and expression values in another.
fun_df 
##    genes counts
## 1      A     NA
## 2      A    214
## 3      A     NA
## 4      B    352
## 5      B    256
## 6      B     NA
## 7      C    400
## 8      C    381
## 9      C    250
## 10     D    278
## 11     D     NA
## 12     D    169
#let's add a sample column and overwrite fun_df
#this data is in long format
fun_df<-data.frame(fun_df, sampleid=rep(c("A1","B1","C1"),4)) 

#let's convert to wide format using pivot_wider
fun_df_w<-fun_df %>% 
  pivot_wider(id_cols=sampleid,names_from=genes,values_from=counts) 
fun_df_w
## # A tibble: 3 × 5
##   sampleid     A     B     C     D
##   <chr>    <int> <int> <int> <int>
## 1 A1          NA   352   400   278
## 2 B1         214   256   381    NA
## 3 C1          NA    NA   250   169
#convert to matrix
fun_mat<-fun_df_w %>% column_to_rownames("sampleid") %>% 
  as.matrix(rownames.force=TRUE)
fun_mat
##      A   B   C   D
## A1  NA 352 400 278
## B1 214 256 381  NA
## C1  NA  NA 250 169
#Now, we can easily apply functions that require data matrices.
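To go back again (matrix to data frame), we can reverse these steps. A sketch, assuming fun_mat exists as created above and the tidyverse is loaded:

```r
#Convert the matrix back to a data frame, restore sampleid as a column,
#then pivot back to long format
fun_df_back <- fun_mat %>%
  as.data.frame() %>%
  rownames_to_column("sampleid") %>%
  pivot_longer(cols = -sampleid, names_to = "genes", values_to = "counts")
```

This recovers one row per sample-gene pair (12 rows), including the NAs.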

There are other reasons you may want to use pivot_wider() or pivot_longer(). In my experience, most uses revolve around plotting needs. For example, you may want to plot two different but related measurements on the same plot. You could pivot longer so that those two measurements end up in a single column, stored as a variable.

Let's see how this might work with our scaled_counts data. We want to plot both "counts" and "counts_scaled" together in a density plot to understand the distribution of the data. Did scaling the counts improve the distribution?

#put counts and counts_scaled into a column named source 
#with their values in a column named abundance
scounts_long<- scaled_counts %>% #getting the data
  pivot_longer(cols = c("counts", "counts_scaled"), 
               names_to = "source", values_to = "abundance") #pivot
head(scounts_long)
## # A tibble: 6 × 18
##   feature  sample SampleName cell  dex   albut Run   avgLength Experiment Sample
##   <chr>     <int> <chr>      <chr> <chr> <chr> <chr>     <int> <chr>      <chr> 
## 1 ENSG000…    508 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345  SRS50…
## 2 ENSG000…    508 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345  SRS50…
## 3 ENSG000…    508 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345  SRS50…
## 4 ENSG000…    508 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345  SRS50…
## 5 ENSG000…    508 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345  SRS50…
## 6 ENSG000…    508 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345  SRS50…
## # … with 8 more variables: BioSample <chr>, transcript <chr>, ref_genome <chr>,
## #   .abundant <lgl>, TMM <dbl>, multiplier <dbl>, source <chr>, abundance <dbl>
#Plot using ggplot2; it's not important to understand this code here. 
ggplot(data=scounts_long, aes(x = abundance + 1, color = SampleName))+
  geom_density() +
  facet_wrap(~source) +
  ggplot2::scale_x_log10() +
  theme_bw()+
  ylab("Density")+
  xlab(expression('log'[10]*'Abundance'))

Note: There are other ways to reshape data in R. Check out the package reshape2, which has since been superseded by tidyr.

Exporting files from RStudio

Remember, because we are using RStudio server through DNAnexus, any files created by you today will be erased at the end of the session.

To use the materials you generated on the RServer on DNAnexus on your local computer, you will need to export your files using More and Export in the Files pane.

Acknowledgements

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org and from a 2021 workshop entitled Introduction to Tidy Transcriptomics by Maria Doyle and Stefano Mangiola. This lesson was also inspired by material from "Manipulating and analyzing data with dplyr", Introduction to data analysis with R and Bioconductor.

Resources

  1. R for Data Science
  2. Statistical Inference via Data Science: A ModernDive into R and the tidyverse
  3. dplyr cheatsheet
  4. tidyr cheatsheet
  5. Other cheatsheets