Skip to content

Data Frames and Data Wrangling (Part 1)

This lesson will introduce data wrangling with R. Attendees will learn to filter data using base R and tidyverse (dplyr) functionality.

Learning Objectives

  • Understand the concept of tidy data.
  • Become familiar with the tidyverse packages.
  • Be able to filter a data frame by rows and columns using base R and dplyr.

Best Practices for organizing genomic data

Let's refer back to our best practices for organizing genomic data.

  1. "Keep raw data separate from analyzed data" -- datacarpentry.org

  2. "Keep spreadsheet data Tidy" -- datacarpentry.org

    But, what is tidy data???

  3. "Trust but verify" -- datacarpentry.org

Introducing tidy data

What is tidy data?

Tidy data is an approach (or philosophy) to data organization and management. There are three rules to tidy data: (1) each variable forms its own column, (2) each observation forms a row, and (3) each value has its own cell. One advantage to following these rules is that the data structure remains consistent, making it easier to understand the tools that work well with the underlying structure, and there are a lot of tools in R built specifically to interact with tidy data. Equipped with the right tools will make data analysis more efficient. See the infographics below.

Tidy Data

Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration

Working with Tidy data
Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration

What is messy data?

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham.

Messy data sets tend to share five common problems:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables. --- Hadley Wickham, Tidy Data

Let's look at a few examples of untidy data presented in Tidyverse Skills for Data Science.

Remember our data frame scaled_counts? scaled_counts is a tidy data frame. Let's take a moment to envision an untidy data frame containing the same data. Again, remember, there are infinite possibilities for messy data, but here is one example.

The orginal data frame:

#import the data and save to an object called scaled_counts
scaled_counts<-read.delim(
  "./data/filtlowabund_scaledcounts_airways.txt", as.is=TRUE)
head(scaled_counts)
          feature sample counts SampleName   cell   dex albut        Run
1 ENSG00000000003    508    679 GSM1275862 N61311 untrt untrt SRR1039508
2 ENSG00000000419    508    467 GSM1275862 N61311 untrt untrt SRR1039508
3 ENSG00000000457    508    260 GSM1275862 N61311 untrt untrt SRR1039508
4 ENSG00000000460    508     60 GSM1275862 N61311 untrt untrt SRR1039508
5 ENSG00000000971    508   3251 GSM1275862 N61311 untrt untrt SRR1039508
6 ENSG00000001036    508   1433 GSM1275862 N61311 untrt untrt SRR1039508
  avgLength Experiment    Sample    BioSample transcript ref_genome .abundant
1       126  SRX384345 SRS508568 SAMN02422669     TSPAN6       hg38      TRUE
2       126  SRX384345 SRS508568 SAMN02422669       DPM1       hg38      TRUE
3       126  SRX384345 SRS508568 SAMN02422669      SCYL3       hg38      TRUE
4       126  SRX384345 SRS508568 SAMN02422669   C1orf112       hg38      TRUE
5       126  SRX384345 SRS508568 SAMN02422669        CFH       hg38      TRUE
6       126  SRX384345 SRS508568 SAMN02422669      FUCA2       hg38      TRUE
       TMM multiplier counts_scaled
1 1.055278   1.415149     960.88642
2 1.055278   1.415149     660.87475
3 1.055278   1.415149     367.93883
4 1.055278   1.415149      84.90896
5 1.055278   1.415149    4600.65058
6 1.055278   1.415149    2027.90904

An untidy version (subset) of scaled_counts.

                                                        508
cell                                                 N61311
dex                                                   untrt
SampleName                                       GSM1275862
Run / Experiment / Accession SRR1039508;SRX384345;SRS508568
TSPAN6                                     960.886417275434
DPM1                                       660.874752382368
SCYL3                                      367.938834302817
C1orf112                                   84.9089617621886
CFH                                        4600.65057814792
                                                        509
cell                                                 N61311
dex                                                     trt
SampleName                                       GSM1275863
Run / Experiment / Accession SRR1039509;SRX384346;SRS508567
TSPAN6                                     716.779730254346
DPM1                                       823.976698841491
SCYL3                                      337.590453311757
C1orf112                                   87.9975115267612
CFH                                        5886.23354376281
                                                        512
cell                                                N052611
dex                                                   untrt
SampleName                                       GSM1275866
Run / Experiment / Accession SRR1039512;SRX384349;SRS508571
TSPAN6                                     1075.40953718585
DPM1                                       764.982041915709
SCYL3                                      323.977901809712
C1orf112                                   49.2742055984354
CFH                                        7609.16919953838
                                                        513
cell                                                N052611
dex                                                     trt
SampleName                                       GSM1275867
Run / Experiment / Accession SRR1039513;SRX384350;SRS508572
TSPAN6                                     871.667100344899
DPM1                                       779.800224573256
SCYL3                                      350.375991315107
C1orf112                                   74.7753640001752
CFH                                        9084.13850653557
                                                        516
cell                                                N080611
dex                                                   untrt
SampleName                                       GSM1275870
Run / Experiment / Accession SRR1039516;SRX384353;SRS508575
TSPAN6                                     1392.02747542151
DPM1                                       718.031746988071
SCYL3                                      299.689570719041
C1orf112                                   95.4113735350417
CFH                                        8221.28001960276
                                                        517
cell                                                N080611
dex                                                     trt
SampleName                                       GSM1275871
Run / Experiment / Accession SRR1039517;SRX384354;SRS508576
TSPAN6                                      1074.3148432507
DPM1                                       819.844851726182
SCYL3                                      339.635351591197
C1orf112                                   64.6435865566326
CFH                                        11314.6798247617
                                                        520
cell                                                N061011
dex                                                   untrt
SampleName                                       GSM1275874
Run / Experiment / Accession SRR1039520;SRX384357;SRS508579
TSPAN6                                     1212.77045557059
DPM1                                       656.786077886933
SCYL3                                      366.981189802531
C1orf112                                   119.702018991383
CFH                                        8152.33750393948
                                                        521
cell                                                N061011
dex                                                     trt
SampleName                                       GSM1275875
Run / Experiment / Accession SRR1039521;SRX384358;SRS508580
TSPAN6                                     868.672540272425
DPM1                                       771.478409892293
SCYL3                                      347.772747766408
C1orf112                                   91.1194972313732
CFH                                        12141.6730060805

Tools for working with tidy data

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. ---tidyverse.org

The core packages within tidyverse include dplyr, ggplot2, forcats, tibble, readr, stringr, tidyr, and purr, some of which we will use in this lesson and future lessons.

Load the core tidyverse packages

The tidyverse includes a collection of packages. To use these packages, we need to load the packages using the library() function.

library(tidyverse)

Load the data

To explore tidyverse functionality, let's read in some data and take a look.

#let's use our differential expression results
dexp<-readRDS("./data/diffexp_results_edger_airways.rds")
We've already learned str(), but let's check out the tidyverse equivalent, glimpse().

str(dexp, give.attr=FALSE)
tibble [15,926 × 10] (S3: tbl_df/tbl/data.frame)
 $ feature   : chr [1:15926] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
 $ albut     : Factor w/ 1 level "untrt": 1 1 1 1 1 1 1 1 1 1 ...
 $ transcript: chr [1:15926] "TSPAN6" "DPM1" "SCYL3" "C1orf112" ...
 $ ref_genome: chr [1:15926] "hg38" "hg38" "hg38" "hg38" ...
 $ .abundant : logi [1:15926] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ logFC     : num [1:15926] -0.3901 0.1978 0.0292 -0.1244 0.4173 ...
 $ logCPM    : num [1:15926] 5.06 4.61 3.48 1.47 8.09 ...
 $ F         : num [1:15926] 32.8495 6.9035 0.0969 0.3772 29.339 ...
 $ PValue    : num [1:15926] 0.000312 0.028062 0.762913 0.554696 0.000463 ...
 $ FDR       : num [1:15926] 0.00283 0.07701 0.84425 0.68233 0.00376 ...
glimpse(dexp) 
Rows: 15,926
Columns: 10
$ feature    <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E…
$ albut      <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
$ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL…
$ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg…
$ .abundant  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ logFC      <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417…
$ logCPM     <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,…
$ F          <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9…
$ PValue     <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0…
$ FDR        <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376…

We can see that glimpse() is a little more succinct and clean and that str() will show attributes. These attribues were ignored above using give.attr=FALSE to get around package dependencies.

sscaled

We will also use a subset version of scaled_counts that includes the columns "sample, "cell", "dex", "transcript", "counts", and "counts_scaled".

  scaled_counts[
    c("sample","cell","dex","transcript","counts","counts_scaled")]

We can use import functions from the tidyverse to load sscaled, which are a bit more efficient and create a data frame like object, known as a tibble.

#import sscaled
sscaled<-read_delim("data/sscaled.txt")
Rows: 127408 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (3): cell, dex, transcript
dbl (3): sample, counts, counts_scaled

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sscaled  
# A tibble: 127,408 × 6
   sample cell   dex       transcript counts counts_scaled
    <dbl> <chr>  <chr>     <chr>       <dbl>         <dbl>
 1    508 N61311 untreated TSPAN6        679         961. 
 2    508 N61311 untreated DPM1          467         661. 
 3    508 N61311 untreated SCYL3         260         368. 
 4    508 N61311 untreated C1orf112       60          84.9
 5    508 N61311 untreated CFH          3251        4601. 
 6    508 N61311 untreated FUCA2        1433        2028. 
 7    508 N61311 untreated GCLC          519         734. 
 8    508 N61311 untreated NFYA          394         558. 
 9    508 N61311 untreated STPG1         172         243. 
10    508 N61311 untreated NIPAL3       2112        2989. 
# ℹ 127,398 more rows

Notice the tibble output. Tibbles easily show you the data types in each column and limit output to the screen (i.e., first 10 rows; only the columns that fit). They also do not force syntactic variable names.

Subsetting data frames with base R

Before diving into subsetting with dplyr, let's take a step back and learn to subset with base R. Subsetting a data frame is similar to subsetting a vector; we can use bracket notation []. However, a data frame is two dimensional with both rows and columns, so we can specify either one argument or two arguments (e.g., df[row,column]) depending. If you provide one argument, columns will be assumed. This is because a data frame has characteristics of both a list and a matrix.

For now, let's focus on providing two arguments to subset. (Note when a df structure is returned)

scaled_counts[2,4] #Returns the value in the 4th column and 2nd row

scaled_counts[2, ] #Returns a df with row two

scaled_counts[-1, ] #Returns a df without row 1

scaled_counts[1:4,1] #returns a vector of rows 1-4 of column 1

#call names of columns directly
scaled_counts[1:10,c("sample","counts")]

#use comparison operators
scaled_counts[scaled_counts$sample == "508",]

What happens when we provide a single argument?

#notice the difference here
scaled_counts[,2] #returns column two
#treated similar to a matrix
#does not return a df if the output is a single column

scaled_counts[2] #returns column two
#treated similar to a list; maintains the df structure

#You can preserve the structure of the data frame while subsetting
# use drop = F
scaled_counts[,2,drop=F]

Note

We can also use [[]] or $ for selecting specific columns.

Using %in%

%in% "returns a logical vector indicating if there is a match or not for its left operand". This logical vector can then be used to filter the datamframe to only matched values.

For example, perhaps we have 4 transcripts that we are interested in exploring further. We can assign those transcripts to a vector.

keep_t<-c("CPD","EXT1","MCL1","LASP1")

We can then see where those transcripts match transcripts in sscaled$trasncript.

head(sscaled$transcript %in% keep_t)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

We can further use this logical transcript to filter our data frame by true values.

#Let's filter our data to only include 4 transcripts of interest
interesting_trnsc<-sscaled[sscaled$transcript %in% keep_t,]

Tips to remember for subsetting

  • Typically provide two values separated by commas: data.frame[row, column]
  • In cases where you are taking a continuous range of numbers use a colon between the numbers (start:stop, inclusive)
  • For a non continuous set of numbers, pass a vector using c()
  • Index using the name of a column(s) by passing them as vectors using c() ---datacarpentry.org

Info

Subsetting including simplifying vs preserving can get confusing. Here is a great chapter - though, a bit more advanced - that may clear things up if you are confused.

Data wrangling with tidyverse

While bracket notation is useful, it is not always the most readable or easy to employ, especially for beginners. This is where dplyr comes in. The dplyr package in the tidyverse world simplifies data wrangling with easy to employ and easy to understand functions specific for data manipulation in data frames.

The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It was built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned. This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database that is over 100s of GB, conduct queries on it directly and pull back just what you need for analysis in R. --- datacarpentry.com

We do not need to load the dplyr package, as it is included in library(tidyverse), which we have already installed and loaded. However, if you need to install and load on your local machine you would use the following:

install.packages("dplyr") 
library("dplyr")

Subsetting with dplyr

We've seen how to select columns and rows using base R, but now let's look at a more intuitive way with functions (select() and filter()) from the tidyverse package dplyr.

Selecting columns

select() requires the data frame followed by the columns that we want to select or deselect as arguments.

#select the gene / transcript, logFC, and FDR corrected p-value
#first argument is the df followed by columns to select
dexp_s<-select(dexp, transcript, logFC, FDR) 

We can also select all columns, leaving out ones that do not interest us using - or !. This is helpful if the columns to keep far outweigh those to exclude.

df_exp<-select(dexp, -feature)

For readability we should move the transcript column to the front. We can select a range of columns using the :.

#you can reorder columns and call a range of columns using select().
df_exp<-select(df_exp, transcript:FDR,albut)  
#Note: this also would have excluded the feature column 

We can also include helper functions such as starts_with() and ends_with(). See more helper functions with ?select().

select(df_exp, transcript, starts_with("log"), FDR)
# A tibble: 15,926 × 4
   transcript   logFC logCPM     FDR
   <chr>        <dbl>  <dbl>   <dbl>
 1 TSPAN6     -0.390    5.06 0.00283
 2 DPM1        0.198    4.61 0.0770 
 3 SCYL3       0.0292   3.48 0.844  
 4 C1orf112   -0.124    1.47 0.682  
 5 CFH         0.417    8.09 0.00376
 6 FUCA2      -0.250    5.91 0.0186 
 7 GCLC       -0.0581   4.84 0.794  
 8 NFYA       -0.509    4.13 0.00126
 9 STPG1      -0.136    3.12 0.478  
10 NIPAL3     -0.0500   7.04 0.695  
# ℹ 15,916 more rows

Test your learning

  1. From the interesting_trnsc data frame select the following columns and save to an object: sample, dex, transcript, counts_scaled.

    Possible Solution

    interesting_trnsc_s<- select(interesting_trnsc, sample, dex, transcript, counts_scaled)
    

  2. From the interesting_trnsc data frame select all columns except cell and counts.

    Possible Solution

    interesting_trnsc_s2<-select(interesting_trnsc,!c(cell,counts))
    

Filtering by row

Now let's filter the rows based on a condition. Let's look at only the treated samples in scaled_counts using the function filter(). filter() requires the df as the first argument followed by the filtering conditions.

filter(sscaled, dex == "treated") #we've seen == notation before

We can also filter using %in%

#filter for two cell lines
f_sscale<-filter(sscaled,cell %in% c("N061011", "N052611"))
#let's check that this worked
levels(factor(f_sscale$cell)) 
[1] "N052611" "N061011"
#let's filter by keep_t from above
filter(f_sscale,transcript %in% keep_t) 
# A tibble: 16 × 6
   sample cell    dex       transcript counts counts_scaled
    <dbl> <chr>   <chr>     <chr>       <dbl>         <dbl>
 1    512 N052611 untreated LASP1        7831         9647.
 2    512 N052611 untreated CPD          8270        10187.
 3    512 N052611 untreated MCL1         5170         6369.
 4    512 N052611 untreated EXT1         8503        10474.
 5    513 N052611 treated   LASP1        5809        12411.
 6    513 N052611 treated   CPD          7638        16318.
 7    513 N052611 treated   MCL1         5153        11009.
 8    513 N052611 treated   EXT1         2317         4950.
 9    520 N061011 untreated LASP1        5766         9082.
10    520 N061011 untreated CPD          7067        11131.
11    520 N061011 untreated MCL1         4410         6946.
12    520 N061011 untreated EXT1         6925        10907.
13    521 N061011 treated   LASP1        7825        11884.
14    521 N061011 treated   CPD         10091        15325.
15    521 N061011 treated   MCL1         7338        11144.
16    521 N061011 treated   EXT1         3242         4923.

And we can filter using numeric columns. There are lots of options for filtering so explore the functionality a bit when you get a chance.

#get only results from counts greater than or equal to 20k
#use head to get only the first handful of rows
head(filter(f_sscale,counts_scaled >= 20000))  
# A tibble: 6 × 6
  sample cell    dex       transcript counts counts_scaled
   <dbl> <chr>   <chr>     <chr>       <dbl>         <dbl>
1    512 N052611 untreated CSDE1       19863        24468.
2    512 N052611 untreated MRC2        23978        29537.
3    512 N052611 untreated DCN        422752       520769.
4    512 N052611 untreated VIM         37558        46266.
5    512 N052611 untreated CD44        25453        31354.
6    512 N052611 untreated VCL         17309        21322.
#use `|` operator 
#look at only results with named genes (not NAs) 
#and those with a log fold change greater than 2 
#and either a p-value or an FDR corrected p_value < or = to 0.01
#The comma acts as &
sig_annot_transcripts<-
  filter(df_exp, !is.na(transcript),
         abs(logFC) > 2, (PValue | FDR <= 0.01))

Test your learning

Filter the interesting_trnsc data frame to only include the following genes: MCL1 and EXT1.

Possible Solution
interesting_trnsc_f<-filter(interesting_trnsc, transcript %in% c("MCL1","EXT1"))

interesting_trnsc_f<-filter(interesting_trnsc, transcript == "MCL1" | transcript =="EXT1")

Acknowledgements

Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org and from a 2021 workshop entitled Introduction to Tidy Transciptomics by Maria Doyle and Stefano Mangiola. This lesson was also inspired by material from "Manipulating and analyzing data with dplyr", Introduction to data analysis with R and Bioconductor.

Resources

  1. R for Data Science
  2. Statistical Inference via Data Science: A ModernDive into R and the tidyverse
  3. dplyr cheatsheet
  4. tidyr cheatsheet
  5. Other cheatsheets