Introduction to Data Wrangling with the Tidyverse
Objectives
Wrangle data using tidyverse functionality (i.e., `dplyr`). To this end, you should understand:
1. how to use common `dplyr` functions (e.g., `select()`, `group_by()`, `arrange()`, `mutate()`, `summarize()`, and `filter()`)
2. how to employ the pipe (`%>%`) operator to link functions
3. how to perform more complicated wrangling using the split-apply-combine concept
4. how to tidy data using `tidyr`.
Reminder: Uploading files from RStudio Server
Any files created by you today will be erased at the end of the session. You can upload any files you downloaded from the last session using the `Upload` option in the `Files` pane.
Best Practices for organizing genomic data
Let's refer back to our best practices for organizing genomic data.
"Keep raw data separate from analyzed data" -- datacarpentry.org
"Keep spreadsheet data Tidy" -- datacarpentry.org
"Trust but verify" -- datacarpentry.org
But, what is tidy data???
Introducing tidy data
What is tidy data?
Tidy data is an approach (or philosophy) to data organization and management. There are three rules to tidy data: (1) each variable forms its own column, (2) each observation forms its own row, and (3) each value has its own cell. One advantage of following these rules is that the data structure remains consistent, and there are many tools in R built specifically to interact with tidy data. Being equipped with the right tools makes data analysis more efficient. See the infographics below.
Image from Lowndes and Horst 2020: Tidy Data for Efficiency, Reproducibility, and Collaboration
What is messy data?
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham.
Messy data sets tend to share five common problems:
- Column headers are values, not variable names.
- Multiple variables are stored in one column.
- Variables are stored in both rows and columns.
- Multiple types of observational units are stored in the same table.
- A single observational unit is stored in multiple tables. --- Hadley Wickham, Tidy Data
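As a quick, hypothetical illustration of the first problem (column headers are values, not variable names), consider a made-up table where the year of measurement lives in the column names rather than in its own column:

```r
# hypothetical example: "year" is a variable, but here it is spread
# across the column headers rather than stored in its own column
messy <- data.frame(country = c("A", "B"),
                    yr2019  = c(10, 20),
                    yr2020  = c(12, 25))
messy
# in tidy form, we would want three columns: country, year, and value
```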
Let's look at a few examples of untidy data presented in Tidyverse Skills for Data Science.
Remember our data frame `scaled_counts`? `scaled_counts` is a tidy data frame. Let's take a moment to envision an untidy data frame that contains the same data. Again, remember, there are infinite possibilities for messy data, but here is one example.
The original data frame:
#import the data and save to an object called scaled_counts
scaled_counts<-read.delim(
"./data/filtlowabund_scaledcounts_airways.txt", as.is=TRUE)
head(scaled_counts)
## feature sample counts SampleName cell dex albut Run
## 1 ENSG00000000003 508 679 GSM1275862 N61311 untrt untrt SRR1039508
## 2 ENSG00000000419 508 467 GSM1275862 N61311 untrt untrt SRR1039508
## 3 ENSG00000000457 508 260 GSM1275862 N61311 untrt untrt SRR1039508
## 4 ENSG00000000460 508 60 GSM1275862 N61311 untrt untrt SRR1039508
## 5 ENSG00000000971 508 3251 GSM1275862 N61311 untrt untrt SRR1039508
## 6 ENSG00000001036 508 1433 GSM1275862 N61311 untrt untrt SRR1039508
## avgLength Experiment Sample BioSample transcript ref_genome .abundant
## 1 126 SRX384345 SRS508568 SAMN02422669 TSPAN6 hg38 TRUE
## 2 126 SRX384345 SRS508568 SAMN02422669 DPM1 hg38 TRUE
## 3 126 SRX384345 SRS508568 SAMN02422669 SCYL3 hg38 TRUE
## 4 126 SRX384345 SRS508568 SAMN02422669 C1orf112 hg38 TRUE
## 5 126 SRX384345 SRS508568 SAMN02422669 CFH hg38 TRUE
## 6 126 SRX384345 SRS508568 SAMN02422669 FUCA2 hg38 TRUE
## TMM multiplier counts_scaled
## 1 1.055278 1.415149 960.88642
## 2 1.055278 1.415149 660.87475
## 3 1.055278 1.415149 367.93883
## 4 1.055278 1.415149 84.90896
## 5 1.055278 1.415149 4600.65058
## 6 1.055278 1.415149 2027.90904
An untidy version (subset) of `scaled_counts`.
# view a snapshot of an untidy data frame
## 508
## cell N61311
## dex untrt
## SampleName GSM1275862
## Run / Experiment / Accession SRR1039508;SRX384345;SRS508568
## TSPAN6 960.886417275434
## DPM1 660.874752382368
## SCYL3 367.938834302817
## C1orf112 84.9089617621886
## CFH 4600.65057814792
## 509
## cell N61311
## dex trt
## SampleName GSM1275863
## Run / Experiment / Accession SRR1039509;SRX384346;SRS508567
## TSPAN6 716.779730254346
## DPM1 823.976698841491
## SCYL3 337.590453311757
## C1orf112 87.9975115267612
## CFH 5886.23354376281
## 512
## cell N052611
## dex untrt
## SampleName GSM1275866
## Run / Experiment / Accession SRR1039512;SRX384349;SRS508571
## TSPAN6 1075.40953718585
## DPM1 764.982041915709
## SCYL3 323.977901809712
## C1orf112 49.2742055984354
## CFH 7609.16919953838
## 513
## cell N052611
## dex trt
## SampleName GSM1275867
## Run / Experiment / Accession SRR1039513;SRX384350;SRS508572
## TSPAN6 871.667100344899
## DPM1 779.800224573256
## SCYL3 350.375991315107
## C1orf112 74.7753640001752
## CFH 9084.13850653557
## 516
## cell N080611
## dex untrt
## SampleName GSM1275870
## Run / Experiment / Accession SRR1039516;SRX384353;SRS508575
## TSPAN6 1392.02747542151
## DPM1 718.031746988071
## SCYL3 299.689570719041
## C1orf112 95.4113735350417
## CFH 8221.28001960276
## 517
## cell N080611
## dex trt
## SampleName GSM1275871
## Run / Experiment / Accession SRR1039517;SRX384354;SRS508576
## TSPAN6 1074.3148432507
## DPM1 819.844851726182
## SCYL3 339.635351591197
## C1orf112 64.6435865566326
## CFH 11314.6798247617
## 520
## cell N061011
## dex untrt
## SampleName GSM1275874
## Run / Experiment / Accession SRR1039520;SRX384357;SRS508579
## TSPAN6 1212.77045557059
## DPM1 656.786077886933
## SCYL3 366.981189802531
## C1orf112 119.702018991383
## CFH 8152.33750393948
## 521
## cell N061011
## dex trt
## SampleName GSM1275875
## Run / Experiment / Accession SRR1039521;SRX384358;SRS508580
## TSPAN6 868.672540272425
## DPM1 771.478409892293
## SCYL3 347.772747766408
## C1orf112 91.1194972313732
## CFH 12141.6730060805
Tools for working with tidy data
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. --- tidyverse.org
The core packages within the tidyverse include `dplyr`, `ggplot2`, `forcats`, `tibble`, `readr`, `stringr`, `tidyr`, and `purrr`, some of which we will use in this lesson and future lessons.
Load the tidyverse library
The tidyverse includes a collection of packages. To use these packages, we need to load them using the `library()` function.
library(tidyverse)
Data wrangling with tidyverse
While bracket notation is useful, it is not always the most readable or easy to employ, especially for beginners. This is where `dplyr` comes in. The `dplyr` package in the `tidyverse` world simplifies data wrangling with easy-to-employ and easy-to-understand functions specific to data manipulation in data frames.
The package dplyr is a fairly new (2014) package that tries to provide easy tools for the most common data manipulation tasks. It was built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. dplyr addresses this by porting much of the computation to C++. An additional feature is the ability to work with data stored directly in an external database. The benefits of doing this are that the data can be managed natively in a relational database, queries can be conducted on that database, and only the results of the query returned. This addresses a common problem with R in that all operations are conducted in memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can have a database that is over 100s of GB, conduct queries on it directly and pull back just what you need for analysis in R. --- datacarpentry.org
We do not need to load the `dplyr` package, as it is included in `library(tidyverse)`, which we have already installed and loaded. However, if you need to install and load it on your local machine, you would use the following:
install.packages("dplyr") ## install
library("dplyr")
#let's use our differential expression results
dexp<-readRDS("./data/diffexp_results_edger_airways.rds")
#We've already learned str()
#but there is a tidyverse equivalent, glimpse()
str(dexp, give.attr=FALSE)
## tibble [15,926 × 10] (S3: tbl_df/tbl/data.frame)
## $ feature : chr [1:15926] "ENSG00000000003" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" ...
## $ albut : Factor w/ 1 level "untrt": 1 1 1 1 1 1 1 1 1 1 ...
## $ transcript: chr [1:15926] "TSPAN6" "DPM1" "SCYL3" "C1orf112" ...
## $ ref_genome: chr [1:15926] "hg38" "hg38" "hg38" "hg38" ...
## $ .abundant : logi [1:15926] TRUE TRUE TRUE TRUE TRUE TRUE ...
## $ logFC : num [1:15926] -0.3901 0.1978 0.0292 -0.1244 0.4173 ...
## $ logCPM : num [1:15926] 5.06 4.61 3.48 1.47 8.09 ...
## $ F : num [1:15926] 32.8495 6.9035 0.0969 0.3772 29.339 ...
## $ PValue : num [1:15926] 0.000312 0.028062 0.762913 0.554696 0.000463 ...
## $ FDR : num [1:15926] 0.00283 0.07701 0.84425 0.68233 0.00376 ...
glimpse(dexp)
## Rows: 15,926
## Columns: 10
## $ feature <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E…
## $ albut <fct> untrt, untrt, untrt, untrt, untrt, untrt, untrt, untrt, unt…
## $ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL…
## $ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg…
## $ .abundant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ logFC <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417…
## $ logCPM <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,…
## $ F <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9…
## $ PValue <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0…
## $ FDR <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376…
#we can see that glimpse is a little more succinct and clean
#also str() will show attributes.
#These were ignored above using give.attr=FALSE to get around package
#dependencies
We will also need the `sscaled` data, so let's load that here, as well as a data frame called `interesting_trnsc`, which contains only 4 transcripts we deemed interesting.
We can use import functions from the tidyverse, which are a bit more efficient and create a data-frame-like object, known as a tibble.
#import sscaled
sscaled<-read_delim("data/sscaled.txt")
## Rows: 127408 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): cell, dex, transcript
## dbl (3): sample, counts, counts_scaled
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sscaled
## # A tibble: 127,408 × 6
## sample cell dex transcript counts counts_scaled
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 508 N61311 untreated TSPAN6 679 961.
## 2 508 N61311 untreated DPM1 467 661.
## 3 508 N61311 untreated SCYL3 260 368.
## 4 508 N61311 untreated C1orf112 60 84.9
## 5 508 N61311 untreated CFH 3251 4601.
## 6 508 N61311 untreated FUCA2 1433 2028.
## 7 508 N61311 untreated GCLC 519 734.
## 8 508 N61311 untreated NFYA 394 558.
## 9 508 N61311 untreated STPG1 172 243.
## 10 508 N61311 untreated NIPAL3 2112 2989.
## # … with 127,398 more rows
#Create interesting_trnsc
keep_t<-c("CPD","EXT1","MCL1","LASP1")
interesting_trnsc<-sscaled[sscaled$transcript %in% keep_t,]
interesting_trnsc
## # A tibble: 32 × 6
## sample cell dex transcript counts counts_scaled
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 508 N61311 untreated LASP1 6314 8935.
## 2 508 N61311 untreated CPD 4533 6415.
## 3 508 N61311 untreated MCL1 5216 7381.
## 4 508 N61311 untreated EXT1 6279 8886.
## 5 509 N61311 treated LASP1 6364 10182.
## 6 509 N61311 treated CPD 4578 7325.
## 7 509 N61311 treated MCL1 6239 9982.
## 8 509 N61311 treated EXT1 3550 5680.
## 9 512 N052611 untreated LASP1 7831 9647.
## 10 512 N052611 untreated CPD 8270 10187.
## # … with 22 more rows
Subsetting with dplyr
How can we select only columns of interest and rows of interest? We can use dplyr's `select()` and `filter()`.
Selecting columns
`select()` requires the data frame followed by the columns that we want to select or deselect as arguments.
#select the gene / transcript, logFC, and FDR corrected p-value
#first argument is the df followed by columns to select
dexp_s<-select(dexp, transcript, logFC, FDR)
We can also select all columns, leaving out ones that do not interest us, using `-` or `!`. This is helpful if the columns to keep far outweigh those to exclude.
df_exp<-select(dexp, -feature)
For readability, we should move the transcript column to the front.
#you can reorder columns and call a range of columns using select().
df_exp<-select(df_exp, transcript:FDR,albut)
#Note: this also would have excluded the feature column
We can also include helper functions such as `starts_with()` and `ends_with()`. See more helper functions with `?select()`.
select(df_exp, transcript, starts_with("log"), FDR)
## # A tibble: 15,926 × 4
## transcript logFC logCPM FDR
## <chr> <dbl> <dbl> <dbl>
## 1 TSPAN6 -0.390 5.06 0.00283
## 2 DPM1 0.198 4.61 0.0770
## 3 SCYL3 0.0292 3.48 0.844
## 4 C1orf112 -0.124 1.47 0.682
## 5 CFH 0.417 8.09 0.00376
## 6 FUCA2 -0.250 5.91 0.0186
## 7 GCLC -0.0581 4.84 0.794
## 8 NFYA -0.509 4.13 0.00126
## 9 STPG1 -0.136 3.12 0.478
## 10 NIPAL3 -0.0500 7.04 0.695
## # … with 15,916 more rows
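As a sketch with a made-up data frame, other helpers such as `ends_with()` and `contains()` work the same way (the column names here are hypothetical, chosen only to mirror the lesson data):

```r
library(dplyr)

# toy data frame (made up for illustration)
df <- data.frame(gene   = c("TP53", "BRCA1"),
                 logFC  = c(1.2, -0.8),
                 logCPM = c(5.1, 3.3),
                 FDR    = c(0.01, 0.2))

select(df, gene, ends_with("CPM"))  # columns whose names end in "CPM"
select(df, contains("log"))         # columns whose names contain "log"
```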
Test your learning
- From the `interesting_trnsc` data frame, select the following columns and save to an object: sample, dex, transcript, counts_scaled.
Possible Solution
interesting_trnsc_s<- select(interesting_trnsc, sample, dex, transcript, counts_scaled)
- From the `interesting_trnsc` data frame, select all columns except cell and counts.
Possible Solution
interesting_trnsc_s2<-select(interesting_trnsc,!c(cell,counts))
Filtering by row
Now let's filter the rows based on a condition. Let's look at only the treated samples in `sscaled` using the function `filter()`. `filter()` requires the data frame as the first argument, followed by the filtering conditions.
filter(sscaled, dex == "treated") #we've seen == notation before
We can also filter using `%in%`:
#filter for two cell lines
f_sscale<-filter(sscaled,cell %in% c("N061011", "N052611"))
#let's check that this worked
levels(factor(f_sscale$cell))
## [1] "N052611" "N061011"
#let's filter by keep_t from above
filter(f_sscale,transcript %in% keep_t)
## # A tibble: 16 × 6
## sample cell dex transcript counts counts_scaled
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 512 N052611 untreated LASP1 7831 9647.
## 2 512 N052611 untreated CPD 8270 10187.
## 3 512 N052611 untreated MCL1 5170 6369.
## 4 512 N052611 untreated EXT1 8503 10474.
## 5 513 N052611 treated LASP1 5809 12411.
## 6 513 N052611 treated CPD 7638 16318.
## 7 513 N052611 treated MCL1 5153 11009.
## 8 513 N052611 treated EXT1 2317 4950.
## 9 520 N061011 untreated LASP1 5766 9082.
## 10 520 N061011 untreated CPD 7067 11131.
## 11 520 N061011 untreated MCL1 4410 6946.
## 12 520 N061011 untreated EXT1 6925 10907.
## 13 521 N061011 treated LASP1 7825 11884.
## 14 521 N061011 treated CPD 10091 15325.
## 15 521 N061011 treated MCL1 7338 11144.
## 16 521 N061011 treated EXT1 3242 4923.
#get only results from counts greater than or equal to 20k
#use head to get only the first handful of rows
head(filter(f_sscale,counts_scaled >= 20000))
## # A tibble: 6 × 6
## sample cell dex transcript counts counts_scaled
## <dbl> <chr> <chr> <chr> <dbl> <dbl>
## 1 512 N052611 untreated CSDE1 19863 24468.
## 2 512 N052611 untreated MRC2 23978 29537.
## 3 512 N052611 untreated DCN 422752 520769.
## 4 512 N052611 untreated VIM 37558 46266.
## 5 512 N052611 untreated CD44 25453 31354.
## 6 512 N052611 untreated VCL 17309 21322.
#use `|` operator
#look at only results with named genes (not NAs)
#and those with a log fold change greater than 2
#and either a p-value or an FDR corrected p_value < or = to 0.01
#The comma acts as &
sig_annot_transcripts<-
filter(df_exp, !is.na(transcript),
abs(logFC) > 2, (PValue <= 0.01 | FDR <= 0.01))
Test your learning
Filter the interesting_trnsc data frame to only include the following genes: MCL1 and EXT1.
Possible Solution
interesting_trnsc_f<-filter(interesting_trnsc, transcript %in% c("MCL1","EXT1"))
interesting_trnsc_f<-filter(interesting_trnsc, transcript == "MCL1" | transcript =="EXT1")
Introducing the pipe
Often we will apply multiple functions to wrangle a data frame into the state that we need it. For example, maybe you want to select and filter. What are our options? We could run one step after another, saving an object for each step, or we could nest a function within a function, but these can affect code readability and clutter our work space, making it difficult to follow what we or someone else did.
For example,
#Run one step at a time with intermediate objects.
#We've done this a few times above
#select gene, logFC, FDR
dexp_s<-select(dexp, transcript, logFC, FDR)
#Now filter for only the genes "TSPAN6" and DPM1
#Note: we could have used %in%
tspanDpm<- filter(dexp_s, transcript == "TSPAN6" | transcript=="DPM1")
#Nested code example; processed from inside out
tspanDpm<- filter(select(dexp, c(transcript, logFC, FDR)),
transcript == "TSPAN6" | transcript=="DPM1" )
Let's explore how piping streamlines this. Piping (using `%>%`) allows you to employ multiple functions consecutively, while improving readability. The output of one function is passed directly to another without storing the intermediate steps as objects. You can pipe from the beginning (reading in the data) all the way to plotting without storing the data or intermediate objects, if you want. Pipes in R come from the `magrittr` package, which is a dependency of `dplyr`.
To pipe, we have to first call the data and then pipe it into a function. The output of each step is then piped into the next step.
Let's see how this works
tspanDpm <- dexp %>% #call the data and pipe to select()
select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) #filter
Notice that the data argument has been dropped from `select()` and `filter()`. This is because the pipe passes the output of the left-hand side as the first argument of the function on the right-hand side. The `%>%` must be at the end of each line, not the beginning.
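As an aside, recent versions of R (4.1 and later) also provide a native pipe, `|>`, which behaves the same way for simple calls; a minimal sketch with a toy data frame:

```r
library(dplyr)

# the magrittr pipe and the base R native pipe are interchangeable
# for simple function calls like this
toy <- data.frame(x = 1:4, grp = c("a", "a", "b", "b"))

toy %>% filter(grp == "a")  # magrittr pipe
toy |>  filter(grp == "a")  # native pipe (R >= 4.1), same result
```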
Piping from the beginning:
readRDS("./data/diffexp_results_edger_airways.rds") %>% #read data
select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) %>% #filter
ggplot(aes(x=transcript,y=logFC,fill=FDR)) + #plot
geom_bar(stat = "identity") +
theme_classic() +
geom_hline(yintercept=0, linetype="dashed", color = "black")
The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames. ---datacarpentry.org
Test your learning
Using what you have learned about `select()` and `filter()`, create a subsetted data frame from scaled_counts that only includes the columns 'sample', 'cell', 'dex', 'transcript', and 'counts_scaled' and only rows that include the treatment "untrt" and the transcripts "ACTN1" and "ANAPC4".
Possible Solution
scaled_counts %>% select(sample,cell,dex,transcript,counts_scaled) %>%
filter(dex=="untrt", transcript %in% c("ACTN1","ANAPC4"))
Mutate and transmute
Other useful data manipulation functions from `dplyr` include `mutate()` and `transmute()`. These functions allow you to create a new variable from existing variables. Perhaps you want to know the ratio of two columns or convert the units of a variable. That can be done with `mutate()`.
mutate() adds new variables and preserves existing ones; transmute() adds new variables and drops existing ones. New variables overwrite existing variables of the same name. --- dplyr.tidyverse.org
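A minimal sketch with a made-up data frame shows the difference:

```r
library(dplyr)

# toy data frame (made up for illustration)
df <- data.frame(counts = c(100, 200), multiplier = c(1.5, 2))

mutate(df, scaled = counts * multiplier)    # keeps counts and multiplier, adds scaled
transmute(df, scaled = counts * multiplier) # returns only the new scaled column
```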
Let's create a column in our original differential expression data frame denoting significant transcripts (those with an FDR corrected p-value less than 0.05 and a log fold change greater than or equal to 2).
dexp_sigtrnsc<-dexp %>% mutate(Significant= FDR<0.05 & abs(logFC) >=2)
The conditional evaluates to a logical vector (contains TRUE or FALSE).
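To see what such a conditional returns on its own:

```r
# comparisons are vectorized and return a logical vector
fdr <- c(0.01, 0.2, 0.04)
fdr < 0.05
## [1]  TRUE FALSE  TRUE
```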
We can also use mutate to coerce variables. Remember those one liners we used in the factor section to coerce our character vectors to factors?
To mutate across multiple columns, we need to use the function `across()` with the select helper `where()`. (Note: `across()` has superseded the use of `mutate_if()`, `mutate_at()`, and `mutate_all()`.) For more information on `across()`, see this reference article.
#view sscaled
glimpse(sscaled)
## Rows: 127,408
## Columns: 6
## $ sample <dbl> 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 5…
## $ cell <chr> "N61311", "N61311", "N61311", "N61311", "N61311", "N6131…
## $ dex <chr> "untreated", "untreated", "untreated", "untreated", "unt…
## $ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "…
## $ counts <dbl> 679, 467, 260, 60, 3251, 1433, 519, 394, 172, 2112, 524,…
## $ counts_scaled <dbl> 960.88642, 660.87475, 367.93883, 84.90896, 4600.65058, 2…
#use mutate with across and select helpers
ex_coerce<-sscaled %>% mutate(across(where(is.character),as.factor))
glimpse(ex_coerce)
## Rows: 127,408
## Columns: 6
## $ sample <dbl> 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 508, 5…
## $ cell <fct> N61311, N61311, N61311, N61311, N61311, N61311, N61311, …
## $ dex <fct> untreated, untreated, untreated, untreated, untreated, u…
## $ transcript <fct> TSPAN6, DPM1, SCYL3, C1orf112, CFH, FUCA2, GCLC, NFYA, S…
## $ counts <dbl> 679, 467, 260, 60, 3251, 1433, 519, 394, 172, 2112, 524,…
## $ counts_scaled <dbl> 960.88642, 660.87475, 367.93883, 84.90896, 4600.65058, 2…
Test your learning
Using `mutate()`, apply a base-10 logarithmic transformation to the counts_scaled column of sscaled. Save the resulting data frame to an object called log10counts. Hint: see the function `log10()`.
Possible Solution
log10counts<-sscaled %>% mutate(logCounts=log10(counts_scaled))
Arrange, group_by, summarize
There is an approach to data analysis known as "split-apply-combine", in which the data is split into smaller components, some type of analysis is applied to each component, and the results are combined. The `dplyr` functions `group_by()` and `summarize()` are key players in this type of workflow. The function `arrange()` may also be handy.
`group_by()` allows us to group a data frame by a categorical variable so that a given operation can be performed per group.
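A minimal, self-contained sketch of the split-apply-combine pattern with a toy data frame:

```r
library(dplyr)

toy <- data.frame(group = c("a", "a", "b", "b"),
                  value = c(1, 3, 10, 20))

toy %>%
  group_by(group) %>%                   # split by group
  summarize(mean_value = mean(value))   # apply mean() per group, combine results
## group "a" has mean 2; group "b" has mean 15
```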
Let's get the median counts_scaled by transcript within a treatment. When you reduce the size of a data set through a calculation, you need to use `summarize()`.
scaled_counts %>% #Call the data
group_by(dex,transcript) %>% # group_by treatment and transcript
#(transcript nested within treatment)
summarize(median_counts=median(counts_scaled))
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
## # A tibble: 29,152 × 3
## # Groups: dex [2]
## dex transcript median_counts
## <chr> <chr> <dbl>
## 1 trt A1BG-AS1 80.2
## 2 trt A2M 37821.
## 3 trt A2M-AS1 24.2
## 4 trt A4GALT 2043.
## 5 trt AAAS 1086.
## 6 trt AACS 481.
## 7 trt AADAT 154.
## 8 trt AAGAB 984.
## 9 trt AAK1 399.
## 10 trt AAMDC 181.
## # … with 29,142 more rows
Now, if we want the top five transcripts with the greatest median scaled counts by treatment, we need to organize our data frame and then return the top rows. We can use `arrange()` to arrange our data frame by median_counts. If we want to arrange from highest to lowest value, we will additionally need to use `desc()`. The `.by_group` argument allows us to arrange by median counts within a grouping. By including `slice_head()`, we can return the top five values by group.
a<-scaled_counts %>% #Call the data
group_by(dex,transcript) %>% # group_by treatment and transcript
#(transcript nested within treatment)
summarize(median_counts=median(counts_scaled)) %>% #for each group
#calculate the median value of scaled counts
arrange(desc(median_counts),.by_group = TRUE) %>%
#arrange in descending order
slice_head(n=5) #return the top 5 values for each group
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
#can skip arrange and use slice_max
b<-scaled_counts %>%
group_by(dex,transcript) %>%
summarize(median_counts=median(counts_scaled)) %>%
slice_max(n=5, order_by=median_counts) #notice use of slice_max
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
How many rows per sample are in the scaled_counts data frame?
scaled_counts %>%
group_by(dex, sample) %>%
summarize(n=n()) #there are multiple functions that can be used here
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 3
## # Groups: dex [2]
## dex sample n
## <chr> <int> <int>
## 1 trt 509 15926
## 2 trt 513 15926
## 3 trt 517 15926
## 4 trt 521 15926
## 5 untrt 508 15926
## 6 untrt 512 15926
## 7 untrt 516 15926
## 8 untrt 520 15926
#See count(); can also use tally()
scaled_counts %>%
count(dex,sample)
## dex sample n
## 1 trt 509 15926
## 2 trt 513 15926
## 3 trt 517 15926
## 4 trt 521 15926
## 5 untrt 508 15926
## 6 untrt 512 15926
## 7 untrt 516 15926
## 8 untrt 520 15926
Note: By default, all [built in] R functions operating on vectors that contain missing data will return NA. It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. When dealing with simple statistics like the mean, the easiest way to ignore NA (the missing data) is to use na.rm = TRUE (rm stands for remove). ---datacarpentry.org
Let's see this in practice
set.seed(138) #This is used to get the same result
#with a pseudorandom number generator like sample()
#make mock data frame
fun_df<-data.frame(genes=rep(c("A","B","C","D"), each=3),
counts=sample(1:500,12,TRUE))
#Assign NAs if the value is less than 100. This is arbitrary.
fun_df<-fun_df %>% mutate(counts=ifelse(counts<100,NA,counts))
fun_df #view
## genes counts
## 1 A NA
## 2 A 214
## 3 A NA
## 4 B 352
## 5 B 256
## 6 B NA
## 7 C 400
## 8 C 381
## 9 C 250
## 10 D 278
## 11 D NA
## 12 D 169
#We should get NAs returned for some of our genes
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts),
median_count = median(counts),
min_count = min(counts),
max_count = max(counts))
## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <int> <int> <int>
## 1 A NA NA NA NA
## 2 B NA NA NA NA
## 3 C 344. 381 250 400
## 4 D NA NA NA NA
#Now let's use na.rm
fun_df %>%
group_by(genes) %>%
summarize(
mean_count = mean(counts, na.rm=TRUE),
median_count = median(counts, na.rm=TRUE),
min_count = min(counts, na.rm=TRUE),
max_count = max(counts, na.rm=TRUE))
## # A tibble: 4 × 5
## genes mean_count median_count min_count max_count
## <chr> <dbl> <dbl> <int> <int>
## 1 A 214 214 214 214
## 2 B 304 304 256 352
## 3 C 344. 381 250 400
## 4 D 224. 224. 169 278
Test your learning
Create a data frame summarizing the mean counts_scaled by sample from the scaled_counts data frame.
Possible Solution
scaled_counts %>% group_by(sample) %>%
summarize(Mean_counts_scaled=mean(counts_scaled))
Data Reshaping
Tidy data implies that we have one observation per row and one variable per column. This generally means data is in a long format. However, whether data is tidy or not depends on what we consider a variable versus an observation, so wide data sets may also be tidy. Often, if data is in a wide format, data related to the same measurement is distributed across different columns. At times this means the data looks more like a data matrix, though it may not necessarily be a matrix.
Let's look at some grocery data.
grocery <- data.frame(fruit=c("apples", "oranges", "bananas", "pears"),
hyvee=c(50,25,75,30),publix=c(25,75,75,60),
kroger=c(35,80,25,15),safeway=c(45,45,35,55))
Formatted here, this data is in a wide format. We can see an initial column with fruit (i.e., apples, oranges, bananas, pears) followed by four columns containing integer data related to grocery stores. These columns (hyvee, publix, kroger, and safeway) all hold data of the same type, collected in the same way. Based on our rules, this isn't really tidy data, but if we want to plot fruit counts in one grocery store against another, this is the best data format. Note: I'm not sure what information you gain from this type of plot. For example,
ggplot(grocery)+
geom_point(aes(x=hyvee,y=publix,color=fruit))
This data can easily be converted to long / tidy format using `pivot_longer()`. `pivot_longer()` transforms data to long format, while `pivot_wider()` converts data to wide format.
`pivot_longer()` takes four main arguments:
1. the data we want to transform
2. the columns we want to pivot longer
3. the column we want to create to store the column names - `names_to`
4. the column we want to create to store the values associated with the column names - `values_to`
(g2<-pivot_longer(grocery,cols=2:5,names_to="store",
values_to="quantity"))
## # A tibble: 16 × 3
## fruit store quantity
## <chr> <chr> <dbl>
## 1 apples hyvee 50
## 2 apples publix 25
## 3 apples kroger 35
## 4 apples safeway 45
## 5 oranges hyvee 25
## 6 oranges publix 75
## 7 oranges kroger 80
## 8 oranges safeway 45
## 9 bananas hyvee 75
## 10 bananas publix 75
## 11 bananas kroger 25
## 12 bananas safeway 35
## 13 pears hyvee 30
## 14 pears publix 60
## 15 pears kroger 15
## 16 pears safeway 55
If we want this back in wide format, we use `pivot_wider()`, which takes three main arguments:
1. the data we are reshaping
2. the column that includes the new column names - `names_from`
3. the column that includes the values that will fill our new columns - `values_from`
pivot_wider(g2,names_from=store,values_from=quantity)
## # A tibble: 4 × 5
## fruit hyvee publix kroger safeway
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 apples 50 25 35 45
## 2 oranges 25 75 80 45
## 3 bananas 75 75 25 35
## 4 pears 30 60 15 55
In genomics, we often receive data in wide format. For example, you may be given RNAseq data from a colleague with the first column being sampleIDs and all additional columns being genes. The data frame itself holds count data.
At times we may need to work with a matrix, while at others we may need a data frame. The following functions can be used to go from a data frame to a matrix and back again.
#take a look at fun_df again;
#we see genes in one column and expression values in another.
fun_df
## genes counts
## 1 A NA
## 2 A 214
## 3 A NA
## 4 B 352
## 5 B 256
## 6 B NA
## 7 C 400
## 8 C 381
## 9 C 250
## 10 D 278
## 11 D NA
## 12 D 169
#let's add a sample column and overwrite fun_df
#this data is in long format
fun_df<-data.frame(fun_df, sampleid=rep(c("A1","B1","C1"),4))
#let's convert to wide format using pivot_wider
fun_df_w<-fun_df %>%
pivot_wider(id_cols=sampleid,names_from=genes,values_from=counts)
fun_df_w
## # A tibble: 3 × 5
## sampleid A B C D
## <chr> <int> <int> <int> <int>
## 1 A1 NA 352 400 278
## 2 B1 214 256 381 NA
## 3 C1 NA NA 250 169
#convert to matrix
fun_mat<-fun_df_w %>% column_to_rownames("sampleid") %>%
as.matrix(rownames.force=TRUE)
fun_mat
## A B C D
## A1 NA 352 400 278
## B1 214 256 381 NA
## C1 NA NA 250 169
#Now, we can easily apply functions that require data matrices.
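Going the other direction (matrix back to a data frame) can be sketched like this; `rownames_to_column()` is the inverse of `column_to_rownames()`:

```r
library(tibble)

# toy matrix standing in for fun_mat, with sample IDs as rownames
fun_mat <- matrix(1:4, nrow = 2,
                  dimnames = list(c("A1", "B1"), c("geneA", "geneB")))

# convert the matrix back to a data frame,
# restoring the rownames as a proper column
fun_df_back <- rownames_to_column(as.data.frame(fun_mat), "sampleid")
fun_df_back
```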
There are other reasons you may be interested in using pivot_wider or pivot_longer. In my experience, most uses revolve around plotting criteria. For example, you may want to plot two different but related measurements on the same plot. You could pivot_longer so that those two measurements are now in the same column, stored as a variable.
Let's see how this might work with our scaled_counts data. We want to plot both "counts" and "counts_scaled" together in a density plot to understand the distribution of the data. Did scaling the counts improve the distribution?
#put counts and counts_scaled into a column named source
#with their values in a column named abundance
scounts_long<- scaled_counts %>% #getting the data
pivot_longer(cols = c("counts", "counts_scaled"),
names_to = "source", values_to = "abundance") #pivot
head(scounts_long)
## # A tibble: 6 × 18
## feature sample SampleName cell dex albut Run avgLength Experiment Sample
## <chr> <int> <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 ENSG000… 508 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345 SRS50…
## 2 ENSG000… 508 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345 SRS50…
## 3 ENSG000… 508 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345 SRS50…
## 4 ENSG000… 508 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345 SRS50…
## 5 ENSG000… 508 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345 SRS50…
## 6 ENSG000… 508 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345 SRS50…
## # … with 8 more variables: BioSample <chr>, transcript <chr>, ref_genome <chr>,
## # .abundant <lgl>, TMM <dbl>, multiplier <dbl>, source <chr>, abundance <dbl>
#Plot using ggplot2; it's not important to understand this code here.
ggplot(data=scounts_long, aes(x = abundance + 1, color = SampleName))+
geom_density() +
facet_wrap(~source) +
ggplot2::scale_x_log10() +
theme_bw()+
ylab("Density")+
xlab(expression('log'[10]*'Abundance'))
Note: There are other ways to reformat data in R. Check out the package `reshape2`.
Exporting files from RStudio
Remember, because we are using RStudio server through DNAnexus, any files created by you today will be erased at the end of the session.
To use the materials you generated on the RStudio Server on DNAnexus on your local computer, you will need to export your files using `More` and `Export` in the `Files` pane.
Acknowledgements
Material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org and from a 2021 workshop entitled Introduction to Tidy Transcriptomics by Maria Doyle and Stefano Mangiola. This lesson was also inspired by material from "Manipulating and analyzing data with dplyr", Introduction to data analysis with R and Bioconductor.