Introduction to dplyr and the %>%
Objectives
Today we will begin to wrangle data using the tidyverse package dplyr. To this end, you will learn:
- how to filter data frames using dplyr
- how to employ the pipe (%>%) operator to link functions
What is dplyr?
The package dplyr tries to provide easy tools for the most common data manipulation tasks. It was built to work directly with data frames. The thinking behind it was largely inspired by the package plyr which has been in use for some time but suffered from being slow in some cases. --- datacarpentry.org
Read more about dplyr at https://dplyr.tidyverse.org/articles/programming.html.
Loading dplyr
We do not need to load the dplyr package separately, as it is a core tidyverse package. If you need to install and load only dplyr, use install.packages("dplyr") and library(dplyr).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Importing data
For this lesson, we will use sample metadata and differential expression results from the airway RNA-Seq project.
Let's begin by importing the data.
#sample information
smeta<-read_delim("./data/airway_sampleinfo.txt")
## Rows: 8 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
## dbl (1): avgLength
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
smeta
## # A tibble: 8 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
## 3 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
## 4 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
## 5 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
## 6 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
## 7 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
## 8 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
#let's use our differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")
## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dexp
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000… untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 3.12e-4
## 2 ENSG000… untrt DPM1 hg38 TRUE 0.198 4.61 6.90 2.81e-2
## 3 ENSG000… untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 7.63e-1
## 4 ENSG000… untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 5.55e-1
## 5 ENSG000… untrt CFH hg38 TRUE 0.417 8.09 29.3 4.63e-4
## 6 ENSG000… untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 4.05e-3
## 7 ENSG000… untrt GCLC hg38 TRUE -0.0581 4.84 0.167 6.92e-1
## 8 ENSG000… untrt NFYA hg38 TRUE -0.509 4.13 44.9 1.00e-4
## 9 ENSG000… untrt STPG1 hg38 TRUE -0.136 3.12 1.04 3.35e-1
## 10 ENSG000… untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 5.69e-1
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
We can get an idea of the structure of these data by using str() or glimpse(). glimpse(), from tidyverse, is similar to str() but provides somewhat cleaner output.
glimpse(smeta)
## Rows: 8
## Columns: 9
## $ SampleName <chr> "GSM1275862", "GSM1275863", "GSM1275866", "GSM1275867", "GS…
## $ cell <chr> "N61311", "N61311", "N052611", "N052611", "N080611", "N0806…
## $ dex <chr> "untrt", "trt", "untrt", "trt", "untrt", "trt", "untrt", "t…
## $ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "untr…
## $ Run <chr> "SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513", "SR…
## $ avgLength <dbl> 126, 126, 126, 87, 120, 126, 101, 98
## $ Experiment <chr> "SRX384345", "SRX384346", "SRX384349", "SRX384350", "SRX384…
## $ Sample <chr> "SRS508568", "SRS508567", "SRS508571", "SRS508572", "SRS508…
## $ BioSample <chr> "SAMN02422669", "SAMN02422675", "SAMN02422678", "SAMN024226…
glimpse(dexp)
## Rows: 15,926
## Columns: 10
## $ feature <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E…
## $ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "untr…
## $ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL…
## $ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg…
## $ .abundant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ logFC <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417…
## $ logCPM <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,…
## $ F <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9…
## $ PValue <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0…
## $ FDR <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376…
Now that we have some data to work with, let's start subsetting.
Subsetting data in base R
Base R uses bracket notation for subsetting. For example, if we want to subset the data frame iris to include only the first 5 rows and the first 3 columns, we could use
iris[1:5,1:3]
## Sepal.Length Sepal.Width Petal.Length
## 1 5.1 3.5 1.4
## 2 4.9 3.0 1.4
## 3 4.7 3.2 1.3
## 4 4.6 3.1 1.5
## 5 5.0 3.6 1.4
While this type of subsetting is useful, it is not always the most readable or easy to employ, especially for beginners. This is where dplyr comes in. The dplyr package in the tidyverse world simplifies data wrangling with easy-to-use, easy-to-understand functions designed for manipulating data frames.
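As a rough preview, the bracket example can be written with dplyr verbs. This is a minimal sketch using the built-in iris data; slice() and select() are dplyr functions, and the object names (base_subset, dplyr_subset, named_subset) are made up for illustration.

```r
library(dplyr)

# Base R: rows 1-5 and columns 1-3, both chosen by position
base_subset <- iris[1:5, 1:3]

# A dplyr equivalent: slice() picks rows by position, select() picks columns
dplyr_subset <- select(slice(iris, 1:5), 1:3)

# Selecting columns by name is usually easier to read than by position
named_subset <- select(slice(iris, 1:5), Sepal.Length, Sepal.Width, Petal.Length)

# Both approaches return the same values
all.equal(base_subset, dplyr_subset, check.attributes = FALSE)
```

The named version says what is being kept without requiring the reader to remember the column order.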
Subsetting with dplyr
How can we select only columns of interest and rows of interest? We can use select() and filter() from dplyr.
Subsetting by column (select())
To subset by column, we use the function select(). We can include and exclude columns, reorder columns, and rename columns using select().
Select a few columns from our differential expression results (dexp).
We can select the columns we are interested in by first calling the data frame object (dexp) followed by the columns we want to select (transcript, logFC, FDR). All arguments are separated by a comma. The order of the arguments will determine the order of the columns in the new data frame.
#select the gene / transcript, logFC, and FDR corrected p-value
#first argument is the df followed by columns to select
ex1<-select(dexp, transcript, logFC, FDR)
ex1
## # A tibble: 15,926 × 3
## transcript logFC FDR
## <chr> <dbl> <dbl>
## 1 TSPAN6 -0.390 0.00283
## 2 DPM1 0.198 0.0770
## 3 SCYL3 0.0292 0.844
## 4 C1orf112 -0.124 0.682
## 5 CFH 0.417 0.00376
## 6 FUCA2 -0.250 0.0186
## 7 GCLC -0.0581 0.794
## 8 NFYA -0.509 0.00126
## 9 STPG1 -0.136 0.478
## 10 NIPAL3 -0.0500 0.695
## # ℹ 15,916 more rows
We can rename while selecting.
#rename using the syntax new_name = old_name
ex1<-select(dexp, gene=transcript, logFoldChange = logFC, FDRpvalue=FDR)
ex1
## # A tibble: 15,926 × 3
## gene logFoldChange FDRpvalue
## <chr> <dbl> <dbl>
## 1 TSPAN6 -0.390 0.00283
## 2 DPM1 0.198 0.0770
## 3 SCYL3 0.0292 0.844
## 4 C1orf112 -0.124 0.682
## 5 CFH 0.417 0.00376
## 6 FUCA2 -0.250 0.0186
## 7 GCLC -0.0581 0.794
## 8 NFYA -0.509 0.00126
## 9 STPG1 -0.136 0.478
## 10 NIPAL3 -0.0500 0.695
## # ℹ 15,916 more rows
Note
If you want to retain all columns, you could also use rename() from dplyr to rename columns.
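The difference between renaming inside select() and using rename() can be seen on a small table. This is a sketch with a hypothetical toy tibble standing in for dexp; the values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for dexp (hypothetical values)
toy <- tibble(
  transcript = c("TSPAN6", "DPM1"),
  logFC      = c(-0.39, 0.198),
  FDR        = c(0.0028, 0.077)
)

# select() with renaming keeps only the columns that are named
selected <- select(toy, gene = transcript)

# rename() uses the same new_name = old_name syntax but keeps every column
renamed <- rename(toy, gene = transcript)

names(selected)
names(renamed)
```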
Excluding columns
We can select all columns except those that do not interest us by using a - sign. This is helpful if the columns to keep far outnumber those to exclude. We can similarly use the ! to negate a selection.
ex2<-select(dexp, -feature)
ex2
## # A tibble: 15,926 × 9
## albut transcript ref_genome .abundant logFC logCPM F PValue FDR
## <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.00283
## 2 untrt DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0770
## 3 untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.844
## 4 untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.682
## 5 untrt CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.00376
## 6 untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0186
## 7 untrt GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.794
## 8 untrt NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.00126
## 9 untrt STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.478
## 10 untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.695
## # ℹ 15,916 more rows
ex2<-select(dexp, !feature)
ex2
## # A tibble: 15,926 × 9
## albut transcript ref_genome .abundant logFC logCPM F PValue FDR
## <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 untrt TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.00283
## 2 untrt DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0770
## 3 untrt SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.844
## 4 untrt C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.682
## 5 untrt CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.00376
## 6 untrt FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0186
## 7 untrt GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.794
## 8 untrt NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.00126
## 9 untrt STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.478
## 10 untrt NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.695
## # ℹ 15,916 more rows
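To drop more than one column at a time, the columns can be wrapped in c() before negating. The sketch below uses a hypothetical toy tibble with a few of the dexp column names; the values are invented.

```r
library(dplyr)

# Toy stand-in for dexp (hypothetical values)
toy <- tibble(
  feature    = c("ENSG01", "ENSG02"),
  albut      = c("untrt", "untrt"),
  transcript = c("TSPAN6", "DPM1"),
  logFC      = c(-0.39, 0.198)
)

# Exclude several columns with - or !
drop_minus <- select(toy, -c(feature, albut))
drop_bang  <- select(toy, !c(feature, albut))

# Both forms leave the same columns behind
identical(drop_minus, drop_bang)
names(drop_minus)
```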
We can reorder using select().
For readability, let's move the transcript column to the front.
#you can reorder columns and call a range of columns using select().
ex3<-select(dexp, transcript:FDR,albut)
ex3
## # A tibble: 15,926 × 9
## transcript ref_genome .abundant logFC logCPM F PValue FDR albut
## <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 TSPAN6 hg38 TRUE -0.390 5.06 32.8 0.000312 0.00283 untrt
## 2 DPM1 hg38 TRUE 0.198 4.61 6.90 0.0281 0.0770 untrt
## 3 SCYL3 hg38 TRUE 0.0292 3.48 0.0969 0.763 0.844 untrt
## 4 C1orf112 hg38 TRUE -0.124 1.47 0.377 0.555 0.682 untrt
## 5 CFH hg38 TRUE 0.417 8.09 29.3 0.000463 0.00376 untrt
## 6 FUCA2 hg38 TRUE -0.250 5.91 14.9 0.00405 0.0186 untrt
## 7 GCLC hg38 TRUE -0.0581 4.84 0.167 0.692 0.794 untrt
## 8 NFYA hg38 TRUE -0.509 4.13 44.9 0.000100 0.00126 untrt
## 9 STPG1 hg38 TRUE -0.136 3.12 1.04 0.335 0.478 untrt
## 10 NIPAL3 hg38 TRUE -0.0500 7.04 0.350 0.569 0.695 untrt
## # ℹ 15,916 more rows
Note
This selection also excludes the feature column.
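If the goal is only to move a column to the front without dropping any others, the everything() selection helper is one option. A minimal sketch on a hypothetical toy tibble (invented values):

```r
library(dplyr)

# Toy stand-in for dexp (hypothetical values)
toy <- tibble(
  feature    = c("ENSG01", "ENSG02"),
  albut      = c("untrt", "untrt"),
  transcript = c("TSPAN6", "DPM1"),
  logFC      = c(-0.39, 0.198)
)

# Name the column to move first, then let everything() fill in the rest
front <- select(toy, transcript, everything())
names(front)
```

Because everything() matches all remaining columns, no column is lost in the reordering.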
Selecting a range of columns
Notice that we can select a range of columns using the : operator. We could also deselect a range of columns or deselect a range of columns while adding a column back.
ex3<-select(dexp, -(albut:F),logFC)
ex3
## # A tibble: 15,926 × 4
## feature PValue FDR logFC
## <chr> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 0.000312 0.00283 -0.390
## 2 ENSG00000000419 0.0281 0.0770 0.198
## 3 ENSG00000000457 0.763 0.844 0.0292
## 4 ENSG00000000460 0.555 0.682 -0.124
## 5 ENSG00000000971 0.000463 0.00376 0.417
## 6 ENSG00000001036 0.00405 0.0186 -0.250
## 7 ENSG00000001084 0.692 0.794 -0.0581
## 8 ENSG00000001167 0.000100 0.00126 -0.509
## 9 ENSG00000001460 0.335 0.478 -0.136
## 10 ENSG00000001461 0.569 0.695 -0.0500
## # ℹ 15,916 more rows
Helper functions
We can also include helper functions such as starts_with() and ends_with().
select(dexp, transcript, starts_with("log"), FDR)
## # A tibble: 15,926 × 4
## transcript logFC logCPM FDR
## <chr> <dbl> <dbl> <dbl>
## 1 TSPAN6 -0.390 5.06 0.00283
## 2 DPM1 0.198 4.61 0.0770
## 3 SCYL3 0.0292 3.48 0.844
## 4 C1orf112 -0.124 1.47 0.682
## 5 CFH 0.417 8.09 0.00376
## 6 FUCA2 -0.250 5.91 0.0186
## 7 GCLC -0.0581 4.84 0.794
## 8 NFYA -0.509 4.13 0.00126
## 9 STPG1 -0.136 3.12 0.478
## 10 NIPAL3 -0.0500 7.04 0.695
## # ℹ 15,916 more rows
There are a number of other selection helpers. See the help documentation for select() for more information (?dplyr::select).
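Two of the other helpers, ends_with() and contains(), work the same way as starts_with(). The sketch below uses a hypothetical toy tibble that borrows column names from dexp; the values are invented.

```r
library(dplyr)

# Toy stand-in for dexp (hypothetical values)
toy <- tibble(
  transcript = c("TSPAN6", "DPM1"),
  logFC      = c(-0.39, 0.198),
  logCPM     = c(5.06, 4.61),
  PValue     = c(0.0003, 0.028),
  FDR        = c(0.0028, 0.077)
)

# Columns whose names end with "FC"
fc_cols <- select(toy, ends_with("FC"))

# Columns whose names contain "Value"
value_cols <- select(toy, contains("Value"))

# Helpers can be mixed with plain column names
mixed <- select(toy, transcript, contains("log"))

names(fc_cols)
names(value_cols)
names(mixed)
```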
Select columns of a particular type
There are many other ways to select multiple columns. You may commonly be interested in selecting all numeric columns or all factors. The syntax below can be used for this purpose.
select(dexp, where(is.numeric)) #or
## # A tibble: 15,926 × 5
## logFC logCPM F PValue FDR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.390 5.06 32.8 0.000312 0.00283
## 2 0.198 4.61 6.90 0.0281 0.0770
## 3 0.0292 3.48 0.0969 0.763 0.844
## 4 -0.124 1.47 0.377 0.555 0.682
## 5 0.417 8.09 29.3 0.000463 0.00376
## 6 -0.250 5.91 14.9 0.00405 0.0186
## 7 -0.0581 4.84 0.167 0.692 0.794
## 8 -0.509 4.13 44.9 0.000100 0.00126
## 9 -0.136 3.12 1.04 0.335 0.478
## 10 -0.0500 7.04 0.350 0.569 0.695
## # ℹ 15,916 more rows
select_if(dexp, is.numeric) #select_if is a scoped verb function; it is superseded by where()
## # A tibble: 15,926 × 5
## logFC logCPM F PValue FDR
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.390 5.06 32.8 0.000312 0.00283
## 2 0.198 4.61 6.90 0.0281 0.0770
## 3 0.0292 3.48 0.0969 0.763 0.844
## 4 -0.124 1.47 0.377 0.555 0.682
## 5 0.417 8.09 29.3 0.000463 0.00376
## 6 -0.250 5.91 14.9 0.00405 0.0186
## 7 -0.0581 4.84 0.167 0.692 0.794
## 8 -0.509 4.13 44.9 0.000100 0.00126
## 9 -0.136 3.12 1.04 0.335 0.478
## 10 -0.0500 7.04 0.350 0.569 0.695
## # ℹ 15,916 more rows
Subsetting by row (filter())
To subset by row, we use the function filter().
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. ---R4DS
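The NA behavior is easy to miss, so here is a small sketch of it. The toy tibble and its values are hypothetical.

```r
library(dplyr)

# Toy data with a missing value (hypothetical)
toy <- tibble(
  gene  = c("A", "B", "C"),
  logFC = c(2.5, NA, -0.3)
)

# NA > 0 evaluates to NA, so gene B is dropped along with gene C
kept <- filter(toy, logFC > 0)

# To keep rows with missing values, ask for them explicitly
kept_with_na <- filter(toy, logFC > 0 | is.na(logFC))

kept$gene
kept_with_na$gene
```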
Now let's filter the rows from smeta based on a condition. Let's look at only the treated samples in dex (i.e., trt) using the function filter(). The first argument is the data frame (e.g., smeta) followed by the expression(s) to filter the data frame.
filter(smeta, dex == "trt") #we've seen == notation before
To complete these filter phrases you will often need to include comparison operators such as the == above. These operators help us evaluate relations. For example, a == b asks whether a and b are equal. It is a logical comparison that returns TRUE or FALSE when evaluated. The filter() function then returns the rows that evaluate to TRUE.
Try the following:
a <- 1
b <- 1
a == b
## [1] TRUE
Keep these comparison operators in mind for filtering.
Comparison operators
Comparison Operator | Description |
---|---|
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
!= | not equal |
== | equal |
a \| b | a or b |
a & b | a and b |
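These operators work element by element on vectors, which is how filter() applies them to a column. A short base R sketch with an arbitrary example vector x:

```r
# An arbitrary numeric vector for illustration
x <- c(1, 5, 10)

x > 5            # FALSE FALSE  TRUE
x >= 5           # FALSE  TRUE  TRUE
x != 5           #  TRUE FALSE  TRUE
x == 5           # FALSE  TRUE FALSE

# & and | combine two logical vectors element by element
x > 1 & x < 10   # FALSE  TRUE FALSE
x < 5 | x > 5    #  TRUE FALSE  TRUE
```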
We may want to combine filtering conditions using AND or OR phrasing with the operators & and |.
For example, if we only wanted to return rows where dex == "trt" and cell == "N61311", we can use:
filter(smeta, dex == "trt" & cell == "N61311")
## # A tibble: 1 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS50… SAMN0242…
A comma (,) is treated the same as & in the case of filter().
filter(smeta, dex == "trt", cell == "N61311")
## # A tibble: 1 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS50… SAMN0242…
We can also filter by one condition or another using the | operator.
filter(smeta,cell == "N080611" | cell == "N61311")
## # A tibble: 4 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275862 N61311 untrt untrt SRR10395… 126 SRX384345 SRS50… SAMN0242…
## 2 GSM1275863 N61311 trt untrt SRR10395… 126 SRX384346 SRS50… SAMN0242…
## 3 GSM1275870 N080611 untrt untrt SRR10395… 120 SRX384353 SRS50… SAMN0242…
## 4 GSM1275871 N080611 trt untrt SRR10395… 126 SRX384354 SRS50… SAMN0242…
The %in% operator
The %in% operator is used to match elements of a vector.
%in% returns a logical vector indicating if there is a match or not for its left operand. --- match R Documentation.
The returned logical vector will be the length of the vector to the left. Its basic usage:
smeta$SampleName %in% c("GSM1275871","GSM1275863")
## [1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
c("GSM1275871","GSM1275863") %in% smeta$SampleName
## [1] TRUE TRUE
We can combine the %in% operator with filter().
#filter for two cell lines
filter(smeta,cell %in% c("N061011", "N052611"))
## # A tibble: 4 × 9
## SampleName cell dex albut Run avgLength Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
## 1 GSM1275866 N052611 untrt untrt SRR10395… 126 SRX384349 SRS50… SAMN0242…
## 2 GSM1275867 N052611 trt untrt SRR10395… 87 SRX384350 SRS50… SAMN0242…
## 3 GSM1275874 N061011 untrt untrt SRR10395… 101 SRX384357 SRS50… SAMN0242…
## 4 GSM1275875 N061011 trt untrt SRR10395… 98 SRX384358 SRS50… SAMN0242…
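Filtering with %in% gives the same rows as chaining == comparisons with |, and it can be negated with !. The sketch below uses a hypothetical toy tibble that borrows the cell line names from smeta.

```r
library(dplyr)

# Toy stand-in for smeta (one row per cell line, hypothetical)
toy <- tibble(
  cell = c("N61311", "N052611", "N080611", "N061011"),
  dex  = c("trt", "untrt", "trt", "untrt")
)

keep <- c("N061011", "N052611")

with_in <- filter(toy, cell %in% keep)
with_or <- filter(toy, cell == "N061011" | cell == "N052611")

# Same rows either way
identical(with_in, with_or)

# Negating %in% returns everything that is not in the set
without <- filter(toy, !cell %in% keep)
without$cell
```

The %in% form scales better as the list of values grows, since the values live in a single vector.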
Including multiple phrases
#use the `|` operator
#look at only results with named genes (not NAs)
#and those with an absolute log fold change greater than 2
#and either a p-value or an FDR corrected p-value < or = to 0.01
#The comma acts as &
sig_annot_transcripts<-
  filter(dexp, !is.na(transcript),
         abs(logFC) > 2, (PValue <= 0.01 | FDR <= 0.01))
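When two columns are compared against the same threshold with |, each side of the | needs its own complete comparison. The sketch below shows what goes wrong otherwise; the toy tibble and its values are hypothetical.

```r
library(dplyr)

# Toy results table (hypothetical values)
toy <- tibble(
  gene   = c("A", "B", "C"),
  PValue = c(0.5, 0.005, 0.2),
  FDR    = c(0.6, 0.02, 0.3)
)

# Each side of | is a full comparison: only gene B passes
either <- filter(toy, PValue <= 0.01 | FDR <= 0.01)

# Pitfall: a bare numeric column is coerced to logical (non-zero means TRUE),
# so this condition is TRUE for every row with a non-zero p-value
pitfall <- filter(toy, PValue | FDR <= 0.01)

either$gene
nrow(pitfall)
```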
Filtering across columns
Past versions of dplyr included powerful variants of filter, select, and other functions to help perform tasks across columns. You may see functions such as filter_all, filter_if, and filter_at. Functions like these can still be used but have been superseded by across. However, using across inside filter has been deprecated and replaced by if_any() and if_all().
Both functions operate similarly to across() but go the extra mile of aggregating the results to indicate if all the results are true when using if_all(), or if at least one is true when using if_any() ---tidyverse.org
Let's briefly see this in action.
f<-filter(dexp, if_all(PValue:FDR, ~ . < 0.05))
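The difference between if_all() and if_any() is easiest to see on a few rows. A minimal sketch with a hypothetical toy tibble (invented values):

```r
library(dplyr)

# Toy results table (hypothetical values)
toy <- tibble(
  gene   = c("A", "B", "C"),
  PValue = c(0.001, 0.03, 0.2),
  FDR    = c(0.01, 0.08, 0.4)
)

# if_all(): every selected column must pass, so only gene A is kept
all_sig <- filter(toy, if_all(PValue:FDR, ~ . < 0.05))

# if_any(): at least one selected column must pass, so A and B are kept
any_sig <- filter(toy, if_any(PValue:FDR, ~ . < 0.05))

all_sig$gene
any_sig$gene
```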
Anonymous functions
The code above includes an anonymous function. Read more here. You may also find this stackoverflow post useful.
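As a rough guide, the formula shorthand `~ . < 0.05` does the same job as an ordinary function written inline or defined beforehand. The sketch below compares the three spellings on a hypothetical toy tibble; the helper name is_small is made up for illustration.

```r
library(dplyr)

# Toy results table (hypothetical values)
toy <- tibble(
  gene   = c("A", "B", "C"),
  PValue = c(0.001, 0.03, 0.2),
  FDR    = c(0.01, 0.08, 0.4)
)

# 1. A named function defined ahead of time
is_small <- function(x) x < 0.05
f_named <- filter(toy, if_all(c(PValue, FDR), is_small))

# 2. An anonymous function written inline
f_anon <- filter(toy, if_all(c(PValue, FDR), function(x) x < 0.05))

# 3. The formula shorthand, where . (or .x) stands for the column
f_formula <- filter(toy, if_all(c(PValue, FDR), ~ .x < 0.05))

identical(f_named, f_anon)
identical(f_anon, f_formula)
```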
Subsetting rows by position
There are times when you may want to subset your data by position, for example, the first or last number of rows. There are a series of functions in the tidyverse that facilitate this type of subsetting. The primary function is slice(), which has several commonly used helper functions including slice_head(), slice_tail(), slice_min(), and slice_max(). See the slice() documentation for more information.
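A quick sketch of the slice family on a hypothetical toy tibble (invented values):

```r
library(dplyr)

# Toy results table (hypothetical values)
toy <- tibble(
  gene  = c("A", "B", "C", "D"),
  logFC = c(0.5, -2.1, 3.2, 1.0)
)

rows_2_3  <- slice(toy, 2:3)              # rows 2 and 3 by position
first_two <- slice_head(toy, n = 2)       # first two rows
last_one  <- slice_tail(toy, n = 1)       # last row
largest   <- slice_max(toy, logFC, n = 2) # two largest logFC values
smallest  <- slice_min(toy, logFC, n = 1) # smallest logFC value

first_two$gene
largest$gene
smallest$gene
```

slice_head() and slice_tail() depend on the current row order, while slice_min() and slice_max() depend on the values of the chosen column.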
Introducing the pipe
Often we will apply multiple functions to wrangle a data frame into the state that we need it. For example, maybe you want to select and filter. What are our options? We could run one step after another, saving an object for each step, or we could nest a function within a function, but these can affect code readability and clutter our work space, making it difficult to follow what we or someone else did.
For example,
#Run one step at a time with intermediate objects.
#We've done this a few times above
#select gene, logFC, FDR
dexp_s<-select(dexp, transcript, logFC, FDR)
#Now filter for only the genes "TSPAN6" and "DPM1"
#Note: we could have used %in%
tspanDpm<- filter(dexp_s, transcript == "TSPAN6" | transcript=="DPM1")
#Nested code example
tspanDpm<- filter(select(dexp, c(transcript, logFC, FDR)),
transcript == "TSPAN6" | transcript=="DPM1" )
Let's explore how piping streamlines this. Piping (using %>%) allows you to employ multiple functions consecutively, while improving readability. The output of one function is passed directly to another without storing the intermediate steps as objects. You can pipe from the beginning (reading in the data) all the way to plotting without storing the data or intermediate objects, if you want. Pipes in R come from the magrittr package, which is a dependency of dplyr.
Pipe
Read more info about the magrittr pipe here. There is also a native R pipe, |>, as of R 4.1.0. Read more about the difference between %>% and |> here.
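For simple chains like the ones in this lesson, the two pipes can be used interchangeably. The sketch below assumes R 4.1.0 or later (needed for |>) and uses the built-in mtcars data.

```r
library(dplyr)

# magrittr pipe
with_magrittr <- mtcars %>%
  filter(cyl == 4) %>%
  select(mpg, cyl)

# native pipe (requires R >= 4.1.0)
with_native <- mtcars |>
  filter(cyl == 4) |>
  select(mpg, cyl)

identical(with_magrittr, with_native)
```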
To pipe, we have to first call the data and then pipe it into a function. The output of each step is then piped into the next step.
Let's see how this works.
tspanDpm <- dexp %>% #call the data and pipe to select()
select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) #filter
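Notice that the data argument has been dropped from select() and filter(). This is because the pipe passes the result on its left as the first argument of the function on its right. When a pipeline spans several lines, each %>% must come at the end of a line, not at the start of the next one.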
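Put another way, x %>% f(y) is evaluated as f(x, y). A small sketch using the built-in mtcars data shows that a piped chain and its nested equivalent return the same result:

```r
library(dplyr)

# Piped: read top to bottom
piped <- mtcars %>%
  select(mpg, cyl) %>%   # becomes select(mtcars, mpg, cyl)
  filter(cyl == 6)       # becomes filter(<result of select>, cyl == 6)

# Nested: the same calls, read from the inside out
nested <- filter(select(mtcars, mpg, cyl), cyl == 6)

identical(piped, nested)
```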
Piping from the beginning:
read_delim("./data/diffexp_results_edger_airways.txt") %>% #read data
select(transcript, logFC, FDR) %>% #select columns of interest
filter(transcript == "TSPAN6" | transcript=="DPM1" ) %>% #filter
ggplot(aes(x=transcript,y=logFC,fill=FDR)) + #plot
geom_bar(stat = "identity") +
theme_classic() +
geom_hline(yintercept=0, linetype="dashed", color = "black")
## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames. ---datacarpentry.org
Reordering rows
There are many steps that can be taken following subsetting (i.e., filtering by rows and columns), one of which is reordering rows. In the tidyverse, reordering rows is largely done by arrange(). arrange() reorders rows by the values of a variable from smallest to largest or, in the case of characters, alphabetically from a to z.
Let's arrange the genes in dexp.
dexp %>% arrange(transcript)
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG0000… untrt A1BG-AS1 hg38 TRUE 0.513 1.02 9.22 1.45e-2
## 2 ENSG0000… untrt A2M hg38 TRUE 0.528 10.1 3.57 9.24e-2
## 3 ENSG0000… untrt A2M-AS1 hg38 TRUE -0.337 0.308 2.76 1.32e-1
## 4 ENSG0000… untrt A4GALT hg38 TRUE 0.519 5.89 24.5 8.54e-4
## 5 ENSG0000… untrt AAAS hg38 TRUE -0.0254 5.12 0.134 7.23e-1
## 6 ENSG0000… untrt AACS hg38 TRUE -0.191 4.06 5.00 5.30e-2
## 7 ENSG0000… untrt AADAT hg38 TRUE -0.642 2.67 16.9 2.76e-3
## 8 ENSG0000… untrt AAGAB hg38 TRUE -0.165 5.08 5.82 3.98e-2
## 9 ENSG0000… untrt AAK1 hg38 TRUE -0.188 3.82 2.29 1.66e-1
## 10 ENSG0000… untrt AAMDC hg38 TRUE 0.447 2.42 8.52 1.75e-2
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
Let's arrange logFC from smallest to largest.
dexp %>% arrange(logFC)
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000002… untrt LINC00906 hg38 TRUE -4.59 0.473 139. 1.13e-6
## 2 ENSG000001… untrt LRRTM2 hg38 TRUE -4.00 1.24 127. 1.64e-6
## 3 ENSG000001… untrt VASH2 hg38 TRUE -3.95 0.0171 152. 7.77e-7
## 4 ENSG000001… untrt VCAM1 hg38 TRUE -3.66 4.60 565. 2.87e-9
## 5 ENSG000001… untrt SLC14A1 hg38 TRUE -3.63 1.38 42.3 1.25e-4
## 6 ENSG000002… untrt FER1L6 hg38 TRUE -3.13 3.53 238. 1.18e-7
## 7 ENSG000001… untrt SMTNL2 hg38 TRUE -3.12 1.46 134. 1.29e-6
## 8 ENSG000001… untrt WNT2 hg38 TRUE -3.07 3.99 521. 4.09e-9
## 9 ENSG000001… untrt EGR2 hg38 TRUE -3.04 -0.141 96.1 5.11e-6
## 10 ENSG000001… untrt SLITRK6 hg38 TRUE -3.03 1.16 130. 1.46e-6
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
What if we want to arrange from largest to smallest? We can use desc().
dexp %>% arrange(desc(logFC))
## # A tibble: 15,926 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000… untrt ALOX15B hg38 TRUE 10.1 1.62 554. 5.92e- 7
## 2 ENSG00000… untrt ZBTB16 hg38 TRUE 7.15 4.15 1429. 5.11e-11
## 3 ENSG00000… untrt <NA> <NA> TRUE 6.17 1.35 380. 1.58e- 8
## 4 ENSG00000… untrt ANGPTL7 hg38 TRUE 5.68 3.51 483. 5.66e- 9
## 5 ENSG00000… untrt STEAP4 hg38 TRUE 5.22 3.66 445. 8.07e- 9
## 6 ENSG00000… untrt PRODH hg38 TRUE 4.85 1.29 253. 9.10e- 8
## 7 ENSG00000… untrt FAM107A hg38 TRUE 4.74 2.78 656. 1.51e- 9
## 8 ENSG00000… untrt LGI3 hg38 TRUE 4.68 -0.0503 106. 3.45e- 6
## 9 ENSG00000… untrt SPARCL1 hg38 TRUE 4.56 5.53 721. 1.00e- 9
## 10 ENSG00000… untrt KLF15 hg38 TRUE 4.48 4.69 479. 5.86e- 9
## # ℹ 15,916 more rows
## # ℹ 1 more variable: FDR <dbl>
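arrange() also accepts several columns, in which case later columns break ties in earlier ones, and missing values are placed at the end even inside desc(). A minimal sketch on a hypothetical toy tibble (invented values):

```r
library(dplyr)

# Toy results table (hypothetical values)
toy <- tibble(
  gene  = c("A", "B", "C", "D"),
  group = c("x", "y", "x", "y"),
  logFC = c(1.5, NA, -0.2, 3.0)
)

# Sort by group first, then by logFC from largest to smallest within each group
by_group <- arrange(toy, group, desc(logFC))

by_group$gene
```

Within group y, gene D comes before gene B because the missing logFC value is sorted last.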
Acknowledgments
Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Chapter 3, Wrangling Data in the Tidyverse, from Tidyverse Skills for Data Science and Suzan Baert's dplyr tutorials.