Subsetting Data with dplyr

Objectives

Today we will begin to wrangle data using the tidyverse package, dplyr. To this end, you will learn:

how to filter data frames using dplyr
how to employ the pipe (%>% or |>) operator to link functions

To get started with this lesson, you will first need to connect to RStudio on Biowulf. To connect to NIH HPC Open OnDemand, you must be on the NIH network. Use the following website to connect: https://hpcondemand.nih.gov/. Then follow the instructions outlined here.

What is `dplyr`?

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

mutate() adds new variables that are functions of existing variables
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows. - dplyr.tidyverse.org

dplyr is also used to combine data tables sharing common IDs and to manipulate data in data backends.

Loading `dplyr`

We do not need to load the dplyr package separately, as it is a core tidyverse package. If you need to install and load only dplyr, use install.packages("dplyr") and library(dplyr).

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Importing data

For this lesson, we will use sample metadata and differential expression results derived from the airway RNA-Seq project. See here for instructions on accessing the data.

Let's begin by importing the data.

#sample information  
smeta<-read_delim("./data/airway_sampleinfo.txt")

Rows: 8 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (8): SampleName, cell, dex, albut, Run, Experiment, Sample, BioSample
dbl (1): avgLength

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

smeta

# A tibble: 8 × 9
  SampleName cell    dex   albut Run       avgLength Experiment Sample BioSample
  <chr>      <chr>   <chr> <chr> <chr>         <dbl> <chr>      <chr>  <chr>    
1 GSM1275862 N61311  untrt untrt SRR10395…       126 SRX384345  SRS50… SAMN0242…
2 GSM1275863 N61311  trt   untrt SRR10395…       126 SRX384346  SRS50… SAMN0242…
3 GSM1275866 N052611 untrt untrt SRR10395…       126 SRX384349  SRS50… SAMN0242…
4 GSM1275867 N052611 trt   untrt SRR10395…        87 SRX384350  SRS50… SAMN0242…
5 GSM1275870 N080611 untrt untrt SRR10395…       120 SRX384353  SRS50… SAMN0242…
6 GSM1275871 N080611 trt   untrt SRR10395…       126 SRX384354  SRS50… SAMN0242…
7 GSM1275874 N061011 untrt untrt SRR10395…       101 SRX384357  SRS50… SAMN0242…
8 GSM1275875 N061011 trt   untrt SRR10395…        98 SRX384358  SRS50… SAMN0242…

#let's use our differential expression results
dexp<-read_delim("./data/diffexp_results_edger_airways.txt")

Rows: 15926 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dexp

# A tibble: 15,926 × 10
   feature  albut transcript ref_genome .abundant   logFC logCPM       F  PValue
   <chr>    <chr> <chr>      <chr>      <lgl>       <dbl>  <dbl>   <dbl>   <dbl>
 1 ENSG000… untrt TSPAN6     hg38       TRUE      -0.390    5.06 32.8    3.12e-4
 2 ENSG000… untrt DPM1       hg38       TRUE       0.198    4.61  6.90   2.81e-2
 3 ENSG000… untrt SCYL3      hg38       TRUE       0.0292   3.48  0.0969 7.63e-1
 4 ENSG000… untrt C1orf112   hg38       TRUE      -0.124    1.47  0.377  5.55e-1
 5 ENSG000… untrt CFH        hg38       TRUE       0.417    8.09 29.3    4.63e-4
 6 ENSG000… untrt FUCA2      hg38       TRUE      -0.250    5.91 14.9    4.05e-3
 7 ENSG000… untrt GCLC       hg38       TRUE      -0.0581   4.84  0.167  6.92e-1
 8 ENSG000… untrt NFYA       hg38       TRUE      -0.509    4.13 44.9    1.00e-4
 9 ENSG000… untrt STPG1      hg38       TRUE      -0.136    3.12  1.04   3.35e-1
10 ENSG000… untrt NIPAL3     hg38       TRUE      -0.0500   7.04  0.350  5.69e-1
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>

We can get an idea of the structure of these data by using str() or glimpse(). glimpse(), from tidyverse, is similar to str() but provides somewhat cleaner output.

glimpse(smeta)

Rows: 8
Columns: 9
$ SampleName <chr> "GSM1275862", "GSM1275863", "GSM1275866", "GSM1275867", "GS…
$ cell       <chr> "N61311", "N61311", "N052611", "N052611", "N080611", "N0806…
$ dex        <chr> "untrt", "trt", "untrt", "trt", "untrt", "trt", "untrt", "t…
$ albut      <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "untr…
$ Run        <chr> "SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513", "SR…
$ avgLength  <dbl> 126, 126, 126, 87, 120, 126, 101, 98
$ Experiment <chr> "SRX384345", "SRX384346", "SRX384349", "SRX384350", "SRX384…
$ Sample     <chr> "SRS508568", "SRS508567", "SRS508571", "SRS508572", "SRS508…
$ BioSample  <chr> "SAMN02422669", "SAMN02422675", "SAMN02422678", "SAMN024226…

glimpse(dexp)

Rows: 15,926
Columns: 10
$ feature    <chr> "ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "E…
$ albut      <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt", "untr…
$ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2", "GCL…
$ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg…
$ .abundant  <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
$ logFC      <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022, 0.417…
$ logCPM     <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.909668,…
$ F          <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-01, 2.9…
$ PValue     <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.5546956332, 0.0…
$ FDR        <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, 0.00376…

Always know how your data is structured.

Before you do anything with your data, always check out the structure of your data to avoid surprises.

Now that we have some data to work with, let's start subsetting.

Subsetting data in base R

If you remember back to "Getting Started with R", Base R uses bracket notation for subsetting.

For example, if we want to subset the data frame iris to include only the first 5 rows and the first 3 columns, we could use

iris[1:5,1:3]

  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3
4          4.6         3.1          1.5
5          5.0         3.6          1.4

While this type of subsetting is useful, it is not always the most readable or easy to employ, especially for beginners. This is where dplyr comes in. The dplyr package in the tidyverse world simplifies data wrangling with easy to employ and easy to understand functions specific for data manipulation in data frames.

Subsetting with `dplyr`

How can we select only columns of interest and rows of interest? We can use select() and filter() from dplyr.

Subsetting by column (`select()`)

To subset by column, we use the function select(). We can include and exclude columns, reorder columns, and rename columns using select().

Select a few columns from our differential expression results (dexp).

We can select the columns we are interested in by first calling the data frame object (dexp) followed by the columns we want to select (transcript,logFC,FDR). All arguments are separated by a comma. Just as in Base R subsetting, the order of the columns will determine the order of the columns in the new data frame.

Let's select the transcript, logFC, and FDR corrected p-value columns:

#first argument is the df followed by columns to select
ex1<-select(dexp, transcript, logFC, FDR) 
ex1

# A tibble: 15,926 × 3
   transcript   logFC     FDR
   <chr>        <dbl>   <dbl>
 1 TSPAN6     -0.390  0.00283
 2 DPM1        0.198  0.0770 
 3 SCYL3       0.0292 0.844  
 4 C1orf112   -0.124  0.682  
 5 CFH         0.417  0.00376
 6 FUCA2      -0.250  0.0186 
 7 GCLC       -0.0581 0.794  
 8 NFYA       -0.509  0.00126
 9 STPG1      -0.136  0.478  
10 NIPAL3     -0.0500 0.695  
# ℹ 15,916 more rows

We can rename while selecting.

The syntax to rename is new_name = old_name.

#rename using the syntax new_name = old_name
ex1<-select(dexp, gene=transcript, logFoldChange = logFC, FDRpvalue=FDR)
ex1

# A tibble: 15,926 × 3
   gene     logFoldChange FDRpvalue
   <chr>            <dbl>     <dbl>
 1 TSPAN6         -0.390    0.00283
 2 DPM1            0.198    0.0770 
 3 SCYL3           0.0292   0.844  
 4 C1orf112       -0.124    0.682  
 5 CFH             0.417    0.00376
 6 FUCA2          -0.250    0.0186 
 7 GCLC           -0.0581   0.794  
 8 NFYA           -0.509    0.00126
 9 STPG1          -0.136    0.478  
10 NIPAL3         -0.0500   0.695  
# ℹ 15,916 more rows

new name	old name
gene	transcript
logFoldChange	logFC
FDRpvalue	FDR

Using rename() or rename_with()

If you want to retain all columns, you could also use rename() from dplyr to rename columns.
For example, let's rename only transcript to gene from dexp.

rename(dexp, gene=transcript)

# A tibble: 15,926 × 10
   feature       albut gene  ref_genome .abundant   logFC logCPM       F  PValue
   <chr>         <chr> <chr> <chr>      <lgl>       <dbl>  <dbl>   <dbl>   <dbl>
 1 ENSG00000000… untrt TSPA… hg38       TRUE      -0.390    5.06 32.8    3.12e-4
 2 ENSG00000000… untrt DPM1  hg38       TRUE       0.198    4.61  6.90   2.81e-2
 3 ENSG00000000… untrt SCYL3 hg38       TRUE       0.0292   3.48  0.0969 7.63e-1
 4 ENSG00000000… untrt C1or… hg38       TRUE      -0.124    1.47  0.377  5.55e-1
 5 ENSG00000000… untrt CFH   hg38       TRUE       0.417    8.09 29.3    4.63e-4
 6 ENSG00000001… untrt FUCA2 hg38       TRUE      -0.250    5.91 14.9    4.05e-3
 7 ENSG00000001… untrt GCLC  hg38       TRUE      -0.0581   4.84  0.167  6.92e-1
 8 ENSG00000001… untrt NFYA  hg38       TRUE      -0.509    4.13 44.9    1.00e-4
 9 ENSG00000001… untrt STPG1 hg38       TRUE      -0.136    3.12  1.04   3.35e-1
10 ENSG00000001… untrt NIPA… hg38       TRUE      -0.0500   7.04  0.350  5.69e-1
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>

Excluding columns

We can select all columns, leaving out ones that do not interest us using a - sign. This is helpful if the columns to keep far outweigh those to exclude. We can similarly use the ! to negate a selection.

ex2<-select(dexp, -feature) 
ex2

# A tibble: 15,926 × 9
   albut transcript ref_genome .abundant   logFC logCPM       F   PValue     FDR
   <chr> <chr>      <chr>      <lgl>       <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
 1 untrt TSPAN6     hg38       TRUE      -0.390    5.06 32.8    0.000312 0.00283
 2 untrt DPM1       hg38       TRUE       0.198    4.61  6.90   0.0281   0.0770 
 3 untrt SCYL3      hg38       TRUE       0.0292   3.48  0.0969 0.763    0.844  
 4 untrt C1orf112   hg38       TRUE      -0.124    1.47  0.377  0.555    0.682  
 5 untrt CFH        hg38       TRUE       0.417    8.09 29.3    0.000463 0.00376
 6 untrt FUCA2      hg38       TRUE      -0.250    5.91 14.9    0.00405  0.0186 
 7 untrt GCLC       hg38       TRUE      -0.0581   4.84  0.167  0.692    0.794  
 8 untrt NFYA       hg38       TRUE      -0.509    4.13 44.9    0.000100 0.00126
 9 untrt STPG1      hg38       TRUE      -0.136    3.12  1.04   0.335    0.478  
10 untrt NIPAL3     hg38       TRUE      -0.0500   7.04  0.350  0.569    0.695  
# ℹ 15,916 more rows

ex2<-select(dexp, !feature) 
ex2

# A tibble: 15,926 × 9
   albut transcript ref_genome .abundant   logFC logCPM       F   PValue     FDR
   <chr> <chr>      <chr>      <lgl>       <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
 1 untrt TSPAN6     hg38       TRUE      -0.390    5.06 32.8    0.000312 0.00283
 2 untrt DPM1       hg38       TRUE       0.198    4.61  6.90   0.0281   0.0770 
 3 untrt SCYL3      hg38       TRUE       0.0292   3.48  0.0969 0.763    0.844  
 4 untrt C1orf112   hg38       TRUE      -0.124    1.47  0.377  0.555    0.682  
 5 untrt CFH        hg38       TRUE       0.417    8.09 29.3    0.000463 0.00376
 6 untrt FUCA2      hg38       TRUE      -0.250    5.91 14.9    0.00405  0.0186 
 7 untrt GCLC       hg38       TRUE      -0.0581   4.84  0.167  0.692    0.794  
 8 untrt NFYA       hg38       TRUE      -0.509    4.13 44.9    0.000100 0.00126
 9 untrt STPG1      hg38       TRUE      -0.136    3.12  1.04   0.335    0.478  
10 untrt NIPAL3     hg38       TRUE      -0.0500   7.04  0.350  0.569    0.695  
# ℹ 15,916 more rows

We can reorder using `select()`.

For readability, let's move the transcript column to the front.

#you can reorder columns and call a range of columns using select().
ex3<-select(dexp, transcript:FDR,albut)  
ex3

# A tibble: 15,926 × 9
   transcript ref_genome .abundant   logFC logCPM       F   PValue     FDR albut
   <chr>      <chr>      <lgl>       <dbl>  <dbl>   <dbl>    <dbl>   <dbl> <chr>
 1 TSPAN6     hg38       TRUE      -0.390    5.06 32.8    0.000312 0.00283 untrt
 2 DPM1       hg38       TRUE       0.198    4.61  6.90   0.0281   0.0770  untrt
 3 SCYL3      hg38       TRUE       0.0292   3.48  0.0969 0.763    0.844   untrt
 4 C1orf112   hg38       TRUE      -0.124    1.47  0.377  0.555    0.682   untrt
 5 CFH        hg38       TRUE       0.417    8.09 29.3    0.000463 0.00376 untrt
 6 FUCA2      hg38       TRUE      -0.250    5.91 14.9    0.00405  0.0186  untrt
 7 GCLC       hg38       TRUE      -0.0581   4.84  0.167  0.692    0.794   untrt
 8 NFYA       hg38       TRUE      -0.509    4.13 44.9    0.000100 0.00126 untrt
 9 STPG1      hg38       TRUE      -0.136    3.12  1.04   0.335    0.478   untrt
10 NIPAL3     hg38       TRUE      -0.0500   7.04  0.350  0.569    0.695   untrt
# ℹ 15,916 more rows

#Note: this also would have excluded the feature column

If we are interested in moving a column without selection, we can use relocate(). We should include the columns we want to move and where we would like to put them.

relocate(dexp, transcript, .before=feature)

# A tibble: 15,926 × 10
   transcript feature  albut ref_genome .abundant   logFC logCPM       F  PValue
   <chr>      <chr>    <chr> <chr>      <lgl>       <dbl>  <dbl>   <dbl>   <dbl>
 1 TSPAN6     ENSG000… untrt hg38       TRUE      -0.390    5.06 32.8    3.12e-4
 2 DPM1       ENSG000… untrt hg38       TRUE       0.198    4.61  6.90   2.81e-2
 3 SCYL3      ENSG000… untrt hg38       TRUE       0.0292   3.48  0.0969 7.63e-1
 4 C1orf112   ENSG000… untrt hg38       TRUE      -0.124    1.47  0.377  5.55e-1
 5 CFH        ENSG000… untrt hg38       TRUE       0.417    8.09 29.3    4.63e-4
 6 FUCA2      ENSG000… untrt hg38       TRUE      -0.250    5.91 14.9    4.05e-3
 7 GCLC       ENSG000… untrt hg38       TRUE      -0.0581   4.84  0.167  6.92e-1
 8 NFYA       ENSG000… untrt hg38       TRUE      -0.509    4.13 44.9    1.00e-4
 9 STPG1      ENSG000… untrt hg38       TRUE      -0.136    3.12  1.04   3.35e-1
10 NIPAL3     ENSG000… untrt hg38       TRUE      -0.0500   7.04  0.350  5.69e-1
# ℹ 15,916 more rows
# ℹ 1 more variable: FDR <dbl>

Note

By default, relocate() will move columns to the left-hand side of the data frame.

Selecting a range of columns

Notice that we can select a range of columns using the :. We could also deselect a range of columns or deselect a range of columns while adding a column back.

ex3<-select(dexp, -(albut:F),logFC)
ex3

# A tibble: 15,926 × 4
   feature           PValue     FDR   logFC
   <chr>              <dbl>   <dbl>   <dbl>
 1 ENSG00000000003 0.000312 0.00283 -0.390 
 2 ENSG00000000419 0.0281   0.0770   0.198 
 3 ENSG00000000457 0.763    0.844    0.0292
 4 ENSG00000000460 0.555    0.682   -0.124 
 5 ENSG00000000971 0.000463 0.00376  0.417 
 6 ENSG00000001036 0.00405  0.0186  -0.250 
 7 ENSG00000001084 0.692    0.794   -0.0581
 8 ENSG00000001167 0.000100 0.00126 -0.509 
 9 ENSG00000001460 0.335    0.478   -0.136 
10 ENSG00000001461 0.569    0.695   -0.0500
# ℹ 15,916 more rows

Helper functions

We can also include helper functions such as starts_with() and ends_with(), and operators (!, &, |) for combining selections.

select(dexp, transcript, starts_with("log"), FDR)

# A tibble: 15,926 × 4
   transcript   logFC logCPM     FDR
   <chr>        <dbl>  <dbl>   <dbl>
 1 TSPAN6     -0.390    5.06 0.00283
 2 DPM1        0.198    4.61 0.0770 
 3 SCYL3       0.0292   3.48 0.844  
 4 C1orf112   -0.124    1.47 0.682  
 5 CFH         0.417    8.09 0.00376
 6 FUCA2      -0.250    5.91 0.0186 
 7 GCLC       -0.0581   4.84 0.794  
 8 NFYA       -0.509    4.13 0.00126
 9 STPG1      -0.136    3.12 0.478  
10 NIPAL3     -0.0500   7.04 0.695  
# ℹ 15,916 more rows

#or 
select(dexp, transcript, starts_with("log") | ends_with("r"))

# A tibble: 15,926 × 4
   transcript   logFC logCPM     FDR
   <chr>        <dbl>  <dbl>   <dbl>
 1 TSPAN6     -0.390    5.06 0.00283
 2 DPM1        0.198    4.61 0.0770 
 3 SCYL3       0.0292   3.48 0.844  
 4 C1orf112   -0.124    1.47 0.682  
 5 CFH         0.417    8.09 0.00376
 6 FUCA2      -0.250    5.91 0.0186 
 7 GCLC       -0.0581   4.84 0.794  
 8 NFYA       -0.509    4.13 0.00126
 9 STPG1      -0.136    3.12 0.478  
10 NIPAL3     -0.0500   7.04 0.695  
# ℹ 15,916 more rows

There are a number of other selection helpers. See the help documentation for select for more information (?dplyr::select()) or this reference from tidyselect.

Select columns of a particular type

There are many other ways to select multiple columns. You may commonly be interested in selecting all numeric columns or all factors. The syntax below can be used for this purpose.

select(dexp, where(is.numeric)) #or

# A tibble: 15,926 × 5
     logFC logCPM       F   PValue     FDR
     <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
 1 -0.390    5.06 32.8    0.000312 0.00283
 2  0.198    4.61  6.90   0.0281   0.0770 
 3  0.0292   3.48  0.0969 0.763    0.844  
 4 -0.124    1.47  0.377  0.555    0.682  
 5  0.417    8.09 29.3    0.000463 0.00376
 6 -0.250    5.91 14.9    0.00405  0.0186 
 7 -0.0581   4.84  0.167  0.692    0.794  
 8 -0.509    4.13 44.9    0.000100 0.00126
 9 -0.136    3.12  1.04   0.335    0.478  
10 -0.0500   7.04  0.350  0.569    0.695  
# ℹ 15,916 more rows

# Not recommended
select_if(dexp, is.numeric) #scoped verbs are superseded

# A tibble: 15,926 × 5
     logFC logCPM       F   PValue     FDR
     <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
 1 -0.390    5.06 32.8    0.000312 0.00283
 2  0.198    4.61  6.90   0.0281   0.0770 
 3  0.0292   3.48  0.0969 0.763    0.844  
 4 -0.124    1.47  0.377  0.555    0.682  
 5  0.417    8.09 29.3    0.000463 0.00376
 6 -0.250    5.91 14.9    0.00405  0.0186 
 7 -0.0581   4.84  0.167  0.692    0.794  
 8 -0.509    4.13 44.9    0.000100 0.00126
 9 -0.136    3.12  1.04   0.335    0.478  
10 -0.0500   7.04  0.350  0.569    0.695  
# ℹ 15,916 more rows

Subsetting by row (`filter()`)

To subset by row, we use the function filter().

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. ---R4DS

Now let's filter the rows from smeta based on a condition. Let's look at only the treated samples in dex (i.e., trt) using the function filter(). The first argument is the data frame (e.g., smeta) followed by the expression(s) to filter the data frame.

filter(smeta, dex == "trt") #we've seen == notation before

To complete these filter phrases you will often need to include comparison operators such as the == above. These operators help us evaluate relations. For example, a == b is asking if a and b are equivalent. It is a logical comparison that when evaluated will return TRUE or FALSE. The filter function will then return rows that evaluate to TRUE.

Try the following:

a <- 1
b <- 1
a == b

[1] TRUE

Keep these comparison operators in mind for filtering.

Comparison operators

Comparison Operator	Description
>	greater than
>=	greater than or equal to
<	less than
<=	less than or equal to
!=	Not equal
==	equal
a \| b	a or b
a & b	a and b

We may want to combine filtering parameters using AND or OR phrasing and the operators & and |.

For example, if we only wanted to return rows where dex == trt and cell==N61311, we can use:

filter(smeta, dex == "trt" & cell == "N61311")

# A tibble: 1 × 9
  SampleName cell   dex   albut Run        avgLength Experiment Sample BioSample
  <chr>      <chr>  <chr> <chr> <chr>          <dbl> <chr>      <chr>  <chr>    
1 GSM1275863 N61311 trt   untrt SRR1039509       126 SRX384346  SRS50… SAMN0242…

A , is treated the same as & in the case of filter().

filter(smeta, dex == "trt", cell == "N61311")

# A tibble: 1 × 9
  SampleName cell   dex   albut Run        avgLength Experiment Sample BioSample
  <chr>      <chr>  <chr> <chr> <chr>          <dbl> <chr>      <chr>  <chr>    
1 GSM1275863 N61311 trt   untrt SRR1039509       126 SRX384346  SRS50… SAMN0242…

We can also filter by one condition or another using the |.

filter(smeta,cell == "N080611" | cell == "N61311")

# A tibble: 4 × 9
  SampleName cell    dex   albut Run       avgLength Experiment Sample BioSample
  <chr>      <chr>   <chr> <chr> <chr>         <dbl> <chr>      <chr>  <chr>    
1 GSM1275862 N61311  untrt untrt SRR10395…       126 SRX384345  SRS50… SAMN0242…
2 GSM1275863 N61311  trt   untrt SRR10395…       126 SRX384346  SRS50… SAMN0242…
3 GSM1275870 N080611 untrt untrt SRR10395…       120 SRX384353  SRS50… SAMN0242…
4 GSM1275871 N080611 trt   untrt SRR10395…       126 SRX384354  SRS50… SAMN0242…

The `%in%` operator

Used to match elements of a vector.

%in% returns a logical vector indicating if there is a match or not for its left operand. --- match R Documentation.

The returned logical vector will be the length of the vector to the left. Its basic usage:

smeta$SampleName %in% c("GSM1275871","GSM1275863")

[1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE

c("GSM1275871","GSM1275863") %in% smeta$SampleName

[1] TRUE TRUE

We can combine the %in% operator with filter().

#filter for two cell lines
filter(smeta,cell %in% c("N061011", "N052611"))

# A tibble: 4 × 9
  SampleName cell    dex   albut Run       avgLength Experiment Sample BioSample
  <chr>      <chr>   <chr> <chr> <chr>         <dbl> <chr>      <chr>  <chr>    
1 GSM1275866 N052611 untrt untrt SRR10395…       126 SRX384349  SRS50… SAMN0242…
2 GSM1275867 N052611 trt   untrt SRR10395…        87 SRX384350  SRS50… SAMN0242…
3 GSM1275874 N061011 untrt untrt SRR10395…       101 SRX384357  SRS50… SAMN0242…
4 GSM1275875 N061011 trt   untrt SRR10395…        98 SRX384358  SRS50… SAMN0242…

Including multiple phrases

We can use multiple expressions in a single call to filter(). For example, let's filter dexp to include only named transcripts (i.e.,no NAs), values of |log fold change| is greater than 2, and either a p-value or FDR corrected p_value is less than or equal to 0.01.

#use `|` operator 
#look at only results with named genes (not NAs) 
#and those with a log fold change greater than 2 
#and either a p-value or an FDR corrected p_value < or = to 0.01
#The comma acts as &
sig_annot_transcripts<-
  filter(dexp, !is.na(transcript),
         abs(logFC) > 2, (PValue | FDR <= 0.01))
sig_annot_transcripts

# A tibble: 178 × 10
   feature     albut transcript ref_genome .abundant logFC  logCPM     F  PValue
   <chr>       <chr> <chr>      <chr>      <lgl>     <dbl>   <dbl> <dbl>   <dbl>
 1 ENSG000000… untrt PDK4       hg38       TRUE       2.55  5.41    37.5 1.94e-4
 2 ENSG000000… untrt SLC7A14    hg38       TRUE      -2.89  3.54   277.  6.13e-8
 3 ENSG000000… untrt NPC1L1     hg38       TRUE      -2.61 -0.0372  63.6 2.65e-5
 4 ENSG000000… untrt CHDH       hg38       TRUE      -2.01  2.14   112.  2.77e-6
 5 ENSG000000… untrt HSD17B6    hg38       TRUE      -2.03  3.02    98.6 4.60e-6
 6 ENSG000000… untrt POU2F2     hg38       TRUE      -2.06  0.835  104.  3.65e-6
 7 ENSG000000… untrt GPM6B      hg38       TRUE       2.43  5.67   177.  4.08e-7
 8 ENSG000000… untrt PER3       hg38       TRUE      -2.21  3.22    80.5 1.04e-5
 9 ENSG000000… untrt COL11A1    hg38       TRUE       2.41  4.06   404.  1.23e-8
10 ENSG000000… untrt FGFR2      hg38       TRUE      -2.26  0.499   60.0 3.31e-5
# ℹ 168 more rows
# ℹ 1 more variable: FDR <dbl>

Filtering across columns

Past versions of dplyr included powerful variants of filter, select, and other functions to help perform tasks across columns. You may see functions such as filter_all, filter_if, and filter_at. Functions like these can still be used but have been superseded by across. However, across has been deprecated in the case of filter and replaced by if_any() and if_all().

Both functions operate similarly to across() but go the extra mile of aggregating the results to indicate if all the results are true when using if_all(), or if at least one is true when using if_any() ---tidyverse.org

Let's briefly see this in action. Let's return only rows with values of less than 0.05 in the columns PValue and FDR.

f<-filter(dexp, if_all(PValue:FDR, ~ .x < 0.05))
f

# A tibble: 4,967 × 10
   feature     albut transcript ref_genome .abundant  logFC logCPM     F  PValue
   <chr>       <chr> <chr>      <chr>      <lgl>      <dbl>  <dbl> <dbl>   <dbl>
 1 ENSG000000… untrt TSPAN6     hg38       TRUE      -0.390   5.06 32.8  3.12e-4
 2 ENSG000000… untrt CFH        hg38       TRUE       0.417   8.09 29.3  4.63e-4
 3 ENSG000000… untrt FUCA2      hg38       TRUE      -0.250   5.91 14.9  4.05e-3
 4 ENSG000000… untrt NFYA       hg38       TRUE      -0.509   4.13 44.9  1.00e-4
 5 ENSG000000… untrt SEMA3F     hg38       TRUE      -0.259   4.81 12.3  6.98e-3
 6 ENSG000000… untrt ANKIB1     hg38       TRUE      -0.236   6.38 14.5  4.41e-3
 7 ENSG000000… untrt RAD52      hg38       TRUE      -0.319   3.13  9.03 1.53e-2
 8 ENSG000000… untrt LASP1      hg38       TRUE       0.388   8.39 22.7  1.11e-3
 9 ENSG000000… untrt SNX11      hg38       TRUE       0.395   3.56 18.7  2.05e-3
10 ENSG000000… untrt TMEM176A   hg38       TRUE       0.357   4.65 12.1  7.30e-3
# ℹ 4,957 more rows
# ℹ 1 more variable: FDR <dbl>

Anonymous function

The code above includes an anonymous function. Read more here. You may also find this Stack Overflow post useful.
Therefore, the above line could have been written as follows: This function could be written like this:

my_func <- function(x) {
  x < 0.05
}

filter(dexp, if_all(PValue:FDR, my_func))

# A tibble: 4,967 × 10
  feature     albut transcript ref_genome .abundant  logFC logCPM     F  PValue
  <chr>       <chr> <chr>      <chr>      <lgl>      <dbl>  <dbl> <dbl>   <dbl>
1 ENSG000000… untrt TSPAN6     hg38       TRUE      -0.390   5.06 32.8  3.12e-4
2 ENSG000000… untrt CFH        hg38       TRUE       0.417   8.09 29.3  4.63e-4
3 ENSG000000… untrt FUCA2      hg38       TRUE      -0.250   5.91 14.9  4.05e-3
4 ENSG000000… untrt NFYA       hg38       TRUE      -0.509   4.13 44.9  1.00e-4
5 ENSG000000… untrt SEMA3F     hg38       TRUE      -0.259   4.81 12.3  6.98e-3
6 ENSG000000… untrt ANKIB1     hg38       TRUE      -0.236   6.38 14.5  4.41e-3
7 ENSG000000… untrt RAD52      hg38       TRUE      -0.319   3.13  9.03 1.53e-2
8 ENSG000000… untrt LASP1      hg38       TRUE       0.388   8.39 22.7  1.11e-3
9 ENSG000000… untrt SNX11      hg38       TRUE       0.395   3.56 18.7  2.05e-3
10 ENSG000000… untrt TMEM176A   hg38       TRUE       0.357   4.65 12.1  7.30e-3
# ℹ 4,957 more rows
# ℹ 1 more variable: FDR <dbl>

Subsetting rows by position

There are times when you may want to subset your data by position, for example, the first or last number of rows. There are a series of functions in the tidyverse that facilitate this type of subsetting. The primary function is slice(), which has several commonly used helper functions including slice_head(), slice_tail(), slice_min(), and slice_max(). See the slice() documentation for more information.

Introducing the pipe

Often we will apply multiple functions to wrangle a data frame into the state that we need it. For example, maybe you want to select and filter. What are our options? We could run one step after another, saving an object for each step, or we could nest a function within a function, but these can affect code readability and clutter our work space, making it difficult to follow what we or someone else did.

Step by Step

#Run one step at a time with intermediate objects. 
#We've done this a few times above
#select gene, logFC, FDR
dexp_s<-select(dexp, transcript, logFC, FDR)

#Now filter for only the genes "TSPAN6" and DPM1
#Note: we could have used %in%
tspanDpm<- filter(dexp_s, transcript == "TSPAN6" | transcript=="DPM1")

Nesting Code

#Nested code example
tspanDpm<- filter(select(dexp, c(transcript, logFC, FDR)), 
                  transcript == "TSPAN6" | transcript=="DPM1" )

Using the pipe (`%>%`,`|>`)

Let's explore how piping streamlines this. Piping (using %>%) allows you to employ multiple functions consecutively, while improving readability. The output of one function is passed directly to another without storing the intermediate steps as objects. You can pipe from the beginning (reading in the data) all the way to plotting without storing the data or intermediate objects, if you want. You can use either the magrittr pipe (%>%), which loads with the tidyverse, or the native R pipe (|>, R version +4.1).

!!! info %>% vs |> These pipes behave in largely the same way. However, %>% does have some special behaviors. You can read more here

To pipe, we have to first call the data and then pipe it into a function. The output of each step is then piped into the next step.

Let's see how this works

 dexp %>% #call the data and pipe to select()
  select(transcript, logFC, FDR) |> #select columns of interest 
  filter(transcript == "TSPAN6" | transcript=="DPM1" ) #filter

# A tibble: 2 × 3
  transcript  logFC     FDR
  <chr>       <dbl>   <dbl>
1 TSPAN6     -0.390 0.00283
2 DPM1        0.198 0.0770

Notice that the data argument has been dropped from select() and filter(). This is because the pipe passes the input from the left to the right. The %>% must be at the end of each line.

Piping from the beginning:

read_delim("./data/diffexp_results_edger_airways.txt") |> #read data
  select(transcript, logFC, FDR) |> #select columns of interest 
  filter(transcript == "TSPAN6" | transcript=="DPM1" ) |> #filter 
  ggplot(aes(x=transcript,y=logFC,fill=FDR)) + #plot
  geom_bar(stat = "identity") +
  theme_classic() + 
  geom_hline(yintercept=0, linetype="dashed", color = "black")

Rows: 15926 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): feature, albut, transcript, ref_genome
dbl (5): logFC, logCPM, F, PValue, FDR
lgl (1): .abundant

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note

ggplot2 will be covered in Part 3 of this course.

The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe, we can accomplish more complex manipulations of data frames. ---datacarpentry.org

Acknowledgments

Some material from this lesson was either taken directly or adapted from the Intro to R and RStudio for Genomics lesson provided by datacarpentry.org. Additional content was inspired by Chapter 3, Wrangling Data in the Tidyverse, from Tidyverse Skills for Data Science and Suzan Baert's dplyr tutorials.