Select and Filter

All solutions use the pipe. Most questions have more than one valid solution; one possibility is shown for each.

Q1. Import the file "./data/filtlowabund_scaledcounts_airways.txt" and save it to an object named sc. Create a data frame from sc that includes only the columns sample, cell, dex, transcript, and counts_scaled, and only the rows where the treatment (dex) is "untrt" and the transcript is either "ACTN1" or "ANAPC4".

Q1 Solution
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

sc <- read_delim("../data/filtlowabund_scaledcounts_airways.txt")
## Rows: 127408 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (11): feature, SampleName, cell, dex, albut, Run, Experiment, Sample, Bi...
## dbl  (6): sample, counts, avgLength, TMM, multiplier, counts_scaled
## lgl  (1): .abundant
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cnames <- c('sample', 'cell', 'dex', 'transcript', 'counts_scaled')

sc %>%
  select(all_of(cnames)) %>%
  filter(dex == "untrt" & transcript %in% c("ACTN1", "ANAPC4"))
## # A tibble: 8 × 5
##   sample cell    dex   transcript counts_scaled
##    <dbl> <chr>   <chr> <chr>              <dbl>
## 1    508 N61311  untrt ANAPC4              777.
## 2    508 N61311  untrt ACTN1             14410.
## 3    512 N052611 untrt ANAPC4              786.
## 4    512 N052611 untrt ACTN1             16644.
## 5    516 N080611 untrt ANAPC4              709.
## 6    516 N080611 untrt ACTN1             15805.
## 7    520 N061011 untrt ANAPC4              827.
## 8    520 N061011 untrt ACTN1             16015.
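Since solutions have multiple possibilities, here is a minimal sketch of two equivalent ways to write the Q1 pipeline. It uses a small made-up tibble (toy_sc, with invented values) rather than the real sc, and it assumes only that dplyr is installed. Conditions separated by commas inside filter() are combined with AND, so they behave like `&`, and the order of select() and filter() does not matter as long as the filtering columns are still present.

```r
library(dplyr)

# Hypothetical toy data standing in for sc (values are made up)
toy_sc <- tibble(
  sample        = c(508, 508, 509, 509),
  dex           = c("untrt", "untrt", "trt", "trt"),
  transcript    = c("ACTN1", "TSPAN6", "ACTN1", "ANAPC4"),
  counts_scaled = c(10, 20, 30, 40)
)

# Variant 1: filter first, with comma-separated conditions (implicit AND)
by_comma <- toy_sc %>%
  filter(dex == "untrt", transcript %in% c("ACTN1", "ANAPC4")) %>%
  select(sample, dex, transcript, counts_scaled)

# Variant 2: select first, then filter with an explicit `&`
by_amp <- toy_sc %>%
  select(sample, dex, transcript, counts_scaled) %>%
  filter(dex == "untrt" & transcript %in% c("ACTN1", "ANAPC4"))

# Both variants keep the same single row (sample 508, ACTN1)
all.equal(by_comma, by_amp)
```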

Q2. Using dexp (imported from "./data/diffexp_results_edger_airways.txt"), create a data frame containing the top 5 differentially expressed genes and save it to an object named top5. Top genes in this case are those with the smallest FDR-corrected p-values among genes whose absolute log fold change is greater than 2. See dplyr::slice().

Q2 Solution
dexp <- read_delim("../data/diffexp_results_edger_airways.txt")
## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
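top5 <- dexp %>%
  # keep genes with an absolute log fold change above 2
  dplyr::filter(abs(logFC) > 2) %>%
  # take the 5 rows with the smallest FDR; with_ties = FALSE caps the result at exactly 5
  slice_min(order_by = FDR, n = 5, with_ties = FALSE)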
top5
## # A tibble: 5 × 10
##   feature      albut transcript ref_genome .abundant logFC logCPM     F   PValue
##   <chr>        <chr> <chr>      <chr>      <lgl>     <dbl>  <dbl> <dbl>    <dbl>
## 1 ENSG0000010… untrt ZBTB16     hg38       TRUE       7.15   4.15 1429. 5.11e-11
## 2 ENSG0000016… untrt CACNB2     hg38       TRUE       3.28   4.51 1575. 3.34e-11
## 3 ENSG0000012… untrt DUSP1      hg38       TRUE       2.94   7.31  694. 1.18e- 9
## 4 ENSG0000014… untrt PRSS35     hg38       TRUE      -2.76   3.91  807. 6.16e-10
## 5 ENSG0000015… untrt SPARCL1    hg38       TRUE       4.56   5.53  721. 1.00e- 9
## # ℹ 1 more variable: FDR <dbl>
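As another possibility, slice_min() can be replaced by sorting with arrange() and taking the first rows with slice_head(). The sketch below illustrates this on a made-up tibble (toy_dexp, with invented logFC and FDR values) and takes the top 2 rather than the top 5 to keep the example small.

```r
library(dplyr)

# Hypothetical toy data standing in for dexp (values are made up)
toy_dexp <- tibble(
  transcript = c("A", "B", "C", "D", "E"),
  logFC      = c(3.1, -2.5, 0.4, 2.2, -4.0),
  FDR        = c(0.001, 0.0005, 0.00001, 0.02, 0.003)
)

# Approach used in the solution: filter, then slice_min()
top2_slice <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  slice_min(order_by = FDR, n = 2, with_ties = FALSE)

# Alternative: sort ascending by FDR, then keep the first 2 rows
top2_arrange <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  arrange(FDR) %>%
  slice_head(n = 2)

# Gene C has the smallest FDR overall but is excluded by the logFC filter
top2_slice$transcript
top2_arrange$transcript
```

When FDR values are tied, the two approaches may break ties differently from what you expect, so it is worth checking the result if the cutoff falls on a tie.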

Q3. Filter sc to contain only the top 5 differentially expressed genes.

Q3 Solution
sc %>% dplyr::filter(transcript %in% top5$transcript)
## # A tibble: 40 × 18
##    feature sample counts SampleName cell  dex   albut Run   avgLength Experiment
##    <chr>    <dbl>  <dbl> <chr>      <chr> <chr> <chr> <chr>     <dbl> <chr>     
##  1 ENSG00…    508      4 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  2 ENSG00…    508    665 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  3 ENSG00…    508    330 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  4 ENSG00…    508     62 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  5 ENSG00…    508     80 GSM1275862 N613… untrt untrt SRR1…       126 SRX384345 
##  6 ENSG00…    509    739 GSM1275863 N613… trt   untrt SRR1…       126 SRX384346 
##  7 ENSG00…    509   5020 GSM1275863 N613… trt   untrt SRR1…       126 SRX384346 
##  8 ENSG00…    509     41 GSM1275863 N613… trt   untrt SRR1…       126 SRX384346 
##  9 ENSG00…    509   2040 GSM1275863 N613… trt   untrt SRR1…       126 SRX384346 
## 10 ENSG00…    509    731 GSM1275863 N613… trt   untrt SRR1…       126 SRX384346 
## # ℹ 30 more rows
## # ℹ 8 more variables: Sample <chr>, BioSample <chr>, transcript <chr>,
## #   ref_genome <chr>, .abundant <lgl>, TMM <dbl>, multiplier <dbl>,
## #   counts_scaled <dbl>
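A filtering join is another way to keep only rows whose key appears in a second table. The following sketch, on made-up tibbles (toy_counts and toy_top are hypothetical stand-ins for sc and top5), compares `%in%` with dplyr's semi_join(), which keeps the rows of the first table that have a match in the second without adding any columns.

```r
library(dplyr)

# Hypothetical stand-ins for sc and top5 (values are made up)
toy_counts <- tibble(
  transcript = c("A", "B", "C", "A", "B", "C"),
  sample     = c(1, 1, 1, 2, 2, 2),
  counts     = c(5, 0, 12, 7, 1, 9)
)
toy_top <- tibble(transcript = c("A", "C"))

# Approach used in the solution: membership test with %in%
via_in <- toy_counts %>%
  filter(transcript %in% toy_top$transcript)

# Alternative: semi_join() keeps rows of toy_counts with a match in toy_top
via_join <- toy_counts %>%
  semi_join(toy_top, by = "transcript")

all.equal(via_in, via_join)
```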

Q4. Select only columns of type character from sc.

Q4 Solution
sc %>% select(where(is.character))
## # A tibble: 127,408 × 11
##    feature        SampleName cell  dex   albut Run   Experiment Sample BioSample
##    <chr>          <chr>      <chr> <chr> <chr> <chr> <chr>      <chr>  <chr>    
##  1 ENSG000000000… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  2 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  3 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  4 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  5 ENSG000000009… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  6 ENSG000000010… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  7 ENSG000000010… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  8 ENSG000000011… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
##  9 ENSG000000014… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
## 10 ENSG000000014… GSM1275862 N613… untrt untrt SRR1… SRX384345  SRS50… SAMN0242…
## # ℹ 127,398 more rows
## # ℹ 2 more variables: transcript <chr>, ref_genome <chr>

Q5. Select all columns from dexp except .abundant and PValue. Keep only rows with FDR less than or equal to 0.01.

Q5 Solution
dexp %>%
  select(-c(.abundant, PValue)) %>%
  filter(FDR <= 0.01)
## # A tibble: 2,763 × 8
##    feature         albut transcript ref_genome  logFC logCPM     F       FDR
##    <chr>           <chr> <chr>      <chr>       <dbl>  <dbl> <dbl>     <dbl>
##  1 ENSG00000000003 untrt TSPAN6     hg38       -0.390  5.06   32.8 0.00283  
##  2 ENSG00000000971 untrt CFH        hg38        0.417  8.09   29.3 0.00376  
##  3 ENSG00000001167 untrt NFYA       hg38       -0.509  4.13   44.9 0.00126  
##  4 ENSG00000002834 untrt LASP1      hg38        0.388  8.39   22.7 0.00722  
##  5 ENSG00000003096 untrt KLHL13     hg38       -0.949  4.16   84.8 0.000234 
##  6 ENSG00000003402 untrt CFLAR      hg38        1.18   6.90  130.  0.0000800
##  7 ENSG00000003987 untrt MTMR7      hg38        0.993  0.341  24.7 0.00585  
##  8 ENSG00000004059 untrt ARF5       hg38        0.358  5.84   30.9 0.00328  
##  9 ENSG00000004487 untrt KDM1A      hg38       -0.308  5.86   23.5 0.00663  
## 10 ENSG00000004700 untrt RECQL      hg38        0.360  5.60   22.7 0.00721  
## # ℹ 2,753 more rows

Q6. Import the file "./data/airway_rawcount.csv". Use the function rename() to rename the first column. Use the pipe to import and rename successively without intermediate steps or function nesting. Save to an object named acount.

Q6 Solution
acount <- read_csv("../data/airway_rawcount.csv") %>%
  dplyr::rename(Feature = ...1)
## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
acount
## # A tibble: 64,102 × 9
##    Feature     SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 ENSG000000…        679        448        873        408       1138       1047
##  2 ENSG000000…          0          0          0          0          0          0
##  3 ENSG000000…        467        515        621        365        587        799
##  4 ENSG000000…        260        211        263        164        245        331
##  5 ENSG000000…         60         55         40         35         78         63
##  6 ENSG000000…          0          0          2          0          1          0
##  7 ENSG000000…       3251       3679       6177       4252       6721      11027
##  8 ENSG000000…       1433       1062       1733        881       1424       1439
##  9 ENSG000000…        519        380        595        493        820        714
## 10 ENSG000000…        394        236        464        175        658        584
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>

Q7. Use filter on the object acount to keep only genes that had a count greater than 10 in at least one sample.

Q7 Solution
acount %>%
  filter(if_any(where(is.numeric), ~ .x > 10))
## # A tibble: 17,792 × 9
##    Feature     SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 ENSG000000…        679        448        873        408       1138       1047
##  2 ENSG000000…        467        515        621        365        587        799
##  3 ENSG000000…        260        211        263        164        245        331
##  4 ENSG000000…         60         55         40         35         78         63
##  5 ENSG000000…       3251       3679       6177       4252       6721      11027
##  6 ENSG000000…       1433       1062       1733        881       1424       1439
##  7 ENSG000000…        519        380        595        493        820        714
##  8 ENSG000000…        394        236        464        175        658        584
##  9 ENSG000000…        172        168        264        118        241        210
## 10 ENSG000000…       2112       1867       5137       2657       2735       2751
## # ℹ 17,782 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
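The difference between "at least one sample" and "every sample" comes down to if_any() versus if_all(). The sketch below shows both on a made-up count table (toy_acount, with invented values), assuming dplyr 1.0.4 or later, where these helpers are available.

```r
library(dplyr)

# Hypothetical toy count table (values are made up)
toy_acount <- tibble(
  Feature = c("g1", "g2", "g3"),
  s1      = c(0, 15, 20),
  s2      = c(3, 2, 30)
)

# if_any(): keep a gene if ANY numeric column exceeds 10
any_gt10 <- toy_acount %>%
  filter(if_any(where(is.numeric), ~ .x > 10))

# if_all(): keep a gene only if ALL numeric columns exceed 10
all_gt10 <- toy_acount %>%
  filter(if_all(where(is.numeric), ~ .x > 10))

any_gt10$Feature  # g2 and g3 each have at least one count above 10
all_gt10$Feature  # only g3 is above 10 in both samples
```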

Q8. Challenge Question: Remove genes from acount that had a total count of less than ten across all samples. Hint: Use column_to_rownames() and look up rowSums(). For an alternative solution, check out the documentation on row-wise operations.

Q8 Solution
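# Keep genes whose total count is at least 10, i.e. drop totals below 10
f_acount <- acount %>%
  column_to_rownames("Feature") %>%
  filter(rowSums(.) >= 10)

# Alternatively, keep Feature as a column and sum only the numeric columns

f_acount2 <- acount %>%
  filter(rowSums(pick(where(is.numeric))) >= 10)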
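The row-wise alternative mentioned in the question can be sketched as follows. This example uses a made-up count table (toy_acount, with invented values) and removes genes whose total is less than ten, as the question states, by keeping totals of at least ten. It assumes dplyr 1.1.0 or later for pick(). rowwise() makes sum() operate within each row, and c_across() gathers the selected columns of that row into a vector.

```r
library(dplyr)

# Hypothetical toy count table (values are made up)
toy_acount <- tibble(
  Feature = c("g1", "g2", "g3"),
  s1      = c(0, 4, 20),
  s2      = c(3, 6, 30)
)

# Vectorised approach: rowSums() over the numeric columns
kept_rowsums <- toy_acount %>%
  filter(rowSums(pick(where(is.numeric))) >= 10)

# Row-wise approach: compute each gene's total one row at a time
kept_rowwise <- toy_acount %>%
  rowwise() %>%
  filter(sum(c_across(where(is.numeric))) >= 10) %>%
  ungroup()

# g1 totals 3 and is removed; g2 totals exactly 10 and is kept
kept_rowsums$Feature
kept_rowwise$Feature
```

On a table the size of acount, the rowSums() version is likely to be noticeably faster, because rowwise() evaluates the expression once per row.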