Select and Filter
All solutions use the pipe. Most questions can be solved in more than one way; each solution below shows one option.
Q1. Import the file "./data/filtlowabund_scaledcounts_airways.txt" and save to an object named sc. Create a data frame from sc that only includes the columns sample, cell, dex, transcript, and counts_scaled and only rows that include the treatment "untrt" and the transcripts "ACTN1" and "ANAPC4".
Q1 Solution
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sc <- read_delim("../data/filtlowabund_scaledcounts_airways.txt")
## Rows: 127408 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (11): feature, SampleName, cell, dex, albut, Run, Experiment, Sample, Bi...
## dbl (6): sample, counts, avgLength, TMM, multiplier, counts_scaled
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cnames <- c('sample', 'cell', 'dex', 'transcript', 'counts_scaled')
sc %>% select(all_of(cnames)) %>% filter(dex == "untrt" & (transcript %in% c("ACTN1","ANAPC4") ))
## # A tibble: 8 × 5
## sample cell dex transcript counts_scaled
## <dbl> <chr> <chr> <chr> <dbl>
## 1 508 N61311 untrt ANAPC4 777.
## 2 508 N61311 untrt ACTN1 14410.
## 3 512 N052611 untrt ANAPC4 786.
## 4 512 N052611 untrt ACTN1 16644.
## 5 516 N080611 untrt ANAPC4 709.
## 6 516 N080611 untrt ACTN1 15805.
## 7 520 N061011 untrt ANAPC4 827.
## 8 520 N061011 untrt ACTN1 16015.
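Because more than one solution is possible, here is a minimal sketch of an equivalent way to write the row filter. It uses a small made-up tibble (toy_sc, hypothetical, not the real sc data) and shows that conditions joined with & give the same rows as conditions passed to filter() as separate arguments.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for `sc`; the values are made up
toy_sc <- tibble(
  sample = c(1, 1, 2, 2),
  dex = c("untrt", "untrt", "trt", "trt"),
  transcript = c("ACTN1", "TSPAN6", "ACTN1", "ANAPC4"),
  counts_scaled = c(10, 20, 30, 40)
)

# Conditions joined explicitly with &
with_and <- toy_sc %>%
  filter(dex == "untrt" & transcript %in% c("ACTN1", "ANAPC4"))

# Conditions passed as separate arguments; filter() combines them with AND
with_comma <- toy_sc %>%
  filter(dex == "untrt", transcript %in% c("ACTN1", "ANAPC4"))

# Compare the two results
all.equal(with_and, with_comma)
```

Either form works in the Q1 solution; the comma form avoids the extra parentheses around the %in% test.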
Q2. Using dexp ("./data/diffexp_results_edger_airways.txt"), create a data frame containing the top 5 differentially expressed genes and save it to an object named top5. Top genes in this case will have the smallest FDR-corrected p-value and an absolute value of the log fold change greater than 2. See dplyr::slice().
Q2 Solution
dexp<-read_delim("../data/diffexp_results_edger_airways.txt")
## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
top5 <- dexp %>%
  dplyr::filter(abs(logFC) > 2) %>%
  slice_min(n = 5, order_by = FDR, with_ties = FALSE)
top5
## # A tibble: 5 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG0000010… untrt ZBTB16 hg38 TRUE 7.15 4.15 1429. 5.11e-11
## 2 ENSG0000016… untrt CACNB2 hg38 TRUE 3.28 4.51 1575. 3.34e-11
## 3 ENSG0000012… untrt DUSP1 hg38 TRUE 2.94 7.31 694. 1.18e- 9
## 4 ENSG0000014… untrt PRSS35 hg38 TRUE -2.76 3.91 807. 6.16e-10
## 5 ENSG0000015… untrt SPARCL1 hg38 TRUE 4.56 5.53 721. 1.00e- 9
## # ℹ 1 more variable: FDR <dbl>
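As an alternative to slice_min(), the same selection can be sketched with arrange() followed by slice_head(). The example below runs on a made-up tibble (toy_dexp, hypothetical values) and keeps the top 3 rows instead of 5 so the result is easy to check by eye.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for `dexp`; the values are made up
toy_dexp <- tibble(
  transcript = c("A", "B", "C", "D", "E", "F"),
  logFC = c(3.1, -2.5, 0.4, 2.2, -4.0, 1.9),
  FDR = c(0.001, 0.0005, 0.00001, 0.02, 0.003, 0.0001)
)

# Approach used in the solution: filter on |logFC|, then take the smallest FDR values
top3_slice <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  slice_min(order_by = FDR, n = 3, with_ties = FALSE)

# Alternative: sort by FDR and take the first rows
top3_arrange <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  arrange(FDR) %>%
  slice_head(n = 3)

top3_slice$transcript
top3_arrange$transcript
```

Note that genes C and F have the smallest FDR values in the toy data but are dropped first because their absolute log fold change is not greater than 2.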
Q3. Filter sc to contain only the top 5 differentially expressed genes.
Q3 Solution
sc %>% dplyr::filter(transcript %in% top5$transcript)
## # A tibble: 40 × 18
## feature sample counts SampleName cell dex albut Run avgLength Experiment
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 ENSG00… 508 4 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 2 ENSG00… 508 665 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 3 ENSG00… 508 330 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 4 ENSG00… 508 62 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 5 ENSG00… 508 80 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 6 ENSG00… 509 739 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 7 ENSG00… 509 5020 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 8 ENSG00… 509 41 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 9 ENSG00… 509 2040 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 10 ENSG00… 509 731 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## # ℹ 30 more rows
## # ℹ 8 more variables: Sample <chr>, BioSample <chr>, transcript <chr>,
## # ref_genome <chr>, .abundant <lgl>, TMM <dbl>, multiplier <dbl>,
## # counts_scaled <dbl>
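Another possible approach to this kind of lookup is a filtering join. The sketch below uses two made-up tibbles (toy_sc and toy_top, both hypothetical) to show that semi_join() keeps the same rows as filtering with %in%, without adding any columns from the second table.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; the values are made up
toy_sc <- tibble(
  transcript = c("A", "B", "A", "C"),
  counts = c(1, 2, 3, 4)
)
toy_top <- tibble(
  transcript = c("A", "C"),
  FDR = c(0.01, 0.02)
)

# Approach used in the solution: membership test against a vector
via_in <- toy_sc %>%
  filter(transcript %in% toy_top$transcript)

# Alternative: keep rows of toy_sc that have a match in toy_top
via_semi <- toy_sc %>%
  semi_join(toy_top, by = "transcript")

all.equal(via_in, via_semi)
```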
Q4. Select only columns of type character from sc.
Q4 Solution
sc %>% select(where(is.character))
## # A tibble: 127,408 × 11
## feature SampleName cell dex albut Run Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ENSG000000000… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 2 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 3 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 4 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 5 ENSG000000009… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 6 ENSG000000010… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 7 ENSG000000010… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 8 ENSG000000011… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 9 ENSG000000014… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 10 ENSG000000014… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## # ℹ 127,398 more rows
## # ℹ 2 more variables: transcript <chr>, ref_genome <chr>
Q5. Select all columns from dexp except .abundant and PValue. Keep only rows with FDR less than or equal to 0.01.
Q5 Solution
dexp %>% select(-c(.abundant,PValue)) %>% filter(FDR <= 0.01)
## # A tibble: 2,763 × 8
## feature albut transcript ref_genome logFC logCPM F FDR
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 untrt TSPAN6 hg38 -0.390 5.06 32.8 0.00283
## 2 ENSG00000000971 untrt CFH hg38 0.417 8.09 29.3 0.00376
## 3 ENSG00000001167 untrt NFYA hg38 -0.509 4.13 44.9 0.00126
## 4 ENSG00000002834 untrt LASP1 hg38 0.388 8.39 22.7 0.00722
## 5 ENSG00000003096 untrt KLHL13 hg38 -0.949 4.16 84.8 0.000234
## 6 ENSG00000003402 untrt CFLAR hg38 1.18 6.90 130. 0.0000800
## 7 ENSG00000003987 untrt MTMR7 hg38 0.993 0.341 24.7 0.00585
## 8 ENSG00000004059 untrt ARF5 hg38 0.358 5.84 30.9 0.00328
## 9 ENSG00000004487 untrt KDM1A hg38 -0.308 5.86 23.5 0.00663
## 10 ENSG00000004700 untrt RECQL hg38 0.360 5.60 22.7 0.00721
## # ℹ 2,753 more rows
Q6. Import the file "./data/airway_rawcount.csv". Use the function rename() to rename the first column. Use the pipe to import and rename successively without intermediate steps or function nesting. Save to an object named acount.
Q6 Solution
acount <- read_csv("../data/airway_rawcount.csv") %>%
  dplyr::rename(Feature = ...1)
## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
acount
## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 679 448 873 408 1138 1047
## 2 ENSG000000… 0 0 0 0 0 0
## 3 ENSG000000… 467 515 621 365 587 799
## 4 ENSG000000… 260 211 263 164 245 331
## 5 ENSG000000… 60 55 40 35 78 63
## 6 ENSG000000… 0 0 2 0 1 0
## 7 ENSG000000… 3251 3679 6177 4252 6721 11027
## 8 ENSG000000… 1433 1062 1733 881 1424 1439
## 9 ENSG000000… 519 380 595 493 820 714
## 10 ENSG000000… 394 236 464 175 658 584
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
Q7. Use filter() on the object acount to keep only genes that had a count greater than 10 in at least one sample.
Q7 Solution
acount %>%
  filter(if_any(where(is.numeric), ~ .x > 10))
## # A tibble: 17,792 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 679 448 873 408 1138 1047
## 2 ENSG000000… 467 515 621 365 587 799
## 3 ENSG000000… 260 211 263 164 245 331
## 4 ENSG000000… 60 55 40 35 78 63
## 5 ENSG000000… 3251 3679 6177 4252 6721 11027
## 6 ENSG000000… 1433 1062 1733 881 1424 1439
## 7 ENSG000000… 519 380 595 493 820 714
## 8 ENSG000000… 394 236 464 175 658 584
## 9 ENSG000000… 172 168 264 118 241 210
## 10 ENSG000000… 2112 1867 5137 2657 2735 2751
## # ℹ 17,782 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
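To make the "at least one sample" condition concrete, the sketch below contrasts if_any() with if_all() on a made-up count table (toy_counts, hypothetical values). if_any() keeps a row when the test holds in any selected column, while if_all() requires it to hold in every selected column.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for `acount`; the values are made up
toy_counts <- tibble(
  Feature = c("g1", "g2", "g3"),
  s1 = c(0, 12, 50),
  s2 = c(3, 4, 60)
)

# Count greater than 10 in at least one sample
any_gt10 <- toy_counts %>%
  filter(if_any(where(is.numeric), ~ .x > 10))

# Count greater than 10 in every sample
all_gt10 <- toy_counts %>%
  filter(if_all(where(is.numeric), ~ .x > 10))

any_gt10$Feature
all_gt10$Feature
```

Here g2 passes the if_any() filter because of its count in s1, but fails the if_all() filter because of its count in s2.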
Q8. Challenge Question: Filter out genes from acount that had a total count of less than ten across all samples. Hint: Use column_to_rownames() and look up rowSums(). For an alternative solution, check out the documentation on row-wise operations.
Q8 Solution
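# Keep genes whose total count across all samples is at least 10
f_acount <- acount %>% column_to_rownames("Feature") %>% filter(rowSums(.) >= 10)
# Alternatively, without moving Feature to the row names
f_acount2 <- acount %>% filter(rowSums(pick(where(is.numeric))) >= 10)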
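The question also points to row-wise operations. A minimal sketch of that route, on a made-up count table (toy_counts, hypothetical values), uses rowwise() with c_across() to sum each row. It assumes the reading that genes with a total below ten are the ones to remove, so rows with a total of at least 10 are kept.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for `acount`; the values are made up
toy_counts <- tibble(
  Feature = c("g1", "g2", "g3"),
  s1 = c(0, 5, 50),
  s2 = c(3, 5, 60)
)

# Sum the numeric columns within each row and keep rows with a total of at least 10
kept <- toy_counts %>%
  rowwise() %>%
  filter(sum(c_across(where(is.numeric))) >= 10) %>%
  ungroup()

kept$Feature
```

On a table as large as acount, rowwise() is likely to be noticeably slower than the vectorized rowSums() solutions, so it is mainly useful when the per-row computation has no vectorized equivalent.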