Select and Filter
All solutions use the pipe. Most questions can be solved in more than one way; each solution below shows one possibility.
Q1. Import the file "./data/filtlowabund_scaledcounts_airways.txt" and save it to an object named sc. Create a data frame from sc that includes only the columns sample, cell, dex, transcript, and counts_scaled, and only the rows where the treatment (dex) is "untrt" and the transcript is either "ACTN1" or "ANAPC4".
Q1 Solution
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sc <- read_delim("../data/filtlowabund_scaledcounts_airways.txt")
## Rows: 127408 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (11): feature, SampleName, cell, dex, albut, Run, Experiment, Sample, Bi...
## dbl (6): sample, counts, avgLength, TMM, multiplier, counts_scaled
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cnames <- c("sample", "cell", "dex", "transcript", "counts_scaled")
sc %>%
  select(all_of(cnames)) %>%
  filter(dex == "untrt" & transcript %in% c("ACTN1", "ANAPC4"))
## # A tibble: 8 × 5
## sample cell dex transcript counts_scaled
## <dbl> <chr> <chr> <chr> <dbl>
## 1 508 N61311 untrt ANAPC4 777.
## 2 508 N61311 untrt ACTN1 14410.
## 3 512 N052611 untrt ANAPC4 786.
## 4 512 N052611 untrt ACTN1 16644.
## 5 516 N080611 untrt ANAPC4 709.
## 6 516 N080611 untrt ACTN1 15805.
## 7 520 N061011 untrt ANAPC4 827.
## 8 520 N061011 untrt ACTN1 16015.
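As one possible alternative for Q1, the two conditions can be passed to filter() as separate, comma-separated arguments (which dplyr combines with AND), and the rows can be filtered before the columns are selected. The sketch below uses a small made-up tibble, toy_sc, rather than the real sc object, so the values are purely illustrative.

```r
library(dplyr)

# Hypothetical toy data standing in for sc; all values are made up
toy_sc <- tibble(
  sample        = c(508, 508, 509, 509),
  cell          = c("N61311", "N61311", "N61311", "N61311"),
  dex           = c("untrt", "untrt", "trt", "trt"),
  transcript    = c("ACTN1", "TSPAN6", "ACTN1", "ANAPC4"),
  counts_scaled = c(100, 20, 150, 30),
  counts        = c(90, 18, 140, 28)
)

# Comma-separated conditions in filter() are combined with AND
q1_alt <- toy_sc %>%
  filter(dex == "untrt", transcript %in% c("ACTN1", "ANAPC4")) %>%
  select(sample, cell, dex, transcript, counts_scaled)

q1_alt
```

Filtering first is only possible while the columns used in the conditions still exist, so the order matters whenever select() would drop one of them.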
Q2. Import "./data/diffexp_results_edger_airways.txt" and save it to an object named dexp. From dexp, create a data frame containing the top 5 differentially expressed genes and save it to an object named top5. Top genes in this case are those with the smallest FDR-corrected p-values among genes whose absolute log fold change is greater than 2. See dplyr::slice().
Q2 Solution
dexp <- read_delim("../data/diffexp_results_edger_airways.txt")
## Rows: 15926 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): feature, albut, transcript, ref_genome
## dbl (5): logFC, logCPM, F, PValue, FDR
## lgl (1): .abundant
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
top5 <- dexp %>%
  dplyr::filter(abs(logFC) > 2) %>%
  slice_min(order_by = FDR, n = 5, with_ties = FALSE)
top5
## # A tibble: 5 × 10
## feature albut transcript ref_genome .abundant logFC logCPM F PValue
## <chr> <chr> <chr> <chr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG0000010… untrt ZBTB16 hg38 TRUE 7.15 4.15 1429. 5.11e-11
## 2 ENSG0000016… untrt CACNB2 hg38 TRUE 3.28 4.51 1575. 3.34e-11
## 3 ENSG0000012… untrt DUSP1 hg38 TRUE 2.94 7.31 694. 1.18e- 9
## 4 ENSG0000014… untrt PRSS35 hg38 TRUE -2.76 3.91 807. 6.16e-10
## 5 ENSG0000015… untrt SPARCL1 hg38 TRUE 4.56 5.53 721. 1.00e- 9
## # ℹ 1 more variable: FDR <dbl>
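Another way to approach Q2, sketched here on a made-up tibble (toy_dexp is hypothetical and not the real dexp), is to sort by FDR with arrange() and then take the first rows with slice_head(). This should select the same kind of rows as slice_min() with with_ties = FALSE, with the result explicitly sorted by FDR.

```r
library(dplyr)

# Hypothetical toy results table; values are made up for illustration
toy_dexp <- tibble(
  transcript = c("A", "B", "C", "D", "E", "F", "G"),
  logFC      = c(3, -2.5, 0.5, 4, -3, 2.2, 1),
  FDR        = c(0.001, 0.01, 0.0001, 0.02, 0.005, 0.03, 0.002)
)

# Keep large effects, sort by FDR (ascending), then take the first two rows
toy_top2 <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  arrange(FDR) %>%
  slice_head(n = 2)

toy_top2
```

Note that gene C has the smallest FDR overall but is excluded because its absolute log fold change is not greater than 2.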
Q3. Filter sc to contain only the top 5 differentially expressed genes.
Q3 Solution
sc %>% dplyr::filter(transcript %in% top5$transcript)
## # A tibble: 40 × 18
## feature sample counts SampleName cell dex albut Run avgLength Experiment
## <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 ENSG00… 508 4 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 2 ENSG00… 508 665 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 3 ENSG00… 508 330 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 4 ENSG00… 508 62 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 5 ENSG00… 508 80 GSM1275862 N613… untrt untrt SRR1… 126 SRX384345
## 6 ENSG00… 509 739 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 7 ENSG00… 509 5020 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 8 ENSG00… 509 41 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 9 ENSG00… 509 2040 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## 10 ENSG00… 509 731 GSM1275863 N613… trt untrt SRR1… 126 SRX384346
## # ℹ 30 more rows
## # ℹ 8 more variables: Sample <chr>, BioSample <chr>, transcript <chr>,
## # ref_genome <chr>, .abundant <lgl>, TMM <dbl>, multiplier <dbl>,
## # counts_scaled <dbl>
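A filtering join is another option for Q3. The sketch below uses semi_join() on two small made-up tibbles (toy_counts and toy_top are hypothetical stand-ins for sc and top5); it keeps the rows of the first table that have a match in the second without adding any columns from it.

```r
library(dplyr)

# Hypothetical stand-ins for sc and top5; values are made up
toy_counts <- tibble(
  transcript = c("A", "A", "B", "C"),
  counts     = c(1, 2, 3, 4)
)
toy_top <- tibble(
  transcript = c("A", "C"),
  FDR        = c(0.01, 0.02)
)

# Keep rows of toy_counts whose transcript appears in toy_top
toy_matched <- semi_join(toy_counts, toy_top, by = "transcript")

toy_matched
```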
Q4. Select only columns of type character from sc.
Q4 Solution
sc %>% select(where(is.character))
## # A tibble: 127,408 × 11
## feature SampleName cell dex albut Run Experiment Sample BioSample
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ENSG000000000… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 2 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 3 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 4 ENSG000000004… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 5 ENSG000000009… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 6 ENSG000000010… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 7 ENSG000000010… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 8 ENSG000000011… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 9 ENSG000000014… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## 10 ENSG000000014… GSM1275862 N613… untrt untrt SRR1… SRX384345 SRS50… SAMN0242…
## # ℹ 127,398 more rows
## # ℹ 2 more variables: transcript <chr>, ref_genome <chr>
Q5. Select all columns from dexp except .abundant and PValue. Keep only rows with FDR less than or equal to 0.01.
Q5 Solution
dexp %>% select(-c(.abundant,PValue)) %>% filter(FDR <= 0.01)
## # A tibble: 2,763 × 8
## feature albut transcript ref_genome logFC logCPM F FDR
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG00000000003 untrt TSPAN6 hg38 -0.390 5.06 32.8 0.00283
## 2 ENSG00000000971 untrt CFH hg38 0.417 8.09 29.3 0.00376
## 3 ENSG00000001167 untrt NFYA hg38 -0.509 4.13 44.9 0.00126
## 4 ENSG00000002834 untrt LASP1 hg38 0.388 8.39 22.7 0.00722
## 5 ENSG00000003096 untrt KLHL13 hg38 -0.949 4.16 84.8 0.000234
## 6 ENSG00000003402 untrt CFLAR hg38 1.18 6.90 130. 0.0000800
## 7 ENSG00000003987 untrt MTMR7 hg38 0.993 0.341 24.7 0.00585
## 8 ENSG00000004059 untrt ARF5 hg38 0.358 5.84 30.9 0.00328
## 9 ENSG00000004487 untrt KDM1A hg38 -0.308 5.86 23.5 0.00663
## 10 ENSG00000004700 untrt RECQL hg38 0.360 5.60 22.7 0.00721
## # ℹ 2,753 more rows
Q6. Import the file "./data/airway_rawcount.csv". Use the function rename() to rename the first column. Use the pipe to import and rename successively without intermediate steps or function nesting. Save to an object named acount.
Q6 Solution
acount <- read_csv("../data/airway_rawcount.csv") %>%
  dplyr::rename(Feature = ...1)
## New names:
## Rows: 64102 Columns: 9
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (8): SRR1039508, SRR1039509, SRR1039512, SRR1039513, SRR1039516,
## SRR1039...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
acount
## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 679 448 873 408 1138 1047
## 2 ENSG000000… 0 0 0 0 0 0
## 3 ENSG000000… 467 515 621 365 587 799
## 4 ENSG000000… 260 211 263 164 245 331
## 5 ENSG000000… 60 55 40 35 78 63
## 6 ENSG000000… 0 0 2 0 1 0
## 7 ENSG000000… 3251 3679 6177 4252 6721 11027
## 8 ENSG000000… 1433 1062 1733 881 1424 1439
## 9 ENSG000000… 519 380 595 493 820 714
## 10 ENSG000000… 394 236 464 175 658 584
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
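rename() also accepts a column position, which avoids relying on the automatically generated name ...1. The sketch below demonstrates this on a tiny temporary CSV written just for the example (the file contents and the object name toy_acount are made up), so it does not depend on the real airway_rawcount.csv.

```r
library(readr)
library(dplyr)

# Write a small made-up CSV whose first header field is blank
tmp <- tempfile(fileext = ".csv")
writeLines(c(",S1,S2", "g1,1,2", "g2,3,4"), tmp)

# Import and rename the first column by position in one pipeline
toy_acount <- read_csv(tmp, show_col_types = FALSE) %>%
  rename(Feature = 1)

toy_acount
```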
Q7. Use filter on the object acount to keep only genes that had a count greater than 10 in at least one sample.
Q7 Solution
acount %>%
  filter(if_any(where(is.numeric), ~ .x > 10))
## # A tibble: 17,792 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 679 448 873 408 1138 1047
## 2 ENSG000000… 467 515 621 365 587 799
## 3 ENSG000000… 260 211 263 164 245 331
## 4 ENSG000000… 60 55 40 35 78 63
## 5 ENSG000000… 3251 3679 6177 4252 6721 11027
## 6 ENSG000000… 1433 1062 1733 881 1424 1439
## 7 ENSG000000… 519 380 595 493 820 714
## 8 ENSG000000… 394 236 464 175 658 584
## 9 ENSG000000… 172 168 264 118 241 210
## 10 ENSG000000… 2112 1867 5137 2657 2735 2751
## # ℹ 17,782 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
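To make the "at least one sample" logic in Q7 concrete, the sketch below contrasts if_any() with if_all() on a small made-up tibble (toy_counts is hypothetical). if_any() keeps a row when the condition holds in any selected column, while if_all() requires it to hold in every selected column.

```r
library(dplyr)

# Hypothetical toy count table; values are made up
toy_counts <- tibble(
  Feature = c("g1", "g2", "g3"),
  s1      = c(0, 50, 11),
  s2      = c(5, 60, 0)
)

# Count greater than 10 in at least one sample
any_gt10 <- toy_counts %>%
  filter(if_any(where(is.numeric), ~ .x > 10))

# Count greater than 10 in every sample
all_gt10 <- toy_counts %>%
  filter(if_all(where(is.numeric), ~ .x > 10))

any_gt10
all_gt10
```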
Q8. Challenge Question: Remove genes from acount that had a total count of less than ten across all samples. Hint: Use column_to_rownames() and look up rowSums(). For an alternative solution, check out the dplyr documentation on row-wise operations.
Q8 Solution
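# Keep genes with a total count of at least ten (drops totals below ten)
f_acount <- acount %>% column_to_rownames("Feature") %>% filter(rowSums(.) >= 10)
# Alternatively, keep Feature as a column and sum only the numeric columns
f_acount2 <- acount %>% filter(rowSums(pick(where(is.numeric))) >= 10)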
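The row-wise approach mentioned in the Q8 hint could look roughly like the sketch below, which uses rowwise() with c_across() on a small made-up tibble (toy_counts is hypothetical). It keeps rows whose total is at least ten, that is, it removes totals below ten. On a table as large as acount this is likely to be slower than rowSums(), so treat it as an illustration of the idea rather than a recommended solution.

```r
library(dplyr)

# Hypothetical toy count table; values are made up
toy_counts <- tibble(
  Feature = c("g1", "g2", "g3"),
  s1      = c(0, 4, 20),
  s2      = c(5, 6, 5)
)

# Sum the numeric columns within each row, keep totals of at least ten
toy_kept <- toy_counts %>%
  rowwise() %>%
  filter(sum(c_across(where(is.numeric))) >= 10) %>%
  ungroup()

toy_kept
```

Calling ungroup() at the end removes the row-wise grouping so that later verbs operate on the whole table again.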