Mutate and Wrangle Challenge
Let's grab some data to work with.
library(tidyverse)
acount_smeta<-read_tsv("../data/countsANDmeta.txt")
acount_smeta
#raw count data
acount<-read_csv("../data/airway_rawcount.csv") %>%
dplyr::rename("Feature" = "...1")
acount
#differential expression results
dexp<-read_delim("../data/diffexp_results_edger_airways.txt")
dexp
Q1. Using mutate, apply a base-10 logarithmic transformation to the numeric columns in acount; add a pseudocount of 1 prior to this transformation. Save the resulting data frame to an object called log10counts.
Q1: Solution
log10counts<- acount %>%
mutate(across(where(is.numeric),~log10(.x+1)))
log10counts
## # A tibble: 64,102 × 9
## Feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ENSG000000… 2.83 2.65 2.94 2.61 3.06 3.02
## 2 ENSG000000… 0 0 0 0 0 0
## 3 ENSG000000… 2.67 2.71 2.79 2.56 2.77 2.90
## 4 ENSG000000… 2.42 2.33 2.42 2.22 2.39 2.52
## 5 ENSG000000… 1.79 1.75 1.61 1.56 1.90 1.81
## 6 ENSG000000… 0 0 0.477 0 0.301 0
## 7 ENSG000000… 3.51 3.57 3.79 3.63 3.83 4.04
## 8 ENSG000000… 3.16 3.03 3.24 2.95 3.15 3.16
## 9 ENSG000000… 2.72 2.58 2.78 2.69 2.91 2.85
## 10 ENSG000000… 2.60 2.37 2.67 2.25 2.82 2.77
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
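To see why the pseudocount matters, the same pattern can be tried on a small made-up table. This is only a sketch: the tibble toy and its columns s1 and s2 are hypothetical and are not part of the airway data. Because log10(0) is -Inf, adding 1 first maps zero counts to 0 instead.

```r
library(dplyr)

# Hypothetical toy counts (illustration only, not the airway data)
toy <- tibble(
  Feature = c("geneA", "geneB", "geneC"),
  s1 = c(0, 9, 99),
  s2 = c(999, 0, 9)
)

# Without a pseudocount, zero counts become -Inf
log10(0)

# across() applies the transformation to every numeric column;
# the character column Feature is left untouched
toy_log10 <- toy %>%
  mutate(across(where(is.numeric), ~ log10(.x + 1)))

toy_log10
```

The lambda ~ log10(.x + 1) is shorthand for function(x) log10(x + 1), so either form should give the same result here.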
Q2. Create a column in dexp called Expression. This column should say "Down-regulated" if logFC is less than -1 or "Up-regulated" if logFC is greater than 1. All other values should say "None".
Q2: Solution
dexp_new <- dexp %>%
  mutate(Expression = case_when(logFC < -1 ~ "Down-regulated",
                                logFC > 1 ~ "Up-regulated",
                                .default = "None")
  )
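A quick way to check the thresholds is to run the same case_when() on a few hand-picked values. The sketch below uses a hypothetical tibble, toy_de, with made-up logFC values that include the boundary cases -1 and 1. Note that the .default argument requires dplyr 1.1.0 or later; with older versions, a final TRUE ~ "None" condition plays the same role.

```r
library(dplyr)

# Hypothetical fold changes, including the boundary values -1 and 1
toy_de <- tibble(
  feature = c("g1", "g2", "g3", "g4", "g5"),
  logFC   = c(-2.5, -1, 0.3, 1, 1.8)
)

toy_de <- toy_de %>%
  mutate(Expression = case_when(
    logFC < -1 ~ "Down-regulated",
    logFC > 1  ~ "Up-regulated",
    .default   = "None"  # dplyr >= 1.1.0; older versions: TRUE ~ "None"
  ))

# Tally how many genes fall into each category
count(toy_de, Expression)
```

Because the comparisons are strict, values of exactly -1 or 1 fall through to "None". Conditions are evaluated in order and the first match wins.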
Challenge question:
Q3. Calculate the mean raw counts for each gene ("Feature") by treatment ("dex") in acount_smeta. Combine these results with the differential expression results. Your resulting data frame should resemble the following:
# A tibble: 15,926 × 12
Feature Mean_Counts_trt Mean_Counts_untrt albut transcript ref_genome
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 ENSG00000000003 619. 865 untrt TSPAN6 hg38
2 ENSG00000000419 547. 523 untrt DPM1 hg38
3 ENSG00000000457 234. 250. untrt SCYL3 hg38
4 ENSG00000000460 53.2 63.5 untrt C1orf112 hg38
5 ENSG00000000971 6738. 5331. untrt CFH hg38
6 ENSG00000001036 1123. 1487. untrt FUCA2 hg38
7 ENSG00000001084 573. 658. untrt GCLC hg38
8 ENSG00000001167 316 469 untrt NFYA hg38
9 ENSG00000001460 168. 208 untrt STPG1 hg38
10 ENSG00000001461 2545 3113. untrt NIPAL3 hg38
# ℹ 15,916 more rows
# ℹ 6 more variables: .abundant <lgl>, logFC <dbl>, logCPM <dbl>, F <dbl>,
# PValue <dbl>, FDR <dbl>
Rows: 15,926
Columns: 12
$ Feature <chr> "ENSG00000000003", "ENSG00000000419", "ENSG000000004…
$ Mean_Counts_trt <dbl> 618.75, 546.75, 233.75, 53.25, 6738.25, 1122.75, 572…
$ Mean_Counts_untrt <dbl> 865.00, 523.00, 250.25, 63.50, 5331.25, 1487.25, 657…
$ albut <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt"…
$ transcript <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2…
$ ref_genome <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg3…
$ .abundant <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ logFC <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022…
$ logCPM <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.…
$ F <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-…
$ PValue <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.55469563…
$ FDR <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, …
Q3: Solution
a <- acount_smeta %>%
  group_by(dex, Feature) %>%
  summarise(mean_count = mean(Count)) %>%
  pivot_wider(names_from = dex, values_from = mean_count,
              names_prefix = "Mean_Counts_") %>%
  right_join(dexp, by = c("Feature" = "feature"))
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
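The message above is informational: after summarise(), the result is still grouped by dex. One optional variation is to pass .groups = "drop" so the summary is explicitly ungrouped and the message is not printed. The sketch below walks through the same group, summarise, pivot, and join steps on hypothetical data; toy_long and toy_dexp are made-up stand-ins for acount_smeta and dexp.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format counts: 2 genes x 4 samples (2 per treatment)
toy_long <- tibble(
  Feature = rep(c("g1", "g2"), each = 4),
  dex     = rep(c("trt", "trt", "untrt", "untrt"), times = 2),
  Count   = c(10, 20, 30, 50, 0, 4, 6, 6)
)

# Hypothetical differential expression results containing only g1
toy_dexp <- tibble(feature = "g1", logFC = -1.4)

toy_joined <- toy_long %>%
  group_by(dex, Feature) %>%
  # .groups = "drop" returns an ungrouped tibble and silences the message
  summarise(mean_count = mean(Count), .groups = "drop") %>%
  # one column per treatment: Mean_Counts_trt, Mean_Counts_untrt
  pivot_wider(names_from = dex, values_from = mean_count,
              names_prefix = "Mean_Counts_") %>%
  # right_join() keeps every row of toy_dexp, so g2 is dropped
  right_join(toy_dexp, by = c("Feature" = "feature"))

toy_joined
```

Since right_join() keeps all rows of the right-hand table, the real result presumably has as many rows as dexp (15,926) rather than one row per gene in the raw counts.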
Q4. If you are interested in practicing data wrangling further, try this wrangling challenge.