Mutate and Wrangle Challenge

Let's grab some data to work with.

library(tidyverse)

# counts combined with sample metadata (tab-separated file)
acount_smeta <- read_tsv("../data/countsANDmeta.txt")
acount_smeta

# raw count data; read_csv() names the unnamed first column "...1", so rename it
acount <- read_csv("../data/airway_rawcount.csv") %>%
  dplyr::rename("Feature" = "...1")
acount


# differential expression results; no delim is given, so readr (>= 2.0) guesses it
dexp <- read_delim("../data/diffexp_results_edger_airways.txt")
dexp

Q1. Using mutate(), apply a base-10 logarithmic transformation to the numeric columns in acount, adding a pseudocount of 1 before the transformation. Save the resulting data frame to an object called log10counts.

Q1: Solution
log10counts<- acount %>% 
  mutate(across(where(is.numeric),~log10(.x+1)))
log10counts
## # A tibble: 64,102 × 9
##    Feature     SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517
##    <chr>            <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1 ENSG000000…       2.83       2.65      2.94        2.61      3.06        3.02
##  2 ENSG000000…       0          0         0           0         0           0   
##  3 ENSG000000…       2.67       2.71      2.79        2.56      2.77        2.90
##  4 ENSG000000…       2.42       2.33      2.42        2.22      2.39        2.52
##  5 ENSG000000…       1.79       1.75      1.61        1.56      1.90        1.81
##  6 ENSG000000…       0          0         0.477       0         0.301       0   
##  7 ENSG000000…       3.51       3.57      3.79        3.63      3.83        4.04
##  8 ENSG000000…       3.16       3.03      3.24        2.95      3.15        3.16
##  9 ENSG000000…       2.72       2.58      2.78        2.69      2.91        2.85
## 10 ENSG000000…       2.60       2.37      2.67        2.25      2.82        2.77
## # ℹ 64,092 more rows
## # ℹ 2 more variables: SRR1039520 <dbl>, SRR1039521 <dbl>
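
To see why the pseudocount matters, here is a minimal sketch of the same across() pattern on a toy tibble. The gene and sample names (geneA, sample1, and so on) are made up for illustration and are not part of the airway data. Without the + 1, every zero count would become -Inf.

```r
library(dplyr)

# Toy stand-in for acount (hypothetical names and values):
# one character ID column and two numeric count columns
toy_counts <- tibble(
  Feature = c("geneA", "geneB", "geneC"),
  sample1 = c(0, 9, 99),
  sample2 = c(999, 0, 9)
)

# where(is.numeric) selects only the numeric columns, so Feature is untouched;
# the lambda adds the pseudocount of 1 before log10(), so a count of 0 maps to 0
toy_log10 <- toy_counts %>%
  mutate(across(where(is.numeric), ~ log10(.x + 1)))

toy_log10
```

Because the selection is done by column type rather than by name, the same call should work unchanged if samples are added to or removed from the table.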

Q2. Create a column in dexp called Expression. This column should say "Down-regulated" if logFC is less than -1 or "Up-regulated" if logFC is greater than 1. All other values should say "None".

Q2: Solution
dexp_new<-dexp %>% 
  mutate(Expression=case_when(logFC < -1 ~ "Down-regulated",
                              logFC > 1 ~ "Up-regulated",
                              .default = "None")
  )
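
As a quick check of how the conditions behave at the boundaries, the sketch below applies the same case_when() call to a small made-up table (toy_dexp and its values are hypothetical). Both comparisons are strict, so a logFC of exactly -1 or 1 is labeled "None". The .default argument assumes dplyr 1.1.0 or later; on older versions, TRUE ~ "None" as the last condition plays the same role.

```r
library(dplyr)

# Hypothetical differential expression results, including the boundary values
toy_dexp <- tibble(
  feature = c("g1", "g2", "g3", "g4", "g5"),
  logFC   = c(-2.5, -1, 0.3, 1, 1.8)
)

# Conditions are checked in order; rows matching none of them get .default
toy_labeled <- toy_dexp %>%
  mutate(Expression = case_when(
    logFC < -1 ~ "Down-regulated",
    logFC > 1  ~ "Up-regulated",
    .default   = "None"
  ))

# Tally how many genes fall into each category
toy_tally <- toy_labeled %>% count(Expression)
toy_tally
```

Running count(Expression) on dexp_new in the same way is a simple sanity check that all three labels were assigned.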

Challenge question:

Q3. Calculate the mean raw counts for each gene ("Feature") by treatment ("dex") in acount_smeta. Combine these results with the differential expression results. Your resulting data frame should resemble the following:

# A tibble: 15,926 × 12
   Feature         Mean_Counts_trt Mean_Counts_untrt albut transcript ref_genome
   <chr>                     <dbl>             <dbl> <chr> <chr>      <chr>     
 1 ENSG00000000003           619.              865   untrt TSPAN6     hg38      
 2 ENSG00000000419           547.              523   untrt DPM1       hg38      
 3 ENSG00000000457           234.              250.  untrt SCYL3      hg38      
 4 ENSG00000000460            53.2              63.5 untrt C1orf112   hg38      
 5 ENSG00000000971          6738.             5331.  untrt CFH        hg38      
 6 ENSG00000001036          1123.             1487.  untrt FUCA2      hg38      
 7 ENSG00000001084           573.              658.  untrt GCLC       hg38      
 8 ENSG00000001167           316               469   untrt NFYA       hg38      
 9 ENSG00000001460           168.              208   untrt STPG1      hg38      
10 ENSG00000001461          2545              3113.  untrt NIPAL3     hg38      
# ℹ 15,916 more rows
# ℹ 6 more variables: .abundant <lgl>, logFC <dbl>, logCPM <dbl>, F <dbl>,
#   PValue <dbl>, FDR <dbl>
Rows: 15,926
Columns: 12
$ Feature           <chr> "ENSG00000000003", "ENSG00000000419", "ENSG000000004…
$ Mean_Counts_trt   <dbl> 618.75, 546.75, 233.75, 53.25, 6738.25, 1122.75, 572…
$ Mean_Counts_untrt <dbl> 865.00, 523.00, 250.25, 63.50, 5331.25, 1487.25, 657…
$ albut             <chr> "untrt", "untrt", "untrt", "untrt", "untrt", "untrt"…
$ transcript        <chr> "TSPAN6", "DPM1", "SCYL3", "C1orf112", "CFH", "FUCA2…
$ ref_genome        <chr> "hg38", "hg38", "hg38", "hg38", "hg38", "hg38", "hg3…
$ .abundant         <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
$ logFC             <dbl> -0.390100222, 0.197802179, 0.029160865, -0.124382022…
$ logCPM            <dbl> 5.059704, 4.611483, 3.482462, 1.473375, 8.089146, 5.…
$ F                 <dbl> 3.284948e+01, 6.903534e+00, 9.685073e-02, 3.772134e-…
$ PValue            <dbl> 0.0003117656, 0.0280616149, 0.7629129276, 0.55469563…
$ FDR               <dbl> 0.002831504, 0.077013489, 0.844247837, 0.682326613, …

Q3: Solution
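# mean count per gene within each treatment group
a <- acount_smeta %>% 
  group_by(dex, Feature) %>%
  summarise(mean_count = mean(Count)) %>% 
  # one column per treatment: Mean_Counts_trt and Mean_Counts_untrt
  pivot_wider(names_from=dex,values_from=mean_count,
              names_prefix="Mean_Counts_") %>%
  # keep every gene in dexp, whose gene ID column is named "feature"
  right_join(dexp, by=c("Feature" = "feature"))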
## `summarise()` has grouped output by 'dex'. You can override using the `.groups`
## argument.
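
The same summarise, pivot, and join steps can be traced on a toy example small enough to check by hand. All names and values below (toy_long, toy_res, g1 to g3) are made up for illustration. This sketch also passes .groups = "drop" to summarise(), which is one way to return an ungrouped result and silence the message shown above; it is a variation, not a required change to the solution.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format counts: two genes, two samples per treatment
toy_long <- tibble(
  Feature = rep(c("g1", "g2"), each = 4),
  dex     = rep(c("trt", "trt", "untrt", "untrt"), times = 2),
  Count   = c(10, 20, 30, 50, 0, 4, 6, 6)
)

# Hypothetical differential expression results; g3 has no counts in toy_long
toy_res <- tibble(
  feature = c("g1", "g3"),
  logFC   = c(-1.2, 0.4)
)

# Mean count per gene and treatment, then one column per treatment
toy_means <- toy_long %>%
  group_by(dex, Feature) %>%
  summarise(mean_count = mean(Count), .groups = "drop") %>%
  pivot_wider(names_from = dex, values_from = mean_count,
              names_prefix = "Mean_Counts_")

# right_join() keeps every row of toy_res: g2 is dropped,
# and g3 is kept with NA in both mean columns
toy_joined <- toy_means %>%
  right_join(toy_res, by = c("Feature" = "feature"))

toy_joined
```

The right join explains why the expected result has 15,926 rows rather than one row per gene in the count data: only genes present in the differential expression results are kept.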

Q4. If you are interested in practicing data wrangling further, try this wrangling challenge.