
Lesson 5 Exercise Questions: Tidyverse

The file filtlowabund_scaledcounts_airways.txt includes normalized and non-normalized transcript count data from an RNA-seq experiment. You can read more about the experiment here. You can obtain the data outside of class here.

The file diffexp_results_edger_airways.txt includes results from a differential expression analysis using edgeR. You can obtain the data outside of class here.

Putting what we have learned to the test:
The following questions synthesize several of the skills you have learned thus far. It may not be immediately apparent how to answer them. Remember, the R community is expansive, and there are many ways to get help, including but not limited to a Google search. These questions have multiple solutions, but try to solve each problem using the tidyverse.

The normalized and non-normalized count data should be saved to the object scaled_counts. The differential expression results should be saved to the object dexp.
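As a minimal sketch of the loading step, the example below writes a tiny stand-in file and reads it back with readr. It assumes the lesson files are tab-delimited, which the .txt extension alone does not confirm, and the toy rows and the name scaled_counts_toy are hypothetical.

```r
library(readr)

# Toy stand-in for a lesson file (hypothetical values, assumed tab-delimited).
toy_path <- tempfile(fileext = ".txt")
writeLines(c("sample\ttranscript\tcounts",
             "S1\tgeneA\t679",
             "S2\tgeneA\t448"),
           toy_path)

# read_tsv() returns a tibble; show_col_types = FALSE silences the column report.
scaled_counts_toy <- read_tsv(toy_path, show_col_types = FALSE)
scaled_counts_toy

# With the real files, and paths adjusted to wherever you saved them:
# scaled_counts <- read_tsv("filtlowabund_scaledcounts_airways.txt")
# dexp <- read_tsv("diffexp_results_edger_airways.txt")
```

If the real files turn out to use a different delimiter, read_delim() with an explicit delim argument is the more general choice.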

  1. Explore the column "avgLength" in scaled_counts. Do the values in this column vary within a sample? How could we figure this out if we didn't know what was in this column?

    Solution

    scaled_counts %>%
        group_by(sample) %>%
        summarize(median = median(avgLength),
                  max = max(avgLength),
                  min = min(avgLength))
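
    Another way to check for within-sample variation, sketched here on a hypothetical toy tibble rather than the lesson data, is to count the distinct values per sample with n_distinct(). A count of 1 means the column is constant within that sample.

    ```r
    library(dplyr)

    # Hypothetical toy data: avgLength is constant within each sample.
    toy <- tibble(
      sample    = c("A", "A", "A", "B", "B", "B"),
      avgLength = c(126, 126, 126, 98, 98, 98)
    )

    # One row per sample; n_values is the number of distinct avgLength values.
    length_check <- toy %>%
      group_by(sample) %>%
      summarize(n_values = n_distinct(avgLength))
    length_check
    ```

    The same pipeline should work on scaled_counts, assuming it has the "sample" and "avgLength" columns named in the question.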
    

  2. Create a column in scaled_counts named "z_counts" that contains a z-score transformation of the "counts" column.

    Solution
    scaled_counts %>% mutate(z_counts=scale(counts))
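
    One detail worth knowing: scale() returns a one-column matrix, so the new column is stored as a matrix rather than a plain numeric vector. The sketch below, on a hypothetical toy tibble, wraps the call in as.numeric() and confirms that the result has mean 0 and standard deviation 1.

    ```r
    library(dplyr)

    # Hypothetical toy counts.
    toy <- tibble(counts = c(10, 20, 30, 40))

    # as.numeric() drops the matrix dimensions that scale() adds.
    z_toy <- toy %>%
      mutate(z_counts = as.numeric(scale(counts)))

    # A z-score transformation centers on the mean and divides by the standard deviation.
    mean(z_toy$z_counts)
    sd(z_toy$z_counts)
    ```

    Note that this scales across every row at once. Adding a group_by() step before mutate() would compute the z-scores within each group instead.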
    

  3. Coerce the columns "sample" and "SampleName" from scaled_counts to type factor.

    Solution
    scaled_counts %>% mutate(across(c(sample, SampleName), as.factor))
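
    The pipeline above returns a new data frame and leaves scaled_counts unchanged unless the result is assigned. A small sketch on hypothetical toy data shows the assignment and one way to confirm the column types afterwards.

    ```r
    library(dplyr)

    # Hypothetical toy data with the two columns named in the question.
    toy <- tibble(
      sample     = c(1, 2),
      SampleName = c("S1", "S2"),
      counts     = c(679, 448)
    )

    # Assign the result so the coercion is kept.
    toy_factor <- toy %>%
      mutate(across(c(sample, SampleName), as.factor))

    # Report the class of every column.
    sapply(toy_factor, class)
    ```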
    

  4. In the lesson 4 exercise, you created a data frame with the top five differentially expressed genes by p-value and logFC.

    topgene <- dexp %>%
        arrange(FDR) %>%
        filter(abs(logFC) >= 2) %>%  # abs() belongs on logFC; abs(2) is just 2
        head(5)
    

    Create a data frame of the mean, median, and standard deviation of the normalized counts ("counts_scaled") for each of our top transcripts by treatment ("dex"). Is there a large amount of variation within a treatment?

    Solution

    scaled_counts %>%
        filter(transcript %in% topgene$transcript) %>%
        group_by(dex, transcript) %>%
        summarize(mean_counts = mean(counts_scaled),
                  sd = sd(counts_scaled),
                  median = median(counts_scaled))
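
    To judge whether the variation within a treatment is large, it can help to express the standard deviation relative to the mean (the coefficient of variation). The sketch below uses hypothetical toy data and hypothetical column names (sd_counts, cv). It also passes .groups = "drop" so that the summary comes back ungrouped.

    ```r
    library(dplyr)

    # Hypothetical toy data: two transcripts, two treatments, two replicates each.
    toy <- tibble(
      transcript    = rep(c("geneA", "geneB"), each = 4),
      dex           = rep(c("trt", "trt", "untrt", "untrt"), times = 2),
      counts_scaled = c(10, 14, 100, 120, 5, 7, 6, 8)
    )

    toy_summary <- toy %>%
      group_by(dex, transcript) %>%
      summarize(mean_counts = mean(counts_scaled),
                sd_counts = sd(counts_scaled),
                cv = sd_counts / mean_counts,  # spread relative to the mean
                .groups = "drop")
    toy_summary
    ```

    A cv close to 0 indicates that replicates within a treatment agree closely. What counts as "large" depends on the experiment.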