
Lesson 4 Exercise Questions: Tidyverse

The filtlowabund_scaledcounts_airways.txt includes normalized and non-normalized transcript count data from an RNAseq experiment. You can read more about the experiment here. You can obtain the data outside of class here.

The diffexp_results_edger_airways.txt includes results from a differential expression analysis using edgeR. You can obtain the data outside of class here.

Putting what we have learned to the test:
The following questions synthesize several of the skills you have learned thus far. It may not be immediately apparent how to answer them. Remember, the R community is expansive, and there are a number of ways to get help, including but not limited to a Google search. These questions have multiple solutions, but try to solve each problem using the tidyverse.

The normalized and non-normalized count data should be saved to the object scaled_counts. The differential expression results should be saved to the object dexp.
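As a starting point, the two objects could be created along these lines. This is a minimal sketch that assumes both files are tab-delimited and saved in the current working directory; neither detail is stated above, so adjust the paths and the reader function to match your copies.

```r
library(readr)

# Assumption: the files are tab-delimited and sit in the working directory.
counts_file <- "filtlowabund_scaledcounts_airways.txt"
dexp_file   <- "diffexp_results_edger_airways.txt"

# Only read the files if they are actually present.
if (file.exists(counts_file) && file.exists(dexp_file)) {
  scaled_counts <- read_tsv(counts_file)
  dexp <- read_tsv(dexp_file)
}
```

If the files use a different delimiter, read_delim() with an explicit delim argument is one alternative.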

  1. Using scaled_counts, is there a difference in the number of transcripts with greater than 0 normalized counts ("counts_scaled") per sample? What did you use to answer this question?

    Solution
    table(scaled_counts[scaled_counts$counts_scaled>0,]$sample)
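Since the exercise encourages tidyverse solutions, the same per-sample tally can also be sketched with filter() and count(). The toy data frame below is made up for illustration and only borrows the column names of scaled_counts; it is not the airway data.

```r
library(dplyr)

# Toy stand-in for scaled_counts (made-up values)
toy_counts <- tibble(
  sample = c("A", "A", "A", "B", "B", "B"),
  transcript = c("g1", "g2", "g3", "g1", "g2", "g3"),
  counts_scaled = c(10, 0, 5, 0, 0, 7)
)

# Keep rows with non-zero normalized counts, then tally rows per sample
per_sample <- toy_counts %>%
  filter(counts_scaled > 0) %>%
  count(sample, name = "n_transcripts")

per_sample
```

Swapping toy_counts for scaled_counts should give one row per sample, which makes differences between samples easy to compare.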
    

  2. Select the following columns from the scaled_counts data frame: sample, cell, dex, Run, transcript, avgLength, and counts_scaled. However, rearrange the columns so that the column 'Run' follows 'sample' and 'avgLength' is the last column. Save this to the object df_counts.

    Solution

    df_counts<-scaled_counts %>%
      select(sample, Run, cell, dex, transcript, counts_scaled, avgLength)
    

  3. Using the differential expression results, create a data frame with the top five differentially expressed genes by p-value. Hint: Top genes in this case will have the smallest FDR corrected p-value and an absolute value of the log fold change greater than 2. (Lesson 4 challenge question)

    Solution

    topgene<-dexp %>%
      arrange(FDR) %>%
      filter(abs(logFC) > 2) %>%
      head(5)
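A common slip when filtering on fold change is to wrap the threshold, rather than the column, in abs(). The sketch below uses a made-up table (hypothetical transcripts g1 to g4, borrowing only the column names of dexp) to show how the two filters differ.

```r
library(dplyr)

# Toy stand-in for dexp (made-up values)
toy_dexp <- tibble(
  transcript = c("g1", "g2", "g3", "g4"),
  logFC = c(3.1, -2.8, 0.4, 2.0),
  FDR = c(0.001, 0.0005, 0.2, 0.01)
)

# abs(2) is just 2, so this keeps only rows with logFC >= 2
up_only <- toy_dexp %>%
  filter(logFC >= abs(2))

# abs() on the column keeps large changes in both directions
both_dirs <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  arrange(FDR)
```

Here up_only drops g2 even though it is strongly down-regulated, while both_dirs keeps it and ranks it first because it has the smallest FDR.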
    

  4. Filter the data frame scaled_counts to include only our top five differentially expressed genes (from question 3) and save to a new object named top_gene_counts.

    Solution

    top_gene_counts<-
      scaled_counts %>%
      filter(transcript %in% topgene$transcript) 
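The %in% operator tests each transcript against a vector of identifiers, so every row belonging to a listed gene is kept. A small sketch with made-up data (toy_counts and toy_top are hypothetical stand-ins for scaled_counts and topgene) also shows one way to check whether any requested gene is absent from the count table.

```r
library(dplyr)

# Toy stand-ins (made-up values)
toy_counts <- tibble(
  sample = c("A", "A", "B", "B"),
  transcript = c("g1", "g2", "g1", "g2"),
  counts_scaled = c(10, 3, 8, 0)
)
toy_top <- tibble(transcript = c("g1", "g9"))

# Keep every row whose transcript appears in the top-gene list
kept <- toy_counts %>%
  filter(transcript %in% toy_top$transcript)

# Top genes with no rows in the count table
not_found <- base::setdiff(toy_top$transcript, kept$transcript)
```

An empty not_found vector would mean every top gene has at least one row in the filtered result.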
    

  5. Return a filtered data frame of the differential expression results. We want to look at only the transcripts with a logCPM greater than 3, an absolute logFC greater than or equal to 2.5, and an adjusted (FDR) p-value less than 0.001.

    Solution

    dexp %>%
      filter(logCPM > 3, abs(logFC) >= 2.5, FDR < 0.001)