Lesson 2 Exercise Questions: Part 2 (Tidyverse)
The file filtlowabund_scaledcounts_airways.txt contains normalized and non-normalized transcript count data from an RNA-seq experiment. You can read more about the experiment here. You can obtain the data outside of class here.
The file diffexp_results_edger_airways.txt contains the results of a differential expression analysis using edgeR. You can obtain the data outside of class here.
Putting what we have learned to the test:
The following questions synthesize several of the skills you have learned thus far. It may not be immediately apparent how to answer them. Remember, the R community is expansive, and there are many ways to get help, including but not limited to a Google search. These questions have multiple solutions, but try to solve each one using the tidyverse.
The normalized and non-normalized count data should be saved to the object scaled_counts. The differential expression results should be saved to the object dexp.
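One possible way to load the two files is sketched below. It assumes that both files are tab-delimited and sit in the current working directory; neither is stated above, so adjust the paths and the reader function if your copies differ. The variables counts_path and dexp_path are introduced here only for illustration.

```r
library(readr)

# Assumption: both files are tab-delimited and in the working directory.
counts_path <- "filtlowabund_scaledcounts_airways.txt"
dexp_path   <- "diffexp_results_edger_airways.txt"

# Read the files only if they are present, so the script does not stop
# with an error when run from a different directory.
if (file.exists(counts_path) && file.exists(dexp_path)) {
  scaled_counts <- read_tsv(counts_path)
  dexp          <- read_tsv(dexp_path)
}
```

Base R's read.delim() would work just as well; read_tsv() is used here because it returns a tibble, which fits the tidyverse focus of these questions.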
- Select the following columns from the scaled_counts data frame: sample, cell, dex, Run, transcript, avgLength, and counts_scaled. However, rearrange the columns so that the column 'Run' follows 'sample' and 'avgLength' is the last column. Save this to the object df_counts.
- Explore the column 'avgLength' in df_counts. Does the data in this column vary within a sample? How could we figure this out if we didn't know what was in this column?
- Create a data frame that contains the mean, standard deviation, median, minimum, and maximum of the normalized counts (in column counts_scaled) by treatment (dex) and cell line (cell). Store this in an object named sumstats_counts.
- Using the differential expression results, create a data frame with the top five differentially expressed genes by p-value. Hint: Top genes in this case will have the smallest FDR-corrected p-value and an absolute value of the log fold change greater than 2. (Lesson 2 challenge question)
- Filter the data frame scaled_counts to include only our top five differentially expressed genes (from question 4) and save to a new object named top_gene_counts.
- Create a data frame of the mean, median, and standard deviation of the normalized counts for each of our top transcripts by treatment (dex). Is there a large amount of variation within a treatment?
- Return a filtered data frame of the differential expression results. We want to look at only the transcripts with a logCPM greater than 3, an absolute logFC greater than or equal to 2.5, and an adjusted (FDR) p-value less than 0.001.
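
Most of the questions above combine a handful of dplyr verbs. As a rough illustration only, and not a solution, the sketch below applies filter(), arrange(), group_by(), and summarize() to two toy tibbles. Every value is made up, and the column names (transcript, logFC, logCPM, FDR, dex, counts_scaled) are borrowed from the questions on the assumption that the real data frames use them.

```r
library(dplyr)

# Toy differential expression table with made-up values.
toy_dexp <- tibble(
  transcript = c("geneA", "geneB", "geneC", "geneD"),
  logFC      = c(3.1, -2.8, 0.4, -1.2),
  logCPM     = c(5.0, 4.2, 6.3, 1.5),
  FDR        = c(1e-6, 2e-4, 0.5, 1e-5)
)

# Keep rows whose absolute log fold change exceeds a cutoff,
# then sort so the smallest adjusted p-values come first.
toy_top <- toy_dexp %>%
  filter(abs(logFC) > 2) %>%
  arrange(FDR)

# Toy count table with made-up values.
toy_counts <- tibble(
  dex           = c("trt", "trt", "untrt", "untrt"),
  counts_scaled = c(10, 20, 4, 6)
)

# One row of summary statistics per treatment group.
toy_summary <- toy_counts %>%
  group_by(dex) %>%
  summarize(
    mean_counts = mean(counts_scaled),
    sd_counts   = sd(counts_scaled)
  )
```

Wrapping a column in abs() inside filter() lets a single condition catch both strongly up-regulated and strongly down-regulated transcripts. Additional conditions can be added to the same filter() call, separated by commas.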