Lesson 5 Exercise Questions: Tidyverse
The file filtlowabund_scaledcounts_airways.txt includes normalized and non-normalized transcript count data from an RNAseq experiment. You can read more about the experiment here. You can obtain the data outside of class here.
The file diffexp_results_edger_airways.txt includes results from a differential expression analysis using edgeR. You can obtain the data outside of class here.
Putting what we have learned to the test:
The following questions synthesize several of the skills you have learned thus far. It may not be immediately apparent how to answer them. Remember, the R community is expansive, and there are a number of ways to get help, including but not limited to a Google search. These questions have multiple solutions, but try to solve each one using the tidyverse.
The normalized and non-normalized count data should be saved to the object scaled_counts. The differential expression results should be saved to the object dexp.
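One way to create these two objects is sketched below. It assumes that both files are tab-delimited and sit in the current working directory; neither detail is stated here, so adjust the path and the reader function to match your copy. The helper read_airway_table() is a hypothetical name used only for this illustration.

```r
library(readr)

# Hypothetical helper: reads one tab-delimited text file into a tibble.
# Assumption: the files are tab-delimited; switch to read_delim() if not.
read_airway_table <- function(path) {
  read_tsv(path, show_col_types = FALSE)
}

# Assumption: both files are in the current working directory.
counts_file <- "filtlowabund_scaledcounts_airways.txt"
dexp_file <- "diffexp_results_edger_airways.txt"

if (file.exists(counts_file) && file.exists(dexp_file)) {
  scaled_counts <- read_airway_table(counts_file)
  dexp <- read_airway_table(dexp_file)
} else {
  message("Data files not found in ", getwd())
}
```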
- Explore the column "avgLength" in scaled_counts. Does the data in this column vary within a sample? How could we figure this out if we didn't know what was in this column?
Solution
scaled_counts %>% group_by(sample) %>% summarize(median=median(avgLength), max=max(avgLength), min=min(avgLength))
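A more direct check, sketched here on a small made-up tibble rather than the real data, is to count the distinct values per sample with n_distinct(). The toy values and the column name n_values are illustrative assumptions. A count of 1 means the column is constant within that sample.

```r
library(dplyr)

# Toy stand-in for scaled_counts (values are made up for illustration).
toy_counts <- tibble(
  sample    = c("s1", "s1", "s1", "s2", "s2", "s2"),
  avgLength = c(126, 126, 126, 120, 120, 120)
)

# Count the distinct avgLength values within each sample.
length_check <- toy_counts %>%
  group_by(sample) %>%
  summarize(n_values = n_distinct(avgLength))

print(length_check)
```

The same pipeline should work on scaled_counts itself, since it uses only the sample and avgLength columns.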
- Create a column in scaled_counts named "z_counts" that contains a z-score transformation of the "counts" column.
Solution
scaled_counts %>% mutate(z_counts=scale(counts))
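For reference, scale() with its default arguments subtracts the column mean and divides by the standard deviation, and it returns a one-column matrix rather than a plain vector. The sketch below uses made-up values to show the equivalent manual calculation and one way to flatten the result with as.numeric(). The object toy and its column names are illustrative only.

```r
library(dplyr)

# Made-up counts, used only to illustrate the transformation.
toy <- tibble(counts = c(10, 20, 30, 40, 50))

toy_z <- toy %>%
  mutate(
    # scale() centers and scales; as.numeric() drops the matrix structure.
    z_scale  = as.numeric(scale(counts)),
    # The same z-score written out by hand.
    z_manual = (counts - mean(counts)) / sd(counts)
  )

print(toy_z)
```

Both columns should agree, and the transformed values have a mean of 0 and a standard deviation of 1.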
- Coerce the columns "sample" and "SampleName" from scaled_counts to type factor.
Solution
scaled_counts %>% mutate(across(c(sample, SampleName), as.factor))
- In the Lesson 4 exercise, you created a data frame with the top five differentially expressed genes by p-value and logFC.
topgene <- dexp %>% arrange(FDR) %>% filter(abs(logFC) >= 2) %>% head(5)
Create a data frame of the mean, median, and standard deviation of the normalized counts ("counts_scaled") for each of our top transcripts by treatment ("dex"). Is there a large amount of variation within a treatment?
Solution
scaled_counts %>% filter(transcript %in% topgene$transcript) %>% group_by(dex, transcript) %>% summarize(mean_counts=mean(counts_scaled), sd=sd(counts_scaled), median=median(counts_scaled))