Lesson 5 Exercise Questions: Tidyverse

The file filtlowabund_scaledcounts_airways.txt contains normalized and non-normalized transcript count data from an RNA-seq experiment. You can read more about the experiment here. You can obtain the data outside of class here.

The file diffexp_results_edger_airways.txt contains results from a differential expression analysis using edgeR. You can obtain the data outside of class here.

Putting what we have learned to the test:
The following questions synthesize several of the skills you have learned thus far. It may not be immediately apparent how to answer them. Remember, the R community is expansive, and there are a number of ways to get help, including but not limited to a Google search. These questions have multiple solutions, but try to solve each problem using the tidyverse.

The normalized and non-normalized count data should be saved to the object scaled_counts. The differential expression results should be saved to the object dexp.

Explore the column "avgLength" in scaled_counts. Does the data in this column vary within a sample? How could we figure this out if we didn't know what was in this column?

Solution

```
scaled_counts %>%
    group_by(sample) %>%
    summarize(median=median(avgLength),
              max=max(avgLength),
              min=min(avgLength))
```
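Another way to approach the "how could we figure this out" part is to count the distinct values of the column within each sample. The sketch below uses a small made-up tibble (`toy_counts`, not part of the course data) in place of scaled_counts; a count of 1 would mean the column is constant within that sample.

```r
library(dplyr)

# Toy stand-in for scaled_counts; the values are invented for illustration.
toy_counts <- tibble(
  sample    = c("A", "A", "A", "B", "B", "B"),
  avgLength = c(126, 126, 126, 98, 98, 98)
)

# Count the distinct avgLength values within each sample.
# A count of 1 means the column does not vary within that sample.
length_check <- toy_counts %>%
  group_by(sample) %>%
  summarize(n_values = n_distinct(avgLength))

length_check
```

If the median, max, and min agree for every sample, or `n_values` is 1 throughout, the column holds a single value per sample.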

Create a column in scaled_counts named "z_counts" that contains a z-score transformation of the "counts" column.
Solution
```
# scale() returns a one-column matrix; as.numeric() converts it to a plain vector
scaled_counts %>% mutate(z_counts=as.numeric(scale(counts)))
```
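A z-score is the value minus the column mean, divided by the column standard deviation. As a quick sanity check, the sketch below compares `scale()` with that formula on a made-up tibble (`toy_counts` is hypothetical, not the course data). Because `scale()` returns a one-column matrix, wrapping it in `as.numeric()` is one way to get an ordinary numeric column.

```r
library(dplyr)

# Toy data; the values are invented for illustration.
toy_counts <- tibble(counts = c(10, 20, 30, 40, 50))

z_check <- toy_counts %>%
  mutate(
    # z-score via scale(), converted from a matrix to a plain vector
    z_scale  = as.numeric(scale(counts)),
    # z-score computed directly from its definition
    z_manual = (counts - mean(counts)) / sd(counts)
  )

# Compare the two columns within floating-point tolerance
all.equal(z_check$z_scale, z_check$z_manual)
```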
Coerce the columns "sample" and "SampleName" from scaled_counts to type factor.
Solution
```
scaled_counts %>% mutate(across(c(sample, SampleName), as.factor))
```
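Note that `mutate()` returns a new data frame and does not modify its input, so the result has to be assigned for the coercion to persist. A minimal sketch on a made-up tibble (`toy_counts` and its values are hypothetical) shows one way to confirm the column types afterwards:

```r
library(dplyr)

# Toy stand-in for scaled_counts; sample IDs and names are invented.
toy_counts <- tibble(
  sample     = c("s1", "s2"),
  SampleName = c("name1", "name2"),
  counts     = c(5, 7)
)

# Assign the result; toy_counts itself is left unchanged.
toy_factored <- toy_counts %>%
  mutate(across(c(sample, SampleName), as.factor))

# Inspect the class of every column
sapply(toy_factored, class)
```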
In the lesson 4 exercise, you created a data frame with the top five differentially expressed genes by p-value and logFC.
```
topgene<-dexp %>%
    arrange(FDR) %>%
    filter(abs(logFC) >= 2) %>%
    head(5)
```
Create a data frame of the mean, median, and standard deviation of the normalized counts ("counts_scaled") for each of our top transcripts by treatment ("dex"). Is there a large amount of variation within a treatment?
Solution
```
scaled_counts %>%
    filter(transcript %in% topgene$transcript) %>%
    group_by(dex, transcript) %>%
    summarize(mean_counts=mean(counts_scaled),
              sd=sd(counts_scaled),
              median=median(counts_scaled))
```
```
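To judge whether the variation within a treatment is "large", it can help to put the standard deviation on the same scale as the mean. One common option, not required by the exercise, is the coefficient of variation (standard deviation divided by mean). The sketch below assumes a made-up tibble (`toy_counts`, with invented gene names and values) in place of the filtered scaled_counts.

```r
library(dplyr)

# Toy stand-in for the filtered scaled_counts; all values are invented.
toy_counts <- tibble(
  dex           = rep(c("trt", "untrt"), each = 4),
  transcript    = rep(c("geneA", "geneB"), times = 4),
  counts_scaled = c(100, 10, 120, 12, 300, 40, 340, 44)
)

toy_summary <- toy_counts %>%
  group_by(dex, transcript) %>%
  summarize(
    mean_counts = mean(counts_scaled),
    sd_counts   = sd(counts_scaled),
    # Coefficient of variation: spread relative to the mean
    cv          = sd_counts / mean_counts,
    .groups     = "drop"
  )

toy_summary
```

A coefficient of variation well below 1 suggests the replicates within a treatment agree closely relative to their mean; values near or above 1 point to substantial within-treatment spread.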