Exercise 4: Lesson 5

For this exercise we will use filtlowabund_scaledcounts_airways.txt, which includes normalized and non-normalized transcript count data from an RNAseq experiment. You can read more about the experiment here. To obtain this file, click here.

The following questions synthesize several of the skills you have learned thus far. It may not be immediately apparent how to answer them. Remember that the R community is large, and there are many ways to get help, including but not limited to a Google search. These questions have multiple solutions, but you should try to stick to the tools you have learned so far.

Q1. Import filtlowabund_scaledcounts_airways.txt into R and save to an R object named transcript_counts. Try not to use the drop-down menu for loading the data.

Q1 Solution

transcript_counts <- read.delim("../data/filtlowabund_scaledcounts_airways.txt")
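
As a minimal sketch of what read.delim() does, the example below writes a tiny tab-delimited file to a temporary location and reads it back. The file contents and the name toy_counts are made up for illustration and are not part of the real data set.

```r
# Hypothetical toy file standing in for the real data set
tmp <- tempfile(fileext = ".txt")
writeLines(c("feature\tsample\tcounts",
             "geneA\t508\t10",
             "geneB\t509\t0"), tmp)

# read.delim() expects a header row and tab-separated fields by default
toy_counts <- read.delim(tmp)

# Quick structural checks after import
str(toy_counts)
head(toy_counts)
```

Running str() or head() right after an import is a quick way to confirm that the header and the column types were read as expected.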

Q2. What are the dimensions of transcript_counts?

Q2 Solution

dim(transcript_counts)
## [1] 127408     18

Q3. What are the column names?

Q3 Solution

colnames(transcript_counts)
##  [1] "feature"       "sample"        "counts"        "SampleName"   
##  [5] "cell"          "dex"           "albut"         "Run"          
##  [9] "avgLength"     "Experiment"    "Sample"        "BioSample"    
## [13] "transcript"    "ref_genome"    ".abundant"     "TMM"          
## [17] "multiplier"    "counts_scaled"

Q4. Is there a difference in the number of transcripts with greater than 0 normalized counts (counts_scaled) per sample? What commands did you use to answer this question?

Q4 Solution

#using table
table(transcript_counts[transcript_counts$counts_scaled>0,]$sample)
## 
##   508   509   512   513   516   517   520   521 
## 15921 15919 15923 15918 15913 15920 15914 15910

#alternative solution
summary(factor(transcript_counts[transcript_counts$counts_scaled>0,]$sample))
##   508   509   512   513   516   517   520   521 
## 15921 15919 15923 15918 15913 15920 15914 15910

# or using the tidyverse
library(dplyr)
transcript_counts %>% filter(counts_scaled>0) %>% count(sample)
##   sample     n
## 1    508 15921
## 2    509 15919
## 3    512 15923
## 4    513 15918
## 5    516 15913
## 6    517 15920
## 7    520 15914
## 8    521 15910
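
All three solutions follow the same pattern: keep the rows where counts_scaled is greater than 0, then count the remaining rows per sample. The sketch below shows that pattern on a small made-up data frame (toy and its values are hypothetical), using base R only.

```r
# Hypothetical toy data: two samples with three transcripts each
toy <- data.frame(
  sample        = c(508, 508, 508, 509, 509, 509),
  counts_scaled = c(12.5, 0, 3.1, 0, 0, 7.8)
)

# Logical vector marking rows with a positive normalized count
keep <- toy$counts_scaled > 0

# Count the kept rows per sample
per_sample <- table(toy$sample[keep])
per_sample

# TRUE if the samples do not all have the same count
length(unique(as.vector(per_sample))) > 1
```

In the real data the per-sample counts range from 15910 to 15923, so the answer to Q4 is yes, although the differences are small.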

Q5. How many categories of transcripts are there? Think about what you know regarding factors. Why is this number much smaller than the results of question 4?

Q5 Solution

nlevels(factor(transcript_counts$transcript, exclude = NULL))
## [1] 14576
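
The solution gives the count but not the reason. One plausible explanation, which the exercise does not spell out, is that factor levels are the distinct values in a column: a transcript name that appears on several rows, or a missing name (NA), is counted only once, whereas Q4 counts rows. The sketch below illustrates this with a made-up vector (toy_transcript is hypothetical).

```r
# Hypothetical toy column: 6 rows, one repeated name, two missing names
toy_transcript <- c("TSPAN6", "DPM1", "DPM1", NA, NA, "SCYL3")

# By default, factor() drops NA, so only the distinct names are levels
n_default <- nlevels(factor(toy_transcript))

# exclude = NULL keeps NA as one extra level
n_with_na <- nlevels(factor(toy_transcript, exclude = NULL))

# How many rows are missing a name, and how many names are repeats
n_missing <- sum(is.na(toy_transcript))
n_repeats <- sum(duplicated(toy_transcript[!is.na(toy_transcript)]))

c(rows = length(toy_transcript), default = n_default, with_na = n_with_na)
```

Applying sum(is.na(...)) and duplicated() to transcript_counts$transcript in the same way would show how much each effect contributes in the real data.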

Q6. Subset transcript_counts to only include the following columns: sample, cell, dex, transcript, avgLength, counts_scaled. Save this new dataframe to a new object called transc_df.

Q6 Solution

transc_df <- transcript_counts[c("sample","cell","dex",
                                "transcript","avgLength",
                                "counts_scaled")]

Q7. Using your new data frame from question six (transc_df), rename the column "sample" to "Sample".

Q7 Solution

colnames(transc_df)[1] <- "Sample"
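
Renaming by position works here because "sample" is the first column. A slightly more defensive variant, shown as a sketch on a made-up data frame (toy_df is hypothetical), looks the column up by name so the code still works if the column order changes.

```r
# Hypothetical toy data frame
toy_df <- data.frame(sample = c(508, 509),
                     cell   = c("N61311", "N61311"))

# Find the column called "sample" and rename only that one
colnames(toy_df)[colnames(toy_df) == "sample"] <- "Sample"

colnames(toy_df)
```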

Q8. What are the mean and standard deviation of "avgLength" across the entire transc_df data frame? Hint: Read the help documentation for mean() and sd().

Q8 Solution

mean_avgLength <- mean(transc_df$avgLength)
sd_avgLength <- sd(transc_df$avgLength)
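
The help pages for mean() and sd() both describe the na.rm argument, which matters when a column contains missing values. The solution above works without it, which suggests avgLength has no missing values, but the sketch below shows the difference on a made-up vector (toy_len is hypothetical).

```r
# Hypothetical toy vector with one missing value
toy_len <- c(126, 126, 87, NA, 120)

# Without na.rm, a single NA makes the result NA
mean_with_na <- mean(toy_len)

# With na.rm = TRUE, missing values are dropped before computing
mean_no_na <- mean(toy_len, na.rm = TRUE)
sd_no_na   <- sd(toy_len, na.rm = TRUE)

mean_with_na
mean_no_na
sd_no_na
```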

Q9. Make a data frame with the column names "Mean" and "Standard_Dev" that holds the values from question 8. Hint: check out the function data.frame().

Q9 Solution

data.frame(Mean=mean_avgLength, Standard_Dev=sd_avgLength)
##     Mean Standard_Dev
## 1 113.75     14.85561