Lesson 16 Practice
Objectives
In this lesson, we learned about the classification based approach for RNA sequencing analysis. In this approach, we are aligning our raw sequencing reads to a reference transcriptome rather than a genome. Here, we will get to practice what we learned using the Golden Snidget dataset.
Create a directory for the classification based analysis output
Before getting started, let's create an output directory for the classification based analysis results in the ~/biostar_class/snidget folder.
cd ~/biostar_class/snidget
mkdir -p classification_based/salmon
Indexing the reference transcriptome
To align raw sequencing reads to a reference transcriptome, we will need a reference transcriptome (ie. the sequences of known transcripts in FASTA format). Fortunately, a reference transcriptome is included with the Golden Snidget dataset, so we will just need to index the transcriptome using salmon. Do you recall how to do this? Note, we will store the transcriptome index in the ~/biostar_class/snidget/refs folder.
Solution
cd refs
salmon index -t transcripts.fa -i transcripts.idx
Obtaining expression counts
Next, we need to generate the counts (ie. number of reads that map to a transcript). But first, change back into the ~/biostar_class/snidget folder and then take a moment to think about how you can do this for all of the samples in one go. Hint, it will be easier if you go into ~/biostar_class/snidget/reads and create a text file called ids.txt that contains the sample names for this data set.
Solution
cd ~/biostar_class/snidget/reads
parallel echo {1}_{2} ::: BORED EXCITED ::: 1 2 3 > ids.txt
Now, go back up one directory to ~/biostar_class/snidget
cd ~/biostar_class/snidget
cat reads/ids.txt | parallel "salmon quant -i refs/transcripts.idx -l A --validateMappings -1 reads/{}_R1.fq -2 reads/{}_R2.fq -o classification_based/salmon/{}_SALMON"
After getting the counts, change into the classification_based folder. How do we do this from the ~/biostar_class/snidget directory.
Solution
cd classification_based
To proceed further with the analysis we need a design.csv file with the sample names, which essentially are the names of salmon output folders and the treatment condition of the samples. Take a moment to create the design.csv file.
Solution
nano design.csv
Copy and paste the text below into the nano editor, hit control x and then save to go back to the terminal.
sample,condition
BORED_1_SALMON,BORED
BORED_2_SALMON,BORED
BORED_3_SALMON,BORED
EXCITED_1_SALMON,EXCITED
EXCITED_2_SALMON,EXCITED
EXCITED_3_SALMON,EXCITED
Remember salmon quant generates one folder for each sample and saves the counts for each sample in that particular folder. We will need to combine these count files into one. Do you remember how to do this?
Solution
We need to use the script combine_transcripts.r
Rscript $CODE/combine_transcripts.r design.csv salmon
Differential expression analysis
Next, how do we generate the differential expression results?
Solution
Rscript $CODE/deseq2.r
Visualization
Can we generate an expression heatmap?
Solution
Rscript $CODE/create_heatmap.r
Next, let's generate the Principal Components Analysis plot. But first, we need to convert the counts.csv and design.csv files to their tab delimited counterparts. Remember in Lesson 14 that we used the tr command to convert a tab delimited text file to a comma separated file. Can you do the opposite?
Solution
cat counts.csv | tr ',' '\t' > counts.txt
cat design.csv | tr ',' '\t' > design.txt
After generating the tab delimited expression counts table and design file, take a moment to recall how we would generate the Principal Components Analysis plot.
Solution
Rscript $CODE/create_pca.r counts.txt design.txt 6
Take a look at the differential expression results from the classification based approach. Do the gene expression changes reflect the expected?
Solution
column -t -s ',' results.csv | less -S
Bonus question (if there is time)
Can we combine just the fold change columns in the alignment based and classification based differential analysis results to compare the two?
Hint: first copy the results.csv file from the alignment based method in the ~/biostar_class/snidget/snidget_deg folder to ~/biostar_class/snidget and name it snidget_deg_genome.csv. Then, copy the results.csv file from the ~/biostar_class/snidget/classfication_based folder to the ~/biostar_class/snidget folder and rename it snidget_deg_transcriptome.csv. Then make use of the cut, sed, sort, and paste commands to get a merged file. Work in the ~/biostar_class/snidget directory for this exercise.
Solution
cd ~/biostar_class/snidget
cp ~/biostar_class/snidget/snidget_deg/results.csv snidget_deg_genome.csv
cp classification_based/results.csv snidget_deg_transcriptome.csv
cut -f1,5 -d ',' snidget_deg_transcriptome.csv | (sed -u 1q; sort) | less -S > snidget_deg_transcriptome_sorted.csv
cut -f1,5 -d ',' snidget_deg_genome.csv | (sed -u 1q; sort) | less -S > snidget_deg_genome_sorted.csv
paste snidget_deg_genome_sorted.csv snidget_deg_transcriptome_sorted.csv > snidget_deg_genome_vs_transcriptome.csv
The left half of the table are fold changes derived from the alignment based analysis. The right half of the table are fold changes derived from classification based analysis.
column -t -s ',' snidget_deg_genome_vs_transcriptome.csv | less -S