Lesson 16 Practice

Objectives

In this lesson, we learned about the classification based approach for RNA sequencing analysis. In this approach, we are aligning our raw sequencing reads to a reference transcriptome rather than a genome. Here, we will get to practice what we learned using the Golden Snidget dataset.

Create a directory for the classification based analysis output

Before getting started, let's create an output directory for the classification based analysis results in the ~/biostar_class/snidget folder.

cd ~/biostar_class/snidget

mkdir -p classification_based/salmon

Indexing the reference transcriptome

To align raw sequencing reads to a reference transcriptome, we will need a reference transcriptome (ie. the sequences of known transcripts in FASTA format). Fortunately, a reference transcriptome is included with the Golden Snidget dataset, so we will just need to index the transcriptome using salmon. Do you recall how to do this? Note, we will store the transcriptome index in the ~/biostar_class/snidget/refs folder.

Solution

cd refs

salmon index -t transcripts.fa -i transcripts.idx

Obtaining expression counts

Next, we need to generate the counts (ie. number of reads that map to a transcript). But first, change back into the ~/biostar_class/snidget folder and then take a moment to think about how you can do this for all of the samples in one go. Hint, it will be easier if you go into ~/biostar_class/snidget/reads and create a text file called ids.txt that contains the sample names for this data set.

Solution

cd ~/biostar_class/snidget/reads

parallel echo {1}_{2} ::: BORED EXCITED ::: 1 2 3 > ids.txt

Now, go back up one directory to ~/biostar_class/snidget

cd ~/biostar_class/snidget

cat reads/ids.txt | parallel "salmon quant -i refs/transcripts.idx -l A --validateMappings -1 reads/{}_R1.fq -2 reads/{}_R2.fq -o classification_based/salmon/{}_SALMON"

After getting the counts, change into the classification_based folder. How do we do this from the ~/biostar_class/snidget directory.

Solution

cd classification_based

To proceed further with the analysis we need a design.csv file with the sample names, which essentially are the names of salmon output folders and the treatment condition of the samples. Take a moment to create the design.csv file.

Solution

nano design.csv

Copy and paste the text below into the nano editor, hit control x and then save to go back to the terminal.

sample,condition
BORED_1_SALMON,BORED
BORED_2_SALMON,BORED
BORED_3_SALMON,BORED
EXCITED_1_SALMON,EXCITED
EXCITED_2_SALMON,EXCITED
EXCITED_3_SALMON,EXCITED

Remember salmon quant generates one folder for each sample and saves the counts for each sample in that particular folder. We will need to combine these count files into one. Do you remember how to do this?

Solution

We need to use the script combine_transcripts.r

Rscript $CODE/combine_transcripts.r design.csv salmon

Differential expression analysis

Next, how do we generate the differential expression results?

Solution

Rscript $CODE/deseq2.r

Visualization

Can we generate an expression heatmap?

Solution

Rscript $CODE/create_heatmap.r

Next, let's generate the Principal Components Analysis plot. But first, we need to convert the counts.csv and design.csv files to their tab delimited counterparts. Remember in Lesson 14 that we used the tr command to convert a tab delimited text file to a comma separated file. Can you do the opposite?

Solution

cat counts.csv | tr ',' '\t' > counts.txt

cat design.csv | tr ',' '\t' > design.txt

After generating the tab delimited expression counts table and design file, take a moment to recall how we would generate the Principal Components Analysis plot.

Solution

Rscript $CODE/create_pca.r counts.txt design.txt 6

Take a look at the differential expression results from the classification based approach. Do the gene expression changes reflect the expected?

Solution

column -t -s ',' results.csv | less -S

Bonus question (if there is time)

Can we combine just the fold change columns in the alignment based and classification based differential analysis results to compare the two?

Hint: first copy the results.csv file from the alignment based method in the ~/biostar_class/snidget/snidget_deg folder to ~/biostar_class/snidget and name it snidget_deg_genome.csv. Then, copy the results.csv file from the ~/biostar_class/snidget/classfication_based folder to the ~/biostar_class/snidget folder and rename it snidget_deg_transcriptome.csv. Then make use of the cut, sed, sort, and paste commands to get a merged file. Work in the ~/biostar_class/snidget directory for this exercise.

Solution

cd ~/biostar_class/snidget

cp ~/biostar_class/snidget/snidget_deg/results.csv snidget_deg_genome.csv

cp classification_based/results.csv snidget_deg_transcriptome.csv

cut -f1,5 -d ',' snidget_deg_transcriptome.csv | (sed -u 1q; sort) | less -S > snidget_deg_transcriptome_sorted.csv

cut -f1,5 -d ',' snidget_deg_genome.csv | (sed -u 1q; sort) | less -S > snidget_deg_genome_sorted.csv

paste snidget_deg_genome_sorted.csv snidget_deg_transcriptome_sorted.csv > snidget_deg_genome_vs_transcriptome.csv

The left half of the table are fold changes derived from the alignment based analysis. The right half of the table are fold changes derived from classification based analysis.

column -t -s ',' snidget_deg_genome_vs_transcriptome.csv | less -S