Lesson 15 Practice
Objectives
Previously, we performed QC on the Golden Snidget RNA sequencing data, aligned the sequencing reads to its genome, and obtained expression counts. We can now finally perform differential expression analysis, to find out which genes are differentially expressed between the EXCITED and BORED states of the Golden Snidget.
Scripts used for differential analysis
Recall that the scripts used for differential expression analysis are in the folder /usr/local/code. We can also access this folder using the environmental variable CODE. How many scripts are in this folder (find out by not using the full path to the folder containing the scripts).
Solution
ls -1 $CODE | wc -l
Make sure we change into ~/biostar_class/snidget before starting.
cd ~/biostar_class/snidget
Create a folder to store the Golden Snidget differential expression analysis results
First, create a folder to store the Golden Snidget differential expression analysis results. Name this folder snidget_deg.
Solution
mkdir snidget_deg
Create the design file
Create the design.csv file using the nano editor. Recall that the design files contain nothing more than a column with sample names and a column informing of sample treatment condition. Some of the R helper scripts require a csv version of this, where the columns are separated by comma. Here, we create both before moving on.
Creating design.csv
nano design.csv
Copy the text below to the nano editor, hit control-x and save to return to the terminal.
sample,condition
BORED_1.bam,BORED
BORED_2.bam,BORED
BORED_3.bam,BORED
EXCITED_1.bam,EXCITED
EXCITED_2.bam,EXCITED
EXCITED_3.bam,EXCITED
Obtaining the counts table
Change into ~/biostar_class/snidget/snidget_hisat2/ when running featureCounts to obtain the expression counts table. The annotation file for this dataset is in ~/biostar_class/snidget/refs and is named features.gff. How would we construct featureCounts to obtain an expression counts table for the Golden Snidget?
Solution
cd ~/biostar_class/snidget/snidget_hisat2/
featureCounts -p -a ../refs/features.gff -g gene_name -o ../snidget_deg/counts.txt BORED_1.bam BORED_2.bam BORED_3.bam EXCITED_1.bam EXCITED_2.bam EXCITED_3.bam
Format the Golden Snidget counts table for differential expression analysis
Remember that the deseq2.r script requires that the expression counts table be in csv format. It is currently in tab delimited format as generated by featureCounts. There is also a header line that we need to get rid of in the counts table. Finally, recall that our expression counts table is stored as counts.txt in the ~/biostar_class/snidget/snidget_deg directory, so change into this before moving forward.
cd ~/biostar_class/snidget/snidget_deg
Do you remember how to remove the header line in the counts table?
Solution
sed '1d' counts.txt > counts_no_header.txt
Save the counts table without header, we will need it later.
Next, how do we remove columns 2 through 6 of the counts table and convert it from tab delimited to csv?
Solution
cut -f1,7-12 counts_no_header.txt | tr '\t' ',' > counts.csv
Again, save the counts table without header, we will need it later.
Now that the correctly formated counts table is generated. Let's see if we can remember how to run deseq2.r to generate the differential expression results.
Solution
Rscript $CODE/deseq2.r
Take a look at the results.csv file, which contains the differential expression analysis output. Can we sorted by largest to smallest fold change? In the sorted results table, what do you notice? How well do the fold change results match expected?
Solution
column -t -s ',' results.csv | (sed 1q; sort -k 6nr) | less -S
Visualizing expression
Let's create an expression heatmap. How do we do this? Looking at the heatmap, do the treatments (ie. BORED and EXCITED) cluster well together?
Solution
Rscript $CODE/create_heatmap.r
We will need to copy the result, heatmap.pdf to the public folder to view. And the BORED and EXCITED groups do cluster together.