Lesson 15 Practice

Objectives

Previously, we performed QC on the Golden Snidget RNA sequencing data, aligned the sequencing reads to its genome, and obtained expression counts. We can now finally perform differential expression analysis, to find out which genes are differentially expressed between the EXCITED and BORED states of the Golden Snidget.

Scripts used for differential analysis

Recall that the scripts used for differential expression analysis are in the folder /usr/local/code. We can also access this folder using the environmental variable CODE. How many scripts are in this folder (find out by not using the full path to the folder containing the scripts).

Solution

ls -1 $CODE | wc -l

Make sure we change into ~/biostar_class/snidget before starting.

cd ~/biostar_class/snidget

Create a folder to store the Golden Snidget differential expression analysis results

First, create a folder to store the Golden Snidget differential expression analysis results. Name this folder snidget_deg.

Solution

mkdir snidget_deg

Create the design file

Create the design.csv file using the nano editor. Recall that the design files contain nothing more than a column with sample names and a column informing of sample treatment condition. Some of the R helper scripts require a csv version of this, where the columns are separated by comma. Here, we create both before moving on.

Creating design.csv

nano design.csv

Copy the text below to the nano editor, hit control-x and save to return to the terminal.

sample,condition
BORED_1.bam,BORED
BORED_2.bam,BORED
BORED_3.bam,BORED
EXCITED_1.bam,EXCITED
EXCITED_2.bam,EXCITED
EXCITED_3.bam,EXCITED

Obtaining the counts table

Change into ~/biostar_class/snidget/snidget_hisat2/ when running featureCounts to obtain the expression counts table. The annotation file for this dataset is in ~/biostar_class/snidget/refs and is named features.gff. How would we construct featureCounts to obtain an expression counts table for the Golden Snidget?

Solution

cd ~/biostar_class/snidget/snidget_hisat2/

featureCounts -p -a ../refs/features.gff -g gene_name -o ../snidget_deg/counts.txt BORED_1.bam BORED_2.bam BORED_3.bam EXCITED_1.bam EXCITED_2.bam EXCITED_3.bam

Format the Golden Snidget counts table for differential expression analysis

Remember that the deseq2.r script requires that the expression counts table be in csv format. It is currently in tab delimited format as generated by featureCounts. There is also a header line that we need to get rid of in the counts table. Finally, recall that our expression counts table is stored as counts.txt in the ~/biostar_class/snidget/snidget_deg directory, so change into this before moving forward.

cd ~/biostar_class/snidget/snidget_deg

Do you remember how to remove the header line in the counts table?

Solution

sed '1d' counts.txt > counts_no_header.txt

Save the counts table without header, we will need it later.

Next, how do we remove columns 2 through 6 of the counts table and convert it from tab delimited to csv?

Solution

cut -f1,7-12 counts_no_header.txt | tr '\t' ',' > counts.csv

Again, save the counts table without header, we will need it later.

Now that the correctly formated counts table is generated. Let's see if we can remember how to run deseq2.r to generate the differential expression results.

Solution

Rscript $CODE/deseq2.r

Take a look at the results.csv file, which contains the differential expression analysis output. Can we sorted by largest to smallest fold change? In the sorted results table, what do you notice? How well do the fold change results match expected?

Solution

column -t -s ',' results.csv | (sed 1q; sort -k 6nr) | less -S

Visualizing expression

Let's create an expression heatmap. How do we do this? Looking at the heatmap, do the treatments (ie. BORED and EXCITED) cluster well together?

Solution

Rscript $CODE/create_heatmap.r

We will need to copy the result, heatmap.pdf to the public folder to view. And the BORED and EXCITED groups do cluster together.