Lesson 9 Practice

Objectives

In this practice session, we will apply our knowledge to

learn about the reference genome and annotation file for the Golden Snidget dataset
visualize the Golden Snidget genome using the Integrative Genomics Viewer (IGV). The instructor will demo this, and you can practice on your own after installing IGV.

Create a directory for the Golden Snidget analysis

Before getting started, let's create a folder called snidget within the ~/biostar_class directory to conduct our analysis. To do this, we first need to go to the ~/biostar_class folder. How do you check where you are and change into this folder?

Solution

pwd

If you are not in the ~/biostar_class folder, then change into it.

cd ~/biostar_class

Next, we will create the directory snidget within the ~/biostar_class folder. Take a moment to see if you can do this.

Solution

mkdir snidget

Now, we need to change into the snidget directory.

Solution

cd snidget
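The steps above can also be collapsed into two commands. This is a minimal sketch that assumes the same ~/biostar_class/snidget location used in this lesson; the -p option tells mkdir to create any missing parent folder and to succeed quietly if the folder already exists.

```shell
# Create ~/biostar_class/snidget, including ~/biostar_class if it is missing.
# With -p, mkdir does not complain when the directory already exists.
mkdir -p ~/biostar_class/snidget

# Change into the new directory and print the current location to confirm.
cd ~/biostar_class/snidget && pwd
```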

Where is my data?

The Golden Snidget reference genome is located at http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz. Can you download and extract it?

Solution

Download

wget http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz

OR

curl http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz -o golden.genome.tar.gz

Unpack

tar -xvf golden.genome.tar.gz

After unpacking golden.genome.tar.gz, what do you see?

Solution

ls -l

total 60
-rw-rw-r-- 1 joe joe 57462 Feb  5  2020 golden.genome.tar.gz
drwxrwxr-x 1 joe joe    70 Oct 25 00:06 refs

In addition to the golden.genome.tar.gz file, we have a refs folder. The refs folder contains the reference genome (genome.fa), reference transcriptome (transcripts.fa), and annotations (features.gff) for the Golden Snidget.

ls refs

features.gff  genome.fa  transcripts.fa
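It can be useful to look inside an archive before unpacking it. The sketch below builds a tiny stand-in archive in a scratch directory (the workdir variable and the demo.tar.gz name are hypothetical, and the file inside is not the real Golden Snidget genome), lists its contents with -t, and then extracts it into a chosen folder with -C.

```shell
# Build a small stand-in archive in a scratch directory (demo data only).
workdir=$(mktemp -d)
mkdir -p "$workdir/src/refs"
printf '>chr1\nACGT\n' > "$workdir/src/refs/genome.fa"
tar -czf "$workdir/demo.tar.gz" -C "$workdir/src" refs

# -t lists what the archive contains without extracting anything.
tar -tzf "$workdir/demo.tar.gz"

# -C extracts into the named directory instead of the current one.
mkdir -p "$workdir/out"
tar -xzf "$workdir/demo.tar.gz" -C "$workdir/out"
ls "$workdir/out/refs"

# The scratch directory can be removed afterwards with: rm -r "$workdir"
```

On the real data, tar -tzf golden.genome.tar.gz would be the equivalent listing step.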

The Golden Snidget sequencing reads are located at http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz. Can you download and extract them?

Solution

Download

wget http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz

OR

curl http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz -o golden.reads.tar.gz

Unpack

tar -xvf golden.reads.tar.gz

After downloading and unpacking golden.reads.tar.gz, what do you see?

Solution

ls -l

We see the two tar.gz files that were downloaded and a new folder called reads.

total 117384
-rw-rw-r-- 1 joe joe     57462 Feb  5  2020 golden.genome.tar.gz
-rw-rw-r-- 1 joe joe 120138017 Oct 25 00:18 golden.reads.tar.gz
drwxrwxr-x 1 joe joe       336 Oct 25 00:19 reads
drwxrwxr-x 1 joe joe        70 Oct 25 00:06 refs

The reads folder contains the FASTQ (fq) files for this dataset. We will be working with these in the next lesson.

ls reads

BORED_1_R1.fq  BORED_2_R1.fq  BORED_3_R1.fq  EXCITED_1_R1.fq  EXCITED_2_R1.fq  EXCITED_3_R1.fq
BORED_1_R2.fq  BORED_2_R2.fq  BORED_3_R2.fq  EXCITED_1_R2.fq  EXCITED_2_R2.fq  EXCITED_3_R2.fq
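The file names in the listing above suggest that each sample has an R1 and an R2 file. Assuming these are the two mates of paired-end sequencing (an assumption based only on the names), a quick check that no mate is missing could look like the sketch below. The check_pairs function is a hypothetical helper, and the demo runs on empty stand-in files rather than the real reads.

```shell
# check_pairs DIR: count the *_R1.fq files in DIR that have a matching
# *_R2.fq file, and report any R1 file without a mate on stderr.
check_pairs() {
    dir=$1
    pairs=0
    for r1 in "$dir"/*_R1.fq; do
        [ -e "$r1" ] || continue          # the pattern matched no files
        r2="${r1%_R1.fq}_R2.fq"           # swap the suffix to get the mate name
        if [ -e "$r2" ]; then
            pairs=$((pairs + 1))
        else
            echo "missing mate for $r1" >&2
        fi
    done
    echo "$pairs"
}

# Demo on empty stand-in files: BORED_1 is complete, EXCITED_1 lacks R2.
demo=$(mktemp -d)
touch "$demo/BORED_1_R1.fq" "$demo/BORED_1_R2.fq" "$demo/EXCITED_1_R1.fq"
check_pairs "$demo"
rm -r "$demo"
```

Run from ~/biostar_class/snidget as check_pairs reads, this should count one pair per sample if every file in the listing is present.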

Find out some details about the Golden Snidget genome and transcriptome

Now that we have downloaded the Golden Snidget reference files, let's take a moment to get to know the references. First, change into the refs folder. How do we do this from the ~/biostar_class/snidget directory?

Solution

cd refs

How many sequences are in the Golden Snidget genome?

How many bases are in the Golden Snidget genome (i.e., what is the genome size for the Golden Snidget)?

Solution

seqkit stats genome.fa

file                         format  type  num_seqs  sum_len  min_len  avg_len  max_len
/data/golden/refs/genome.fa  FASTA   DNA          1  128,756  128,756  128,756  128,756

The Golden Snidget genome has 1 sequence composed of 128,756 bases.
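If seqkit is not available, the same two numbers can be approximated with awk. This is a rough sketch, not a replacement for seqkit: the fasta_stats function is a hypothetical helper that counts header lines and sums the lengths of all other lines, and the demo uses a toy FASTA file.

```shell
# fasta_stats FILE: print "<number of sequences> <total bases>".
fasta_stats() {
    awk '
        /^>/ { n++; next }              # header line: one more sequence
             { total += length($0) }    # sequence line: add its length
        END  { print n + 0, total + 0 }
    ' "$1"
}

# Demo on a toy FASTA file with two sequences (6 bases and 2 bases).
demo=$(mktemp)
printf '>chr1\nACGT\nAC\n>chr2\nGG\n' > "$demo"
fasta_stats "$demo"    # prints: 2 8
rm "$demo"
```

On the real data the call would be fasta_stats genome.fa, and the result can be compared with the seqkit output above.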

How many transcripts does the Golden Snidget have?

How many bases does the longest transcript have?

How many bases does the shortest transcript have?

Solution

seqkit stats transcripts.fa

file                              format  type  num_seqs  sum_len  min_len  avg_len  max_len
/data/golden/refs/transcripts.fa  FASTA   DNA         92   82,756      273    899.5    2,022

The Golden Snidget has 92 transcripts.

The longest transcript is 2,022 bases.

The shortest transcript is 273 bases.

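Note: if you grep for > in Unix, be sure to put quotes around it. Without quotes, the shell treats > as output redirection and overwrites the file named after it.

Use grep ">" and NOT grep >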
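The difference is easiest to see on a throwaway file. The sketch below uses a scratch file (the scratch variable and its contents are made up for illustration) so that nothing real is lost.

```shell
# Work on a throwaway file so no real data is at risk.
scratch=$(mktemp)
printf '>t1\nACGT\n>t2\nGGCC\n' > "$scratch"

# Quoted: ">" reaches grep as the search pattern, so the two header
# lines are printed.
grep ">" "$scratch"

# Unquoted: the shell reads > as output redirection and empties the file
# before grep starts; grep then fails because it has no pattern.
grep > "$scratch" 2>/dev/null || true

# The file is now 0 bytes long.
wc -c < "$scratch"
```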

Is there an alternative way to get the number of transcripts in the Golden Snidget (i.e., without using seqkit)?

Solution

grep ">" transcripts.fa | wc -l
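The same idea extends to the longest and shortest transcripts. As a hedged sketch, grep -c counts matching lines in one step, and a small awk helper (fasta_lengths, a hypothetical name) prints each record name with its length so that sort can rank them. The demo runs on a toy FASTA file.

```shell
# fasta_lengths FILE: print "<name><TAB><length>" for each FASTA record.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len
               name = substr($1, 2); len = 0; next }
             { len += length($0) }
        END  { if (name != "") print name "\t" len }
    ' "$1"
}

# Demo on a toy FASTA file with two records.
demo=$(mktemp)
printf '>a first record\nACGT\nAC\n>b\nGG\n' > "$demo"

grep -c '^>' "$demo"                                   # number of records
fasta_lengths "$demo" | sort -k2,2n | head -n 1        # shortest record
fasta_lengths "$demo" | sort -k2,2n | tail -n 1        # longest record
rm "$demo"
```

Replacing the demo file with transcripts.fa gives values that can be checked against the seqkit stats output.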

The goal of the Golden Snidget RNA sequencing experiment is to find genes that are differentially expressed when the Golden Snidget is EXCITED compared to when it is BORED. Looking at the transcript names, what can you tell about a particular transcript in the EXCITED and BORED state? What would you expect the differential gene expression analysis to tell us when we get to this later on in the course? You will need to take a look at the features.gff file for this.

Solution

less features.gff

Note: press q to exit less.
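Besides paging through the file, a GFF file can be summarized from the command line because it is tab-separated with nine columns: the feature type is in column 3 and the attributes are in column 9. The sketch below is only an illustration; gff_types and gff_attributes are hypothetical helper names, and the toy lines (including the ID=x attribute) are invented rather than taken from features.gff.

```shell
# gff_types FILE: count features of each type (column 3), skipping the
# comment lines that start with "#".
gff_types() {
    grep -v '^#' "$1" | cut -f 3 | sort | uniq -c
}

# gff_attributes FILE: list the distinct values of the last column (column 9).
gff_attributes() {
    grep -v '^#' "$1" | cut -f 9 | sort -u
}

# Demo on a toy two-feature GFF file.
demo=$(mktemp)
printf '##gff-version 3\n' > "$demo"
printf 'chr1\tdemo\tgene\t1\t10\t.\t+\t.\tID=x\n' >> "$demo"
printf 'chr1\tdemo\tmRNA\t1\t10\t.\t+\t.\tID=x\n' >> "$demo"
gff_types "$demo"
gff_attributes "$demo"
rm "$demo"
```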

Look at the gene or transcript names in the last column of the annotations file (gene names and transcript names are the same in this dataset). Take AAA-750000-UP-4 as an example: the transcript name tells us that

In the BORED state, there are 750000 copies of the transcript.
In the EXCITED state, the expression of this transcript is UP 4 times (4 x 750000 = 3000000 copies).
Where expression in the EXCITED state is lower, we will see DOWN in the transcript name instead of UP.
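The arithmetic encoded in a name can be worked out with a short helper. This sketch assumes every name has exactly four hyphen-separated fields (label, BORED copies, direction, fold change). The UP case follows the example above; treating DOWN as division by the fold change is an assumption, since the lesson only states that DOWN marks lower expression. decode_name is a hypothetical function name.

```shell
# decode_name NAME: print BORED copies, direction, fold change, and the
# implied EXCITED copies for a name such as AAA-750000-UP-4.
decode_name() {
    echo "$1" | awk -F'-' '{
        bored = $2; dir = $3; fold = $4
        if (dir == "UP")        excited = bored * fold
        else if (dir == "DOWN") excited = bored / fold   # assumption, see text
        else                    excited = "NA"
        print bored, dir, fold, excited
    }'
}

decode_name AAA-750000-UP-4    # prints: 750000 UP 4 3000000
```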

Visualizing Golden Snidget genome and transcriptome in IGV

Let's open IGV locally on our computer. Then we will copy the Golden Snidget refs folder to our public directory so we can download the files and use them locally. Remember the location on your computer to which the files were downloaded. See if you can remember how to copy the Golden Snidget reference to the public directory. Hint: it may be easier to do this from the ~/biostar_class/snidget directory.

Solution

cd ~/biostar_class/snidget

cp -r refs ~/public
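To confirm that a recursive copy is complete, the source and destination folders can be compared with diff -r. The sketch below demonstrates this on scratch directories (src and dest are hypothetical stand-ins for ~/biostar_class/snidget and ~/public, and the file inside is toy data).

```shell
# Set up a stand-in source folder and an empty destination.
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir "$src/refs"
printf '>chr1\nACGT\n' > "$src/refs/genome.fa"

# Copy the folder and everything inside it.
cp -r "$src/refs" "$dest"

# diff -r compares the two folders file by file; no output and a zero
# exit status mean the copy matches the original.
diff -r "$src/refs" "$dest/refs" && echo "copy verified"
```

For the lesson data, the equivalent check would be diff -r refs ~/public/refs, run from ~/biostar_class/snidget.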

After the refs folder has been copied to public, click to open it, then right-click each of the files and choose

"Save link as" (if on Google Chrome or Firefox); include the appropriate file extension when saving
"Download Linked File as" (if on Safari)

Load the Golden Snidget reference

The first step in using IGV is to load our reference genome. Take some time to see if you recall how to do this.

Solution

In IGV, choose Genomes > Load Genome from File, then select the genome.fa file that was downloaded.

After loading the genome, let's view the transcripts in IGV and see how they line up in the genome. Take a moment to see if you recall how to do this.

Solution

In IGV, choose File > Load from File. Then choose the features.gff file and the result will look like the image below.

Take some time to explore IGV (zoom in, search for a transcript, pan around, and so on).