Lesson 9 Practice
Objectives
In this practice session, we will apply our knowledge to
- learn about the reference genome and annotation file for the Golden Snidget dataset
- visualize the Golden Snidget genome using the Integrative Genomics Viewer (IGV). The instructor will demo this, and you can practice on your own after getting IGV installed.
Create a directory for the Golden Snidget analysis
Before getting started, let's create a folder called snidget within the ~/biostar_class directory to conduct our analysis. To do this, we first need to go to the ~/biostar_class folder. How do you check where you are and change into this folder?
Solution
pwd
If you are not in the ~/biostar_class folder, then change into it.
cd ~/biostar_class
Next, we will create the directory snidget within the ~/biostar_class folder. Take a moment to see if you can do this.
Solution
mkdir snidget
Now, we need to change into the snidget directory.
Solution
cd snidget
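The three steps above can also be combined. The sketch below is one way to do it, using mkdir -p, which creates any missing parent folders and does not complain if the folder already exists. The variable base is a hypothetical stand-in for your home directory (a scratch folder here, so nothing real is touched); in class you would use ~ instead.

```shell
# Hypothetical scratch location standing in for the home directory (assumption);
# in the class environment this would simply be "$HOME".
base="$(mktemp -d)"

# -p creates biostar_class and snidget in one step and is safe to re-run.
mkdir -p "$base/biostar_class/snidget"
cd "$base/biostar_class/snidget"

# Confirm where we ended up.
pwd
```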
Where is my data?
The Golden Snidget reference genome is located at http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz. Can you download and extract it?
Solution
Download
wget http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz
OR
curl http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz -o golden.genome.tar.gz
Unpack
tar -xvf golden.genome.tar.gz
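It can be helpful to list what is inside an archive before unpacking it. The sketch below shows this on a small toy archive built on the spot (toy.genome.tar.gz and its contents are made up for illustration, not the real Golden Snidget data). The -t option lists, -x extracts, -z handles gzip compression, and -f names the archive file.

```shell
# Build a tiny stand-in archive in a scratch folder (toy data, not the real genome).
work="$(mktemp -d)"
cd "$work"
mkdir -p refs
printf '>chr1\nACGT\n' > refs/genome.fa
tar -czf toy.genome.tar.gz refs
rm -r refs

# List the contents without extracting anything.
tar -tzf toy.genome.tar.gz

# Extract, then confirm the folder is back.
tar -xzf toy.genome.tar.gz
ls refs
```

The same tar -tzf call should work on golden.genome.tar.gz if you want to preview it before unpacking.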
After unpacking golden.genome.tar.gz, what do you see?
Solution
ls -l
total 60
-rw-rw-r-- 1 joe joe 57462 Feb 5 2020 golden.genome.tar.gz
drwxrwxr-x 1 joe joe 70 Oct 25 00:06 refs
In addition to the golden.genome.tar.gz file, we have a refs folder. The refs folder contains the reference genome (genome.fa), reference transcriptome (transcripts.fa), and annotations (features.gff) for the Golden Snidget.
ls refs
features.gff genome.fa transcripts.fa
The Golden Snidget sequencing reads are located at http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz. Can you download and extract them?
Solution
Download
wget http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz
OR
curl http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz -o golden.reads.tar.gz
Unpack
tar -xvf golden.reads.tar.gz
After downloading and unpacking golden.reads.tar.gz, what do you see?
Solution
ls -l
We see the two tar.gz files that were downloaded and a new folder called reads.
total 117384
-rw-rw-r-- 1 joe joe 57462 Feb 5 2020 golden.genome.tar.gz
-rw-rw-r-- 1 joe joe 120138017 Oct 25 00:18 golden.reads.tar.gz
drwxrwxr-x 1 joe joe 336 Oct 25 00:19 reads
drwxrwxr-x 1 joe joe 70 Oct 25 00:06 refs
The reads folder contains the FASTQ (fq) files for this dataset. We will be working with these in the next lesson.
ls reads
BORED_1_R1.fq BORED_2_R1.fq BORED_3_R1.fq EXCITED_1_R1.fq EXCITED_2_R1.fq EXCITED_3_R1.fq
BORED_1_R2.fq BORED_2_R2.fq BORED_3_R2.fq EXCITED_1_R2.fq EXCITED_2_R2.fq EXCITED_3_R2.fq
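The listing above shows an R1 and an R2 file for every sample. A quick way to confirm that no file is missing is to loop over the R1 files and look for the matching R2. The sketch below does this on empty placeholder files (created here only for illustration) and assumes, based on the naming alone, that each _R1.fq file is meant to have a partner ending in _R2.fq.

```shell
# Create empty placeholder files in a scratch folder (toy stand-ins for real reads).
work="$(mktemp -d)"
mkdir -p "$work/reads"
cd "$work"
for sample in BORED_1 EXCITED_1; do
    : > "reads/${sample}_R1.fq"
    : > "reads/${sample}_R2.fq"
done

# For each R1 file, swap the suffix to get the expected R2 name and check it exists.
pairs=0
for r1 in reads/*_R1.fq; do
    r2="${r1%_R1.fq}_R2.fq"
    if [ -f "$r2" ]; then
        pairs=$((pairs + 1))
        echo "paired: $r1 $r2"
    else
        echo "missing partner for $r1"
    fi
done
echo "$pairs complete pairs"
```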
Find out some details about the Golden Snidget genome and transcriptome
Now that we have downloaded the Golden Snidget reference files, let's take a moment to get to know the references. First, change into the refs folder. How do we do this from the ~/biostar_class/snidget directory?
Solution
cd refs
How many sequences are in the Golden Snidget genome?
How many bases are in the Golden Snidget genome (i.e., what is the genome size for the Golden Snidget)?
Solution
seqkit stats genome.fa
file format type num_seqs sum_len min_len avg_len max_len
/data/golden/refs/genome.fa FASTA DNA 1 128,756 128,756 128,756 128,756
The Golden Snidget genome has 1 sequence composed of 128,756 bases.
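If seqkit is not available, the same two numbers can be approximated with standard Unix tools. The sketch below runs on a tiny made-up FASTA file (not genome.fa): header lines are counted with grep -c, and bases are counted by dropping the header lines, removing line breaks, and counting the characters that remain. It assumes the file contains only header and sequence lines.

```shell
# Toy FASTA: one sequence split over two lines (made-up data).
fasta="$(mktemp)"
printf '>chr1\nACGTAC\nGT\n' > "$fasta"

# Number of sequences = number of header lines.
num_seqs="$(grep -c "^>" "$fasta")"

# Number of bases = characters left after removing headers and newlines.
sum_len="$(grep -v "^>" "$fasta" | tr -d '\n' | wc -c | tr -d ' ')"

echo "$num_seqs sequence(s), $sum_len bases"
```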
How many transcripts does the Golden Snidget have?
How many bases does the longest transcript have?
How many bases does the shortest transcript have?
Solution
seqkit stats transcripts.fa
file format type num_seqs sum_len min_len avg_len max_len
/data/golden/refs/transcripts.fa FASTA DNA 92 82,756 273 899.5 2,022
The Golden Snidget has 92 transcripts.
The longest transcript is 2,022 bases.
The shortest transcript is 273 bases.
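To see which transcripts are the shortest and longest, and not only their lengths, one option is to compute a length for every record with awk and sort the result. The sketch below uses a small made-up FASTA file with records t1, t2, and t3; it is an illustration of the idea, not a replacement for seqkit.

```shell
# Toy FASTA with three records; t2 is split across two lines (made-up data).
fasta="$(mktemp)"
printf '>t1\nACGT\n>t2\nACGTACGT\nAC\n>t3\nAC\n' > "$fasta"

# Print "name<TAB>length" for each record, then sort by length (numeric, ascending).
lengths="$(awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (name != "") print name "\t" len }
' "$fasta" | sort -k2,2n)"

# First line is the shortest record, last line is the longest.
shortest="$(printf '%s\n' "$lengths" | head -n 1)"
longest="$(printf '%s\n' "$lengths" | tail -n 1)"
echo "shortest: $shortest"
echo "longest:  $longest"
```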
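Note: if you grep for > in Unix, be sure to put quotes around it. An unquoted > is interpreted by the shell as output redirection, so the file named after it would be overwritten.
grep ">" NOT grep >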
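To see why the quotes matter, the sketch below tries both forms on a throwaway file (never try the unquoted form on real data). With quotes, > is the search pattern. Without quotes, the shell treats > as "send output to this file" and empties the file before grep even starts.

```shell
# Throwaway FASTA plus a backup copy for comparison (made-up data).
scratch="$(mktemp)"
printf '>t1\nACGT\n>t2\nGGCC\n' > "$scratch"
cp "$scratch" "$scratch.bak"

# Quoted: ">" is passed to grep as the pattern; the file is only read.
grep ">" "$scratch"

# Unquoted: the shell opens the file for writing (emptying it) before grep runs,
# and grep itself fails because it received no pattern.
grep > "$scratch" 2>/dev/null || true

wc -c < "$scratch"       # size of the file after the unquoted command
wc -c < "$scratch.bak"   # size of the untouched backup
```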
Is there an alternative way to get the number of transcripts in the Golden Snidget (i.e., without using seqkit)?
Solution
grep ">" transcripts.fa | wc -l
92
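A slightly shorter variant is to let grep do the counting itself with the -c option, and to anchor the pattern with ^ so that only lines starting with > are counted. The sketch below compares both approaches on a made-up three-record FASTA file.

```shell
# Toy FASTA with three records (made-up data).
fasta="$(mktemp)"
printf '>t1\nACGT\n>t2\nGGCC\n>t3\nTTAA\n' > "$fasta"

# Approach from the solution above: print header lines, then count them.
n_pipe="$(grep ">" "$fasta" | wc -l | tr -d ' ')"

# Variant: grep -c counts matching lines directly; ^ anchors the match to line start.
n_count="$(grep -c "^>" "$fasta")"

echo "pipe: $n_pipe  grep -c: $n_count"
```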
The goal of the Golden Snidget RNA sequencing experiment is to find genes that are differentially expressed when the Golden Snidget is EXCITED compared to when it is BORED. Looking at the transcript names, what can you tell about a particular transcript in the EXCITED and BORED state? What would you expect the differential gene expression analysis to tell us when we get to this later on in the course? You will need to take a look at the features.gff file for this.
Solution
less features.gff
Note: hit Q to exit less.
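Instead of paging through the whole file, you can also pull out individual columns. GFF files are tab-separated with nine columns; the feature type is in column 3 and the attributes (where names live) are in column 9. The sketch below works on a made-up three-line GFF (the feature types and IDs are invented and may not match what features.gff actually contains).

```shell
# Toy GFF with a header line and three features (made-up data).
gff="$(mktemp)"
{
    printf '##gff-version 3\n'
    printf 'chr1\ttoy\tgene\t1\t100\t.\t+\t.\tID=geneA\n'
    printf 'chr1\ttoy\tmRNA\t1\t100\t.\t+\t.\tID=txA;Parent=geneA\n'
    printf 'chr1\ttoy\tgene\t200\t300\t.\t-\t.\tID=geneB\n'
} > "$gff"

# Skip comment lines, then count how often each feature type (column 3) occurs.
grep -v "^#" "$gff" | cut -f 3 | sort | uniq -c

# Show the first few entries of the last column (column 9, attributes).
grep -v "^#" "$gff" | cut -f 9 | head -n 5
```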
Look at the gene or transcript names in the last column of the annotation file (gene names and transcript names are the same in this dataset). Take AAA-750000-UP-4 as an example; the transcript name tells us that
- In the BORED state, there are 750000 copies of the transcript
- In the EXCITED state, the expression of this transcript is UP 4 times (4x750000=3000000 copies)
- Where the expression in the EXCITED state is down we will see DOWN in the transcript names
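The naming scheme above can be turned into a small calculation. The sketch below splits a name on the dashes and works out the expected copy number in each state. The function name decode is made up for illustration, and the handling of DOWN (dividing by the factor) is an assumption, since only the UP case is spelled out above.

```shell
# decode is a hypothetical helper: split NAME-COPIES-DIRECTION-FACTOR on "-".
decode() {
    printf '%s\n' "$1" | awk -F- '{
        bored = $2
        if ($3 == "UP")        excited = bored * $4
        else if ($3 == "DOWN") excited = bored / $4   # assumption: DOWN-N means N times lower
        else                   excited = bored        # assumption: anything else means unchanged
        printf "%s\tBORED=%d\tEXCITED=%d\n", $1, bored, excited
    }'
}

# The example from the text: 750000 copies when BORED, up 4 times when EXCITED.
decode "AAA-750000-UP-4"
```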
Visualizing Golden Snidget genome and transcriptome in IGV
Let's open IGV locally on our computer. Then we will copy the Golden Snidget refs folder to our public directory so we can download the files and use them locally. Remember the location on your computer to which the files were downloaded. See if you can remember how to copy the Golden Snidget reference to the public directory. Hint: it may be easier to do this from the ~/biostar_class/snidget directory.
Solution
cd ~/biostar_class/snidget
cp -r refs ~/public
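After a recursive copy it is worth checking that everything arrived. One way, sketched below on toy folders in a scratch location (the paths stand in for ~/biostar_class/snidget and ~/public and are not the real ones), is to compare the source and the copy with diff -r, which prints nothing when the two folders match.

```shell
# Scratch stand-ins for the snidget and public folders (assumed layout, toy data).
work="$(mktemp -d)"
mkdir -p "$work/snidget/refs" "$work/public"
printf '>chr1\nACGT\n' > "$work/snidget/refs/genome.fa"

cd "$work/snidget"

# -r copies the folder and everything inside it.
cp -r refs "$work/public"

# diff -r compares the two folders file by file; no output means they are identical.
diff -r refs "$work/public/refs" && echo "copy matches the original"
```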
After we have successfully copied the refs folder to public, click to open it, then right click on each of the files and choose
- "Save link as" (if on Google Chrome or Firefox) - include the appropriate file extension when saving
- "Download Linked File as" (if on Safari)
Load the Golden Snidget reference
The first step in using IGV is to load our reference genome. Take some time to see if you recall how to do this.
Solution
After loading the genome, let's view the transcripts in IGV and see how they line up in the genome. Take a moment to see if you recall how to do this.
Solution
Then choose the features.gff file and the result will look like the image below.
Take some time to explore IGV (zoom in, search for a transcript, pan around, and so on).