Skip to content

Lesson 8: Reference genomes and genome annotations used in RNA sequencing

Lesson 7 Review

Lesson 7 introduced the basics of RNA sequencing, including experimental considerations and basic ideas behind data analysis. In lessons 8 through 15 participants will get hands-on experience with RNA sequencing analysis.

Learning Objectives

After this class, participants will be able to

  • Describe a reference genome and why it is used in bioinformatics.
  • Use Unix commands to learn about the reference genome.
  • Understand the importance of a GTF file in RNA sequencing analysis.

Example Data

The data used are located in /data/classes/BTEP/hcc1395_b4b on Biowulf. This data was obtained from rnabio.org and essentially a transcriptomic profiling study comparing between tumor and normal HCC1395 breast cancer cells. There a 3 normal and 3 tumor samples for each condition.

Sign onto Biowulf

Before following along in this analysis, please sign onto Biowulf using the ssh construct below where "user" is the class participant's Biowulf user id.

ssh user@biowulf.nih.gov

Request an Interactive Session

Work Biowulf's request an interactive session to work compute node for compute intensive tasks using the sinteractive command below where:

  • --mem: enables the user to request the desired amount of RAM for analysis (12gb requested in this example).
  • --gres: enables the user to request generic comput resources such as local temporary storage space (lscratch with 5gb requested in this example).
sinteractive --mem=12gb --gres=lscratch:5

Copy the Data to User's /data folder

The data is located in the hcc1395_b4b folder of /data/classes/BTEP. Copy this to the participants' own /data directory in Biowulf prior to proceeding using the cp command with the -r option for copying folders. The directory path where the data resides is /data/classes/BTEP/hcc1395_b4b and the destination to which it needs to copied is /data/user/hcc1395_b4b. Be sure to change "user" to the participant's own Biowulf user id.

cp -r /data/classes/BTEP/hcc1395_b4b /data/user/hcc1395_b4b

Use the ls command to list the content in the participant's Biowulf /data folder to make sure the copy was successful.

ls
hcc1395_b4b

Explore the hcc1395_b4b Folder

Change into the hcc1395_b4b folder.

cd hcc1395_b4b

How many files and folders are in this?

ls -l

According to the ls -l output, there are two folders.

drwxr-x---. 2 user user 4096 Nov 21 18:50 reads
drwxr-x---. 2 user user 4096 Nov 21 18:50 references

Stay in hcc1395_b4b.

What is in the reads folder?

ls -1 reads

The read folder contains the FASTQ files. There are 12 of these, two for each sample as this study generated paired end reads.

hcc1395_normal_rep1_R1.fq
hcc1395_normal_rep1_R2.fq
hcc1395_normal_rep2_R1.fq
hcc1395_normal_rep2_R2.fq
hcc1395_normal_rep3_R1.fq
hcc1395_normal_rep3_R2.fq
hcc1395_tumor_rep1_R1.fq
hcc1395_tumor_rep1_R2.fq
hcc1395_tumor_rep2_R1.fq
hcc1395_tumor_rep2_R2.fq
hcc1395_tumor_rep3_R1.fq
hcc1395_tumor_rep3_R2.fq

Definition

In paired end sequencing, a template or unknown nucleotide fragment is sequenced from both ends, thus generating two FASTQ files per sample. A gap between the two reads is usually present in this sequencing protocol.
Image source: https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/design-considerations/.

What is in the references folder?

ls references

This folder contains the human chromosome 22 reference genome (22.fa) and the corresponding gtf annotation file (22.gtf). The adapters used in the study is found in illumina_multiplex.fa.

22.fa  22.gtf  illumina_multiplex.fa

Change into the references folder to learn more the files.

cd references

The Reference Genome

The reference genome is a completely assembled sequence that we can compare other sequences to. For high throughput sequencing, we need the known sequences so that we can find out where in the genome each of the sequencing reads came from. The reference genome in a way acts like a template that we can follow to reconstruct the genome of the unknown. In other words, think of the reference genome as a picture of the completed puzzle that helps us assemble the actual puzzle, by allowing us to overlap the pieces to see where they fit in the completed version. The file 22.fa contains is the reference genome for human chromosome 22.

Use the head command to view the first 10 lines of 22.fa.

head 22.fa

A FASTA (.fa) file starts with a header line that contains a ">" followed by the name or some annotation of the sequence (ie. chr22). The first 10 lines indicate unknown sequences (denoted by N's).

>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Note

The header line in a FASTA file can provide more information, depending on how the sequence was curated. As an example, in the human ADSL transcript nucleotide sequence, the header line tells us the accession number or sequence ID (NM_001363840.3), species in which the sequence was derived (Homo sapiens), name of the gene (ADSL) where the transcript originated, and that this is a mRNA sequence.

How many lines does 22.fa contain? To find out, use the wc command with -l option. The -l options tells wc to count the number of lines in a file.

wc -l 22.fa

The output of wc -l is the line number count followed by the name of the input file (ie. 22.fa). The result below indicates that 22.fa had 846976 lines.

846976 22.fa

But are there actual sequences in 22.fa (ie. A,T,C,or,G)? To find out, use the grep command with the following options for pattern searching.

  • -n: instructs grep to retrieve the line number of the file where the pattern is found.
  • -E: treats the pattern as an extended regular expression.
  • Within the single quotes is the search pattern A, T, C, or G where each is separated by "|".
  • The grep output is sent via | (pipe) to the head command to print the first 10 lines of the results.
grep -n -E 'A|T|C|G' 22.fa | head

The results of the above grep command shows that the first line with actual nucleotide sequences is 175168.

175168:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGAATTCTTGTGTTTATATAA
175169:TAAGATGTCCTATAATTTCTGTTTGGAATATAAAATCAGCAACTAATATGTATTTTCAAA
175170:GCATTATAAATACAGAGTGCTAAGTTACTTCACTGTGAAATGTAGTCATATAAAGAACAT
175171:AATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAAGGTTTCATC
175172:AGTAATTCATTATATCAAAATGCTCCAGGCCAGGCGTGGTGGCTTATGCCTGTAATCCCA
175173:GCACTTTGGGAGGTCGAAGTGGGCGGATCACTTGAGGTCAGGAGTTGGAGACTAGCCTGG
175174:CCAACATGATGAAACCCCGTCTCTAATAATAATAATAAAAAAAAATTAGCTGGGTGTGGT
175175:GGTGGGCAACTGTAATCTCAGCTAATCAGGAGGCTGAGGCAGAAGAATTGCTTGAACGTG
175176:GAAGACAGAGTTTACAGTGTGCCAAGATCACACCACCCTACTCCAACTTGGGTGACAGAG
175177:CAAGACTCAGTCTCAAGGAAAAAAAAAAGCTCGAAAAATGTTTGCTTATTTTGGTAAAAT

A question we might ask about the reference genome is how big is the reference (ie. how many bases)? To answer this, we can use the tool seqkit and it's stats feature.

Why do we need the size of the genome?

Prior to the sequencing experiment, the size of the genome will help us determine the number of reads needed to achieve a certain coverage (https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf).

After the experiment, we could use the size of our genome along with other information to decide the computing resources (ie. time and memory) needed for our analysis. Chromosome 22 is the second smallest in humans, so it would be faster to align to this than the entire human genome. For the sake of time in this class, that's why we chose to align to just this chromosome.

module load seqkit
seqkit stat 22.fa
file   format  type  num_seqs     sum_len     min_len     avg_len     max_len
22.fa  FASTA   DNA          1  50,818,468  50,818,468  50,818,468  50,818,468

GTF Annotation File

What is in the 22.gtf file? A gtf file is known as Genome Transfer File, which is essentially a tab delimited file (ie. columns in the file are separated by tab). It informs us of where different features such as genes, transcripts, exons, and coding sequences are found in a genome.

A gtf file is needed for couple of reasons.

First, even though we will map the RNA sequencing reads to a genome, we need to know which genomic features (ie. genes) the reads are aligning to generate some metric for expression (counts) and then perform differential expression analysis.

Second, because some of the sequencing reads can map to two exons, if we use a splice-aware-aligner, the information in the gtf file can be used to recognize the exon-exon boundaries (see Figure 1).

Otherwise, those reads that map to two exons will not be mapped and we end up losing information. See https://useast.ensembl.org/info/website/upload/gff.html for required information in a gtf file.

Table 2 shows the first three lines of 22.gtf.

The "score" column values represent the confidence in a feature existing in that genomic position. In this example the score contains a dot ".", which represents a missing value.

The frame column provides reading frame information. Where a "." appears in the gtf table, this means that the information is not available.

In Table 2, we are looking at gene ENSG00000277248.1, where the first line tells us that the feature is a gene, and the lines below this shows the transcript products for the gene as well as exons associated for that transcript.

Table 2: Preview of information presented in a gtf file

CHROMOSOME DATA SOURCE FEATURE START END SCORE STRAND FRAME ATTRIBUTE
chr22 ENSEMBL gene 10736171 10736283 . - . gene_id "ENSG00000277248.1"; gene_type "snRNA"; gene_status "NOVEL"; gene_name "U2"; level 3;
chr22 ENSEMBL transcript 10736171 10736283 . - . gene_id "ENSG00000277248.1"; transcript_id "ENST00000615943.1"; gene_type "snRNA"; gene_status "NOVEL"; gene_name "U2"; transcript_type "snRNA"; transcript_status "NOVEL"; transcript_name "U2.14-201"; level 3; tag "basic"; transcript_support_level "NA";
chr22 ENSEMBL exon 10736171 10736283 . - . gene_id "ENSG00000277248.1"; transcript_id "ENST00000615943.1"; gene_type "snRNA"; gene_status "NOVEL"; gene_name "U2"; transcript_type "snRNA"; transcript_status "NOVEL"; transcript_name "U2.14-201"; exon_number 1; exon_id "ENSE00003736336.1"; level 3; tag "basic"; transcript_support_level "NA";

Figure 1: A sequencing read (red fragments) aligning two exons (e1 and e2). Modified from: https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rb-rnaseq/tutorial.html


To view the gtf or other tabular data in Unix, we can cat the file and then pipe or send the output to the column command, which allows us to print the columns so that they are nicely aligned. In this case, we use the -t option in column to determine the number of columns in the input and by default white space is used to separate the columns. Piping to less -S allows us to truncate some really long lines like those found in the attributes column of the gtf file and this also allows us to scroll horizontally to view additional columns. Hit Q to exit the cat command.

cat 22.gtf | column -t | less -S
chr22  ENSEMBL  gene            10736171  10736283  .  -  .  gene_id  "ENSG00000277248.1";   gene_type      "snRNA";                               gene_status  "NOVEL";                >
chr22  ENSEMBL  transcript      10736171  10736283  .  -  .  gene_id  "ENSG00000277248.1";   transcript_id  "ENST00000615943.1";                   gene_type    "snRNA";                >
chr22  ENSEMBL  exon            10736171  10736283  .  -  .  gene_id  "ENSG00000277248.1";   transcript_id  "ENST00000615943.1";                   gene_type    "snRNA";                

Visualizing the reference genome and annotations

One of the things we will be doing quite often is to visualize genomics data using some sort of genome browser. In this course series, we will use a popular one called Integrative Genome Viewer(IGV). Let's visualize human chromosome 22 in IGV as a way to start becoming acquainted with the software.

Here, we will be using IGV that is installed on our local machine.

Instructions to install IGV:

Submit ticket to service.cancer.gov to get IGV installed on your machine.

We also need to download 22.fa and 22.gtf to our local desktop. To do this copy both into the ~/public directory.

cp 22.fa ~/public
cp 22.gtf ~/public

From there, download 22.fa and 22.gtf to our local machine. Note where the files were downloaded.

When we open IGV, we will see the window shown in (Figure 2).

Figure 2


IGV comes preloaded with several genomes, but if we do not see the one we need in the drop down menu we can always load it from the Genomes drop down on the menu bar where it gives us some options including loading the genome from local file or from the web or URL (Figure 3). Compatible file formats include FASTA, JSON, or ".genome".

Figure 3:


To load data into IGV, we will choose one of the options in the File drop down in the menu bar (see Figure 4). Note that from the File drop down, we can take snap shots of our view and save as either PNG or SVG images.

Figure 4:


Since IGV loaded hg19 upon startup, we will load the chromosome 22 reference genome using the Load from File option under the Genomes drop down in the menu bar (Figure 5).

Figure 5:


Next, we will load the 22.gtf file onto IGV so we can see the genes aligned to the reference genome (Figure 6). The blue rectangles represent genes and transcripts, we can zoom in to look at ENSG00000280363.1 by searching for this gene ID in the Go box (we can actually search by coordinates, ID, or name).

Figure 6


In Figure 7, we zoom into ENSG00000280363.1 on IGV, where ENSG00000280363.1 is the Ensembl accession number for gene CU104787.1. This gene overlaps ENSG00000279973.1, which is bage5. If we click on the solid blue line for each of the genes, a dialog box with more information will appear. The arrows on the genes point to the direction in which the gene was transcribed (these two genes were transcribed in different directions). Note that even though we search by gene ID in the Go box, the genomic coordinates will be displayed after IGV finds what we are looking for.

Figure 7


The transcripts are depicted by solid rectangles separated by lines (Figure 8). The solid rectangles are exons and the lines connecting are introns. If we click on the transcripts, a box pops up and we get more information regarding the transcript.

Figure 8


Within the exons, narrower solid rectangles represent untranslated regions or UTRs (Figure 9).

Figure 9


Among the many features of IGV, is the ability for users to zoom in close enough to see the bases (Figure 10).

Figure 10