Lesson 9: Aligning Next Generation Sequencing (NGS) Data to Genome

Lesson 8 Review

In Lesson 8, participants were introduced to the tool Trimmomatic, which can be used to remove low quality reads and adapter contamination from sequencing data. For the hcc1395 dataset, only adapters had to be removed. Subsequent QC of the trimmed data revealed that adapter contamination was removed. Thus, the next step will determine where in the genome each of the sequences in the trimmed FASTQ files for hcc1395 came from in a process known as the alignment or mapping.

Learning objectives

After this lesson, participants will be able to:

Describe a reference genome.
Understand why a splice aware aligner is needed in RNA sequencing.
Perform alignment using the tool HISAT2.

Sign onto Biowulf

Prior to getting started, sign onto Biowulf.

ssh user@biowulf.nih.gov

Next, change into the /data/user/hcc1395_b4b folder.

cd /data/user/hcc1395_b4b

Finally, request an interactive session with the following resources. The option --cpus-per-task is used to request 6 CPUs on Biowulf in the sinteractive command below in addition to the 12 gb of memory and 10 gb of local temporary storage.

sinteractive --cpus-per-task 6 --mem=12gb --gres=lscratch:10

The Reference Genome

The reference genome is a completely assembled sequence (ie. the order in which the nucleotides are arranged is known). For high throughput sequencing, the known sequence help scientists find out where in the genome each of the reads came from. The reference genome in a way acts like a template that researchers can follow to reconstruct the genome of the unknown. In other words, think of the reference genome as a picture of the completed puzzle that helps guide the assembly of the actual puzzle, by allowing the overlap of pieces to see where they fit in the completed version. The file 22.fa contains the reference genome for human chromosome 22.

Tip

Biowulf staff has downloaded and made the genome as well as annotations available on the cluster (see https://hpc.nih.gov/refdb/index.php) so users generally will not need to download anything on their own.

Change into the references folder in /data/user/hcc1395_b4b for this example. Since /data/user/hcc1395_b4b should be the current working directory, just do the following.

cd references

Use the head command to view the first 10 lines of 22.fa.

head 22.fa

A FASTA (.fa) file starts with a header line that contains a ">" followed by the name or some annotation of the sequence (ie. chr22). The first 10 lines indicate unknown sequences (denoted by N's).

>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Note

The header line in a FASTA file can provide more information, depending on how the sequence was curated. As an example, in the human ADSL transcript nucleotide sequence, the header line provides the accession number or sequence ID (NM_001363840.3), species in which the sequence was derived (Homo sapiens), name of the gene (ADSL) where the transcript originated, and that this is a mRNA sequence.

How many lines does 22.fa contain? To find out, use the wc command with -l option. The -l options tells wc to count the number of lines in a file.

wc -l 22.fa

The output of wc -l is the line number count followed by the name of the input file (ie. 22.fa). The result below indicates that 22.fa had 846976 lines.

846976 22.fa

But are there actual sequences in 22.fa (ie. A, T, C, or, G)? To find out, use the grep command with the following options for pattern searching.

-n: instructs grep to retrieve the line number of the file where the pattern is found.
-E: treats the pattern as an extended regular expression.
Within the single quotes is the search pattern A, T, C, or G where each is separated by "|".
The grep output is sent via | (pipe) to the head command to print the first 10 lines of the results.

grep -n -E 'A|T|C|G' 22.fa | head

The results of the above grep command shows that the first line with actual nucleotide sequences is 175168.

175168:NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGAATTCTTGTGTTTATATAA
175169:TAAGATGTCCTATAATTTCTGTTTGGAATATAAAATCAGCAACTAATATGTATTTTCAAA
175170:GCATTATAAATACAGAGTGCTAAGTTACTTCACTGTGAAATGTAGTCATATAAAGAACAT
175171:AATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAAGGTTTCATC
175172:AGTAATTCATTATATCAAAATGCTCCAGGCCAGGCGTGGTGGCTTATGCCTGTAATCCCA
175173:GCACTTTGGGAGGTCGAAGTGGGCGGATCACTTGAGGTCAGGAGTTGGAGACTAGCCTGG
175174:CCAACATGATGAAACCCCGTCTCTAATAATAATAATAAAAAAAAATTAGCTGGGTGTGGT
175175:GGTGGGCAACTGTAATCTCAGCTAATCAGGAGGCTGAGGCAGAAGAATTGCTTGAACGTG
175176:GAAGACAGAGTTTACAGTGTGCCAAGATCACACCACCCTACTCCAACTTGGGTGACAGAG
175177:CAAGACTCAGTCTCAAGGAAAAAAAAAAGCTCGAAAAATGTTTGCTTATTTTGGTAAAAT

How big is the 22.fa reference genome (ie. how many bases)? To answer this, use the tool seqkit and it's stats feature.

Why is the size of the genome important?

Prior to the sequencing experiment, the size of the genome will help us determine the number of reads needed to achieve a certain coverage (https://www.illumina.com/documents/products/technotes/technote_coverage_calculation.pdf).

After the experiment, we could use the size of the genome along with other information to decide the computing resources (ie. time and memory) needed for analysis. Chromosome 22 is the second smallest in humans, so it would be faster and require less compute power to align to this than the entire human genome, thus this dataset was chosen for the class.

module load seqkit

seqkit stat 22.fa

file   format  type  num_seqs     sum_len     min_len     avg_len     max_len
22.fa  FASTA   DNA          1  50,818,468  50,818,468  50,818,468  50,818,468

Go back up one folder to /data/user/hcc1395_b4b.

cd ..

Aligning hcc1395 Sequencing Data to Reference Genome

The tool HISAT2 will be used to align the trimmed hcc1395 FASTQ files to the human chromosome 22 reference. RNA sequencing analyses require the use of splice aware aligners in order to map sequences that span across exons.

A sequencing read (red fragments) aligning two exons (e1 and e2). Modified from: https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rb-rnaseq/tutorial.html

Build an Index

Users will need to build an index of the reference genome prior to aligning using HISAT2. This step essentially breaks down the reference into manageable pieces so that the alignment process is made more efficient.

Load HISAT2.

module load hisat2

 This is HISAT2 v 2.2.1, built for NCBI NGS 3.0.0
[+] Loading samtools 1.21  ...

Stay in /data/user/hcc1395_b4b for this exercise.

To build the index for HISAT2, use the hisat2-build command where the arguments are:

Path to the reference genome FASTA (or .fa) file: the reference genome file in this case is 22.fa and it is located in the references folder so the path starting from /data/user/hcc1395_b4b is references/22.fa.
The folder to write the indices: in this case, write the indices to the reference folder. Thus, the path starting from /data/user/hcc1395_b4b is references/22. Note that an extension is not needed as HISAT2 will append .ht to the index files.

hisat2-build references/22.fa references/22

After indexing of 22.fa has been completed, use ls with the -1 option to view the contents of the reference folder one item per line and confirm that 8 index files with .ht extension were generated in this step.

ls -1 references/

22.1.ht2
22.2.ht2
22.3.ht2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2

Alignment

Make a new folder hcc1395_hisat2 to store the HISAT2 alignment results.

mkdir hcc1395_hisat2

Stay in /data/user/hcc1395_b4b for this exercise.

To perform alignment for all samples at once, the parallel command will be used. Below is the explanation of the components.

cat hcc1395_sample_ids.txt: print the hcc1395 sample ids and | will send the output to parallel rather than printing to screen.
The HISAT2 alignment command is enclosed within double quotes within parallel. The alignment starts with the command hisat2 and arguments and options below are include:
-j: enables users to specify how many jobs to run in parallel (6 in this case since there are 6 samples).
-x: prompts user to enter the path to the reference genome indice (ie. references/22, note that the ht2 extension is not needed)
-1: prompts for the first read in the pair for paired end sequencing.
- The data is in the trimmed_reads folder.
- {} is used as a place holder for the sample ids sent by cat.
-2: prompts for the second read in the pair for paired end sequencing.
-S: prompts the output SAM file path. The alignment output will be stored in the folder hcc1395_hisat2.
--summary-file: prompts for the text (txt) file that summarizes alignment statistics. The summary will be stored in the folder hcc1395_hisat2.

cat hcc1395_sample_ids.txt | parallel -j 6 "hisat2 -x references/22 -1 trimmed_reads/{}_trimmed_R1.fq -2 trimmed_reads/{}_trimmed_R2.fq -S hcc1395_hisat2/{}.sam --summary-file hcc1395_hisat2/{}_hisat2_summary.txt"

After alignment completes, list the contents of hcc1395_hisat2 to see what outputs were generated.

ls -1 hcc1395_hisat2

The folder hcc1395_hisat2 contains the aligned data in .sam format and alignment summary text files (ie. those labeled with summary.txt).

hcc1395_normal_rep1_hisat2_summary.txt
hcc1395_normal_rep1.sam
hcc1395_normal_rep2_hisat2_summary.txt
hcc1395_normal_rep2.sam
hcc1395_normal_rep3_hisat2_summary.txt
hcc1395_normal_rep3.sam
hcc1395_tumor_rep1_hisat2_summary.txt
hcc1395_tumor_rep1.sam
hcc1395_tumor_rep2_hisat2_summary.txt
hcc1395_tumor_rep2.sam
hcc1395_tumor_rep3_hisat2_summary.txt
hcc1395_tumor_rep3.sam

Change into the hcc1395_hisat2 directory.

cd hcc1395_hisat2

Use cat to take a look at the alignment summary for hcc1395_normal_rep1.

cat hcc1395_normal_rep1_hisat2_summary.txt

331945 reads; of these:
  331945 (100.00%) were paired; of these:
    43449 (13.09%) aligned concordantly 0 times
    283794 (85.49%) aligned concordantly exactly 1 time
    4702 (1.42%) aligned concordantly >1 times
    ----
    43449 pairs aligned concordantly 0 times; of these:
      10336 (23.79%) aligned discordantly 1 time
    ----
    33113 pairs aligned 0 times concordantly or discordantly; of these:
      66226 mates make up the pairs; of these:
        35020 (52.88%) aligned 0 times
        30758 (46.44%) aligned exactly 1 time
        448 (0.68%) aligned >1 times
94.73% overall alignment rate

Users will see the alignment statistics shown above and reports are generated for all samples. The first line of the HISAT2 alignment statistics says of 331945 reads (100.00%) were paired. Recall from FASTQC that read 1 and read 2 FASTQ files for hcc1395_normal_rep1 each has 331945 reads after trimming. So the first line in the HISAT2 alignment statistics is saying that out of all the reads from read 1 and read 2 FASTQ files of hcc1395_normal_rep1, there are 331945 pairs. This means there are 331945x2 or 663890 reads for the hcc1395_normal_rep1 sample.

Following the first line of the HISAT2 alignment statistics, users will see some terminologies like concordant or discordant mapped reads. See the figure below for a visual explanation.

Source: Benjamin J. Raphael, Chapter 6: Structural Variation and Medical Genomics, PLOS Computational Biology, December 27, 2012

A concordant read pair is defined as those that align in an expected manner where the reads are oriented towards one another and the distance between the outer edges is within expected ranges.
A discordant read pair is defined as those that do not align as expected such as where the distance between the outer edges is smaller or larger than expected.

For the hcc1395_normal_rep1 sample, there are:

283794 pairs (567588 reads) that aligned concordantly once
4702 pairs (9404 reads) that aligned concordantly more than once - these are mapped to multiple parts of the genome and likely cause by duplicated or similiar regions in the genome
10336 paris (20672 reads) that aligned discordantly once
30758 reads that did align (neither concordantly or discordantly) once
448 reads that aligned (neither concordantly or discordantly) more than once

Sum up the number of reads that mapped in the above break down and divide by the total number of reads in the two FASTQ files for the hcc1395_normal_rep1 sample to get an overall alignment rate of 94.73%.

SAM File

Aligned sequencing data are stored in the form of SAM files. The SAM file is human readable so users can use cat, less, and head to view its content in the terminal.

SAM Header

SAM files always start off with metadata information and these lines start with @. SAMTOOLS can be used to view and manipulate SAM files.

module load samtools

To view the header for hcc1395_normal_rep1.sam, use the following samtools command construct, where:

head: this argument is used to view SAM file header.
hcc1395_normal_rep1.sam: SAM file in which the header should be viewed.

samtools head hcc1395_normal_rep1.sam

@HD includes
SAM file format information (in this case version 1.0 as indicated by VN:1.0)
Whether the alignment file has been sorted (in this case no as indicated by SO:unsorted)
@SQ provides reference information
SN denotes reference sequence name, which is chr22
LN is the reference length in bases, which is 50818468
@PG provides information about the program that was used to generate the alignment
ID is the program record identifier
PN is the program name (HISAT2 and version 2.2.1 was used as indicated next to VN)
CL is the command line call used to generate the alignment

After the metadata information, each SAM file contains 11 fields.

SAM File Information

samtools view hcc1395_normal_rep1.sam | head -1 | column -t | less -S

K00193:38:H3MYFBBXX:4:1101:10003:44458  99  chr22  31282436  60  151M  =  31282463  178  TTCCTTATGAAACAGGAAGAGTCCCTGGGCCCAGGCCTGGCCCACGGTTGTCAAGGCACATCATTGCCAGCAAGCTGAAGCATACCAGCAGCCACAACCTAGATCTCATTCCCAACCCAAAGTTCTGACTTCTGTACAAACTCGTTTCCAG  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKFKKFKKKKF<AAKKKKKKKKKKKKKKKKFKKKFKKKKKKKKKKKFKAFKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFKFFKKKKKKKKKKKKFKKKK  AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:151  YS:i:0  YT:Z:CP  NH:i:1

QNAME or query template name, which is essentially the sequencing read that was mapped onto a reference genome. Use the grep to search for any template in the SAM file and users will retrieve a sequence in the corresponding FASTQ file.
The second column is a FLAG that provides detail on mapping. See https://broadinstitute.github.io/picard/explain-flags.html to learn about these FLAGS. For instance, the first alignment in the hcc1395_normal_rep1 SAM file has a FLAG of 99, which indicates that:
1. The read is properly paired.
2. Read is mapped in proper pair.
3. The mate is in the reverse strand.
4. The read is the first in a pair.
Column three contains the name of our reference genome (ie. chr22)
Column four provides the left most genomic coordinate where the sequencing read maps.
The mapping quality (MAPQ) is provided in the fifth column (the higher the number, the less likely that the mapping is due to error); a value of 255 in this column means that the mapping quality is not available.
Column six presents the CIGAR string, which informs about the match/mismatch, insertion/deletion, etc. of the alignment.
Column seven is the reference sequence name of the primary alignment of the NEXT read in the template. A "=" will appear if this name is the same as the current (which is expected for paired reads)
The alignment position of the next read in the template is provided in column eight. When this is not available, the value is set to 0.
Column nine provides the template length (TLEN), which should reflect the DNA fragment that was size selected for during library preparation.
The tenth column is just the sequencing read (some are written as the reverse complement so be cautious. The FLAGS in column two will tell whether the sequence is reverse complemented).
The eleventh column is the Phred quality scores of the sequencing read.
For definition of optional fields, see https://samtools.github.io/hts-specs/.