Lesson 13: Aligning raw sequences to reference genome

Before getting started, remember to be signed on to the DNAnexus GOLD environment.

Lesson 11 Review

In Lesson 11 we learned to aggregate multiple FASTQC reports into one using MultiQC, which allows us to more easily interrogate the quality of sequencing data for multiple samples. We also learned to trim away adapters and poor quality reads in raw data using Trimmomatic (instructions for trimming using BBDuk are available in the Lesson 11 in content for you to look at even though we did not go over this).

Learning objectives

In this lesson, we will continue to use the Human Brain Reference (HBR) and Universal Human Reference (UHR) data and we will

Learn to align the sequencing data to reference genome using HISAT2, which is a splice aware aligner
Look at post alignment QC
Familiarize ourselves with the contents of alignment output
Learn to use SAMTOOLS to work with alignment output
Align sequences with BOWTIE2, which is not splice aware so we can visualize and compare to results obtained from HISAT2 using the Integrative Genome Viewer (IGV) in Lesson 14.

The skills learned in this lesson can be applied towards your own research and subsequent lessons in this course.

Creating an index for the reference genome

Let's start our exploration of sequencing read alignment by discussing the reference genome for human chromosome 22. For this, change into our ~/biostar_class/hbr_uhr/refs folder. In Lesson 9, we discussed why we need a reference genome when we are working with high throughput sequencing data. In high throughput sequencing, we obtain many sequence fragments in our experimental output and we do not know where in the genome these reads come from. The reference genome is a known standard that helps us reassemble these fragments.

cd ~/biostar_class/hbr_uhr/refs

As a review, when we list the contents of this folder, we will see that it contains the reference genome (22.fa) and annotation (22.gtf) for human chromosome 22.

ls

22.fa  22.gtf  ERCC92.fa  ERCC92.gtf

The first step in alignment is to create an index for the reference genome. Think of an index as a table of contents in a book. If we are searching for something in a book, we can either search from beginning to end and depending on the size of the book, could take a long time. Alternatively, we could use the table of contents to jump to and search only the relevant sections. Thus, an index allows the aligner to search more specifically and reduce computation time.

To align the HBR and UHR raw reads to chromosome 22 we will use a tool called HISAT2, which is a splice aware aligner used for RNA sequencing. This aligner will be able to handle the alignment of reads that fall on two exons. We will use the build feature of HISAT2 to create our index. Other aligners will have their own algorithm for indexing the reference genome.

To build the index for human chromosome 22, we type hisat2-build at the command prompt, followed by the name of FASTA file for the reference genome (22.fa in our case) and then the basename (ie. file name without extension) that we would like to use for our index (here we choose 22 as the basename).

hisat2-build 22.fa 22

After the index has been generated, we can list the contents of our biostar_class/hbr_uhr/refs folder to see if anything has changed. We use the -1 option with ls to list one item per row.

ls -1

Note that we now have HISAT2 genome indices, which are the 8 files that have extension "ht2".

22.1.ht2
22.2.ht2
22.3.ht2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2

Once the index for the human chromosome 22 reference has been created, change back into the ~/biostar_class/hbr_uhr folder.

cd ~/biostar_class/hbr_uhr

Then create a new folder called hbr_uhr_hisat2 and change into this. We will keep our alignment outputs in this folder.

mkdir hbr_uhr_hisat2

Creating environmental variables to reference the reads, refs, QC, and alignment subfolders in ~/biostar_class/hbr_uhr

Because we are now creating different folders that stores results from various stages in our data analysis, we could set up some environmental variables for these so we can more easily reference these folders while avoiding having to type long path names.

For now, we will create environmental variables for the reads, refs, QC, HISAT2 alignment and Bowtie2 aligment folder. To create an environmental variable, we use the export command, followed the variable name which assigned to a value (in this case a directory path) using "=". Note that the directory path is in double quotes.

The first part to the directory path is our home directory, usually denoted as /home/username, where username is the username you used to sign into the terminal. However, we anticipate that some in this class may prefer to copy and paste the code examples so to avoid having to construct a command for every username in this course, we will show you how to store your username as a variable and the command for setting environmental variables for our directory paths will reference the variable that has stored your username.

The command whoami is used to identify the username.

whoami

your username

Export the result of whoami, which is your username as an environmental variable called me.

export me="$(whoami)"

echo $me

your username

Now, when we export environmental variables, we can use /home/$me to reference the home directory.

export hbr_uhr_reads="/home/$me/biostar_class/hbr_uhr/reads"

export hbr_uhr_refs="/home/username/biostar_class/hbr_uhr/refs"

export hbr_uhr_qc="/home/username/biostar_class/hbr_uhr/QC"

export hbr_uhr_hisat2="/home/username/biostar_class/hbr_uhr/hbr_uhr_hisat2"

export hbr_uhr_bowtie2="/home/username/biostar_class/hbr_uhr/hbr_uhr_bowtie2"

To reference the directories using environmental variables, we use "$". For instance,

ls $hbr_uhr_qc

We see our FASTQC and MultiQC reports.

HBR_1_R1_fastqc.html  HBR_2_R1_fastqc.html  HBR_3_R1_fastqc.html  UHR_1_R1_fastqc.html  UHR_2_R1_fastqc.html  UHR_3_R1_fastqc.html  adapters
HBR_1_R1_fastqc.zip   HBR_2_R1_fastqc.zip   HBR_3_R1_fastqc.zip   UHR_1_R1_fastqc.zip   UHR_2_R1_fastqc.zip   UHR_3_R1_fastqc.zip   multiqc_report_hbr_uhr.html
HBR_1_R2_fastqc.html  HBR_2_R2_fastqc.html  HBR_3_R2_fastqc.html  UHR_1_R2_fastqc.html  UHR_2_R2_fastqc.html  UHR_3_R2_fastqc.html  multiqc_report_hbr_uhr_data
HBR_1_R2_fastqc.zip   HBR_2_R2_fastqc.zip   HBR_3_R2_fastqc.zip   UHR_1_R2_fastqc.zip   UHR_2_R2_fastqc.zip   UHR_3_R2_fastqc.zip

Performing alignment using HISAT2

Here, let's change back in the ~/biostar_class/hbr_uhr/hbr_uhr_hisat2 folder.

cd $hbr_uhr_hisat2

To align FASTQ files for one sample, we construct the HISAT2 command with the following options if wanted to align sequencing data for one sample.

The "-x" flag prompts us to enter the base name (ie. without extension) of genome index. The HISAT2 index in the ~/biostar_class/hbr_uhr/refs directory, which is represented by the environmental variable hbr_uhr_refs.
We specify files containing read 1 and read 2 of paired end sequencing after "-1" and "-2" flags, respectively. The reads are in ~/biostar_class/hbr_uhr/reads, which is represented by the environmental variable hbr_uhr_reads.
Using the "-S" flag, we indicate that we want to save the alignment results in the SAM format where SAM stands for Sequencing Alignment Mapped.

hisat2 -x $hbr_uhr_refs/22 -1 $hbr_uhr_reads/HBR_1_R1.fq -2 $hbr_uhr_reads/HBR_1_R2.fq -S HBR_1.sam

As HISAT22 is running the alignment, we get the alignment statistics below.

  118571 (100.00%) were paired; of these:
    56230 (47.42%) aligned concordantly 0 times
    61769 (52.09%) aligned concordantly exactly 1 time
    572 (0.48%) aligned concordantly >1 times
    ----
    56230 pairs aligned concordantly 0 times; of these:
      173 (0.31%) aligned discordantly 1 time
    ----
    56057 pairs aligned 0 times concordantly or discordantly; of these:
      112114 mates make up the pairs; of these:
        112061 (99.95%) aligned 0 times
        48 (0.04%) aligned exactly 1 time
        5 (0.00%) aligned >1 times
52.75% overall alignment rate

Let's breakdown the alignment statistics shown above for the sample HBR_1.

The first line of the HISAT2 alignment statistics says 118571 reads (100.00%) were paired. Recall from FASTQC that read 1 and read 2 FASTQ files for HBR_1 have 118571 reads, each (Figures 1 and 2). So the first line in the HISAT2 alignment statistics is telling us that out of all the reads from read 1 and read 2 FASTQ files of HBR_1, we have 118571 pairs, which agrees with what we know. Also, this means we have 118571x2 or 237142 reads for the HBR_1 sample.

Figure 1

Figure 2

Following the first line of the HISAT2 alignment statistics, we will see some terminologies like concordant or discordant mapped reads. See Figure 3 for a visual explanation.

A concordant read pair is defined as those that align in an expected manner where the reads are oriented towards one another and the distance between the outer edges is within expected ranges.
A discordant read pair is defined as those that do not align as expected such as where the distance between the outer edges is smaller or larger than expected.

For the HBR_1 sample, we have

61769 pairs (123538 reads) that aligned concordantly once
572 (1144 reads) pairs that aligned concordantly more than once - these are mapped to multiple parts of the genome and likely cause by duplicated regions in the genome
173 (346 reads) pairs that aligned discordantly once
48 reads that did align (neither concordantly or discordantly) once
5 reads that aligned (neither concordantly or discordantly) more than once

If we sum up the number of reads that mapped in the above break down and divide by the total number of reads in the two FASTQ files for the HBR_1 sample then we should get an overall alignment rate of 52.75%.

Figure 3: Source: Benjamin J. Raphael, Chapter 6: Structural Variation and Medical Genomics, PLOS Computational Biology, December 27, 2012

If we include the option --summary-file in the HISAT2 command, we can specify a file name to save the alignment statistcs.

Before moving further, let's use the parallel command to create a text file that contains the IDs for the HBR and UHR samples. This will allow us to align many FASTQ files at the same time.

parallel echo {1}_{2} ::: HBR UHR ::: 1 2 3 > $hbr_uhr_reads/ids.txt

cat $hbr_uhr_reads/ids.txt

HBR_1
HBR_2
HBR_3
UHR_1
UHR_2
UHR_3

To align all of the HBR and UHR FASTQ files we can take advantage of the parallel command. Let's use the --summary-file option to store the alignment statistics. We will see why this is useful in a bit.

cat $hbr_uhr_reads/ids.txt | parallel "hisat2 -x $hbr_uhr_refs/22 -1 $hbr_uhr_reads/{}_R1.fq -2 $hbr_uhr_reads/{}_R2.fq -S {}.sam --summary-file {}_hisat2_summary.txt"

After alignment with HISAT2, let's list the contents of the hbr_uhr_hisat2 directory to see what has changed.

ls -1

HBR_1.sam
HBR_1_hisat2_summary.txt
HBR_2.sam
HBR_2_hisat2_summary.txt
HBR_3.sam
HBR_3_hisat2_summary.txt
UHR_1.sam
UHR_1_hisat2_summary.txt
UHR_2.sam
UHR_2_hisat2_summary.txt
UHR_3.sam
UHR_3_hisat2_summary.txt

If we print the alignment summary file for HBR_1 (HBR_1_hisat2_summary.txt) then we should see the same statistics that we saw earlier.

cat HBR_1_hisat2_summary.txt

Recall from Lesson 11 that we can include summaries from post alignment steps into MultiQC reports. This is why we create the summary files for the HISAT2 alignments for the HBR and UHR data. So let's run MultiQC again to generate a report that includes the HISAT2 alignment statistics. Make sure that we change the ~/biostar_class/hbr_uhr/ folder of our home directory and then construct the multiqc command below, where we specify the output filename using --filname. The output will be written in the QC directory. We use "." to tell multiqc to search ~/biostar_class/hbr_uhr for any QC or log files and it will search not only the present working directory but also sub-directories.

cd ~/biostar_class/hbr_uhr

multiqc --filename $hbr_uhr_qc/multiqc_with_hisat2 .

Note that multiqc is now adding the alignment statistics in the report. We can open multiqc_report_1.html to take a look at this addition.

  /// MultiQC 🔍 | v1.13

|           multiqc | Search path : /home/joe/biostar_class/hbr_uhr
|         searching | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 82/82
|            snippy | Found 1 reports
|          bargraph | Tried to make bar plot, but had no data: snippy_variants
|           bowtie2 | Found 6 reports
|            fastqc | Found 14 reports
|           multiqc | Compressing plot data
|           multiqc | Report      : qc/multiqc_with_hisat2.html
|           multiqc | Data        : qc/multiqc_with_hisat2_data
|           multiqc | MultiQC complete

Because we did not specify to overwrite the initial MultiQC report (multiqc_report_1.html), which only contains the pre-alignment quality metrics a new file multiqc_report_1.html was created after running MultiQC. Let's copy this our ~/public folder to take a look.

cp $hbr_uhr_qc/multiqc_with_hisat2.html ~/public

In the navigation pane of the multiqc report, we now see a link to the hisat2 alignment statistics (Figure 4).

Figure 4

In the general statistics table, we now have a column indicating the overall alignment rate for each sample (Figure 5).

Figure 5

We also have a plot (Figure 6) that tells us

How many pair of reads mapped uniquely (ie. only once)
How many pair of reads mapped concordantly to more than one location
How many pair of reads mapped uniquely but discordantly
How many pair of reads did not align
How many pair of reads where one of the pair mapped uniquely
How many pair of reads where one of the pairs were multimapped

This information is the same that we see in the alignment statistics generated by HISAT2, except now we have combined it with our pre-alignment FASTQC reports and this provides a nice way for us to keep the logs and summary of our analysis in one place so we can share with colleagues and collaborators.

Figure 6

If we click on the bar for one of the samples, a dialogue box with alignment statistics appears and we can then see the numbers.

Figure 7

Working with SAM file alignment outputs

Change in the ~/biostar_class/hbr_uhr/hbr_uhr_hisat2 folder for this portion of the class.

cd ~/biostar_class/hbr_uhr/hbr_uhr_hisat2

Now that we have our SAM files generated for the HBR and UHR dataset, let's take a look at the one for HBR_1 (HBR_1.sam) to see what is in it. For your reference, information on the SAM file can be found here. A SAM file is tab delimited, so can opened using Excel.

SAM files always start off with metadata information and these lines start with @.

@HD includes
- SAM file format information (in this case version 1.0 as indicated by VN:1.0)
- Weather the alignment file has been sorted (in this case no as indicated by SO:unsorted)
@SQ provides reference information
- SN denotes reference sequence name, which is chr22
- LN is the reference length in bases, which is 50818468 as we found in Lesson 9 using seqkit stats
@PG provides information about the program we used to generate the alignment
- ID is the program record identifier
- PN is the program name (HISAT2 in our case version 2.2.1 as indicated next to VN)
- CL is the command line call used to generate the alignment

After the metadata information, each SAM file contains 11 fields.

QNAME or query template name, which is essentially the sequencing read that we want mapped onto a reference genome. If we grep the first sequence in the HBR_1.sam, then we should retrieve of the reads in the original FASTQ files (try it on your own).
The second column is a FLAG that tells us a bit about the mapping. Here is a good tool to help interpret these FLAGS. These FLAGS can inform us of things like pair end read alignment orientation.
Column three contains the name of our reference genome.
Column four tells us the left most genomic coordinate where the sequencing read maps.
The mapping quality (MAPQ) is provided in the fifth column (the higher the number, the more confident we are of the map); a value of 255 in this column means that the mapping quality is not available.
Column six presents the CIGAR string, which tells us information about the match/mismatch, insertion/deletion, etc. of the alignment.
Column seven is the reference sequence name of the primary alignment of the NEXT read in the template. We will see an "=" if this name is the same as the current (which we would expect for paired reads)
The alignment position of the next read in the template is provided in column 8. When this not available, the value is set to 0.
Column nine provides the template length (TLEN), which tells us how many bases of the reference genome does the sequencing read span.
The tenth column is just the sequencing read (some are written as the reverse complement so be cautious. The FLAGS in column two will tell us whether the sequence is reverse complemented).
The eleventh column is the Phred quality scores of the sequencing read.
For definition of optional fields, see https://samtools.github.io/hts-specs/.

The SAM file format is human readable so we will need to convert it to a machine readable format (Binary Alignment Map or BAM) before we can visualize alignment results using IGV and perform other downstream processes like obtaining read counts. To convert between SAM and BAM, we can use an application called SAMTOOLS.

If we type samtools on the command line, we can see that this application has lots of features. For an alignment file to be of use, we will need to sort it by genomic position using the sort feature of SAMTOOLS. So we type the following command where the -o flag prompts us to enter the output file name (in this case it will be a BAM file). The last argument is the input SAM file, which is the SAM file that we want to sort.

samtools sort -o HBR_1.bam HBR_1.sam

Now that we have our BAM file for HBR_1 generated, we need to index it. Similar to the idea of indexing a reference genome, indexing the BAM file will allow the program that uses it to more efficient search through it. For this we will use samtools index, where the -b flag tells SAMTOOLS to create the index from a BAM file. We include the extension ".bai" in the output file.

samtools index -b HBR_1.bam HBR_1.bam.bai

The sorting and indexing of our BAM files are perhaps the two necessary steps that allows us to visualize and move forward with our analysis. The above shows how to sort and index one BAM file at a time. Below, let's use the parallel command to take care of these tasks for all of our alignment outputs at one go.

cat $hbr_uhr_reads/ids.txt | parallel "samtools sort -o {}.bam {}.sam"

cat $hbr_uhr_reads/ids.txt | parallel "samtools index -b {}.bam {}.bam.bai"

Alignment of HBR and UHR raw sequencing data with Bowtie2

For RNA sequencing studies, we need to use a splice aware aligner to account for reads that map across exons. Bowtie2 is a commonly used aligner for DNA sequencing and is not splice aware. Let's align the HBR and UHR data to human chromosome 22 using Bowtie2 so we can see how the alignment results differ from HISAT2 when we visualize alignment results in Lesson 14.

To run Bowtie2, we will need to create a Bowtie2 specific index for chromosome 22, so change into the ~/biostar_class/hbr_uhr/refs folder.

cd $hbr_uhr_refs

Then, we use bowtie2-build using 22.fa as input (like we did with HISAT2) and assign to the index the basename of 22.

bowtie2-build 22.fa 22

Listing the contents of our biostar_class/hbr_uhr folder, we see the Bowtie2 indices for chromosome 22 has been created and these have extenison "*.bt2"

ls -1

22.1.bt2
22.1.ht2
22.2.bt2
22.2.ht2
22.3.bt2
22.3.ht2
22.4.bt2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2
22.fa
22.gtf
22.rev.1.bt2
22.rev.2.bt2
ERCC92.fa
ERCC92.gtf

Before running Bowtie2 alignment, creates a folder called hbr_uhr_bowtie2 in the ~/biostar_class/hbr_uhr directory and change into.

cd biostar_class/hbr_uhr

mkdir hbr_uhr_bowtie2

cd hbr_uhr_bowtie2

The construct for command to alignment using Bowtie2 is similar to Hisat2, with the exception that we append "_bowtie2" to the output so that we know that it is from a Bowtie2 alignment. HISAT2 is actually based on Bowtie2 http://daehwankimlab.github.io/hisat2/manual/.

cat $hbr_uhr_reads/ids.txt | parallel "bowtie2 -x $hbr_uhr_refs/22 -1 $hbr_uhr_reads/{}_R1.fq -2 $hbr_uhr_reads/{}_R2.fq -S {}_bowtie2.sam"

And now we will sort the Bowtie2 SAM files and convert to BAM.

cat $hbr_uhr_reads/ids.txt | parallel "samtools sort -o {}_bowtie2.bam {}_bowtie2.sam"

Finally, we will index the the Bowtie2 alignment results.

cat $hbr_uhr_reads/ids.txt | parallel "samtools index -b {}_bowtie2.bam {}_bowtie2.bam.bai"

MultiQC reports with pre-alignment QC and post-alignment statistics

HBR and UHR MultiQC with pre- and post- alignment statistics