Lesson 14: Visualizing alignment results
Before getting started, remember to be signed on to the DNAnexus GOLD environment.
Lesson 13 Review
Previously, we used the application HISAT2 to align the raw sequencing data from the Human Brain Reference (HBR) and Universal Human Reference (UHR) dataset. We created sorted and indexed alignment output in the form of BAM files that we could use to visualize results in the Integrative Genome Viewer (IGV). We also used the splice unaware Bowtie2 aligner to map the HBR and UHR reads to chromosome 22 and will see how these results differ from HISAT2 when visualizing in IGV.
Learning objectives
In this lesson, we will continue to use the HBR and UHR dataset and focus on learning how to visualize the alignment outputs (both HISAT2 and Bowtie2) in IGV.
Before getting started, change into the ~/biostar_class/hbr_uhr/hbr_uhr_hisat2 directory.
cd ~/biostar_class/hbr_uhr
Creating bigWig files
In Lesson 9, we got a short introduction on what IGV can do. It allows us to visualize genomic data such as reference genomes and how features such as genes and transcripts align to them. A common thing to do after aligning raw sequencing reads is to visually inspect the results in IGV. In this lesson, we will do exactly this. In Lesson 11, we generated the BAM files, here we will generate an additional file that is used to visualize sequencing coverage in IGV. This file format is known as bigWig or bw. bigWig files are binary so not human readable and make visualization faster because the computer only needs to store in memory the content that is needed to be displayed. We will use bedtools and bedGraphToBigWig to generate the bigWig files for the HISAT2 and Bowtie2 alignment of the HBR and UHR dataset.
Creating bigWig files - step 1, convert BAM to bedGraph
Step 1 for generating bigWig files is to convert the BAM alignment results to a bedGraph file that contains coverage along genomic regions.
Enchancing your vocabulary:
- BED file - this is also known as Browser Extensible Format and contains three columns, chromosome, start coordinate and end coordinate -- see Ian's response in this Biostars forum
- bedGraph - this has the same first three columns as the BED file but also has a fourth column that could be anything such as sequencing coverage -- also see Ian's response in this Biostars forum
Example BED file below -- https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
$ cat A.bed
chr1 10 20
chr1 20 30
chr2 0 500
Example bedGraph below -- https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
$ bedtools genomecov -ibam NA18152.bam -bg | head
chr1 554304 554309 5
chr1 554309 554313 6
chr1 554313 554314 1
chr1 554315 554316 6
chr1 554316 554317 5
chr1 554317 554318 1
chr1 554318 554319 2
chr1 554319 554321 6
chr1 554321 554323 1
chr1 554323 554334 7
To do generate a bedGraph file from BAM alignment outputs from the HBR and UHR dataset, we will use bedtools genomecov with the following options
- -ibam prompts us to provide the name of the BAM input file
- -split: when applied to RNA sequencing data, do not count reads that map to introns see this post on why we get reads that map to introns in RNA sequencing
- -bg reports sequencing depth along a genomic interval rather than at each nucleotide position (see Figure 1)
Figure 1: Source: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
Below, we take advantage of the parallel command to convert the BAM files from both HISAT2 and Bowtie2 alignments into bedgraph (bg) files for all of the samples in one go. We use cat to send the sample names, which are contained in a file called ids.txt in the reads folder to parallel. The bedtools command is enclosed in double quotes. The "{}" acts a place holder for the sample IDs sent by cat. We will use the BAM files in the respective alignment output folders (ie. hbr_uhr_hisat2 or hbr_uhr_bowtie2) and then write the output bedGraph file (bg extension) to the respective alignment output folders.
First, create bg files from HISAT2 alignments.
cat reads/ids.txt | parallel "bedtools genomecov -ibam hbr_uhr_hisat2/{}.bam -split -bg > hbr_uhr_hisat2/{}_hisat2.bg"
Next, create bg files from Bowtie2 alignments.
cat reads/ids.txt | parallel "bedtools genomecov -ibam hbr_uhr_bowtie2/{}_bowtie2.bam -split -bg > hbr_uhr_bowtie2/{}_bowtie2.bg"
Below we will look at the content of HBR_1_hisat2.bg sorted by sequencing depth (column 4) from highest to lowest. Hit Q to exit and go back to the prompt. The column command is used to print out tabular data nicely in the Unix terminal. The -t flag tells is how many columns are in the input and we enter the path to the input after that. We pipe the output of the column command to sort where we specify the column to sort by after -k and we want to sort column 4, numerically (n), from highest to lowest (r). We then pipe this output to less -S, which allows us to scroll to see columns that did not fit in the terminal.
column -t hbr_uhr_hisat2/HBR_1_hisat2.bg | sort -k 4nr | less -S
The first column is the chromosome, followed by the genomic coordinates and the sequencing depth.
chr22 42565097 42565102 567
chr22 42565102 42565137 566
chr22 42565137 42565142 565
chr22 42565142 42565167 564
chr22 42565167 42565169 562
chr22 42565096 42565097 531
Creating bigWig files - step 2, create a genome index
To proceed with converting the bedgraph files to bigWig, we need to first create an index of our genome using SAMTOOLS.
samtools faidx refs/22.fa
Listing the contents of our refs directory, we now see an index of the human chromsome 22 genome named 22.fa.fai.
ls refs
Creating bigWig files - step 3, actually creating the bigWig file
After the index file for the genome has been created, we can go ahead and run the following to create the bigWig files for both the HISAT2 and Bowtie2 alignments.
Generate bw files for the HISAT2 alignment.
cat reads/ids.txt | parallel "bedGraphToBigWig hbr_uhr_hisat2/{}_hisat2.bg refs/22.fa.fai hbr_uhr_hisat2/{}_hisat2.bw"
Generate bw files for the Bowtie2 alignment.
cat reads/ids.txt | parallel "bedGraphToBigWig hbr_uhr_bowtie2/{}_bowtie2.bg refs/22.fa.fai hbr_uhr_bowtie2/{}_bowtie2.bw"
Copying alignment results and gtf files to the ~/public directory
Here, we will be using IGV locally on our own computers. We will use the HISAT2 and Bowtie2 alignment results for HBR_1 as an example, so we need to copy the following to the ~/public directory.
cp hbr_uhr_hisat2/HBR_1.bam ~/public
cp hbr_uhr_hisat2/HBR_1.bam.bai ~/public
cp hbr_uhr_bowtie2/HBR_1_bowtie2.bam ~/public
cp hbr_uhr_bowtie2/HBR_1_bowtie2.bam.bai ~/public
cp hbr_uhr_hisat2/HBR_1_hisat2.bw ~/public
cp hbr_uhr_bowtie2/HBR_1_bowtie2.bw ~/public
cp refs/22.fa ~/public
cp refs/22.gtf ~/public
cp refs/22.fa.fai ~/public
Visualizing genomic alignment using IGV
For visualizing the HBR and UHR alignment results, we will use the built in Human hg38 genome. To do this, we will just goto the genome selection box and select hg38 (Figure 2). In Figure 3, we see confirmation that hg38 is loaded.
Figure 2
Figure 3
Next go to "File" and then "Load from File" to load the following alignment results for HBR_1. Let's also load chromosome 22 gtf file (22.gtf) by going to "File" and then "Load from File" (Figure 4).
- HBR_1.bam (alignment from splice aware HISAT2)
- HBR_1_bowtie2.bam (aligment from splice unaware Bowtie2)
- HBR_1_bowtie2.bw (aligment from splice unaware Bowtie2)
- HBR_1.bw (alignment from splice aware HISAT2)
Figure 4
Unfortunately, the entire IGV browser is empty after we loaded the data, with the exception of some coverage at chromosome 22 on the bigWig tracks and the gene tracks from hg38. This is because we aligned our HBR and UHR raw sequencing data only to chromosome 22.
Figure 5
Once we select chr22 (Figure 6), we will begin to see more information in IGV.
Figure 6
Let's now go over what the data tracks are
- On the top (Tracks 1) we have the bigWig or bw tracks for HBR_1 aligned with HISAT2 (HBR_1.bw) or aligned with Bowtie2 (HBR_1_bowtie2.bw).
- Tracks 2 - the BAM alignment for HBR_1 aligned with Bowtie2 (alignment and coverage tracks are separate)
- Tracks 3 - the BAM alignment for HBR_1 aligned with HISAT2 (alignment, splice junction, and coverage tracks are separate)
- On the bottom, we have the gene model track
Figure 7
When the alignment results have loaded, let's goto chr22:41,377,179-41,390,187 (enter this in the search box and select Go) (Figure 6). Things to note are
- The TEF gene occupies this region
- Note that in the HISAT2 alignment track, some of the reads have lines connecting them. These are reads are mapping across exons, so HISAT2, being splice aware can connect these. We do not see this feature in the Bowtie2 alignment.
- The HISAT2 alignment output also has a splice junction track, where the arc height and thickness is proportional to read depth.
- The bigWig (bw) tracks show only coverage information.
Figure 8
We should also note the difference in the coverage histogram between the HBR_1 HISAT2 and Bowtie2 alignments on the bigWig track (chr22:41,377,179-41,390,187).
Figure 9
If we zoom in to chr22:41,387,655, we will see that the coverage track is color partly in blue and partly in red (Figure 10). This is showing a potential variant, where the reference is C (look at the sequences right above the TEF gene model) and some of the samples have T.
Figure 10
If we go back up to the "File" tab in the IGV menu bar, we can select to "Load from Server" (Figure 11) to bring in other information such as SNPs into IGV (Figure 12).
Figure 11
Figure 12
Once we click ok in the menu shown in Figure 12, we will see a SNP track beneath the bigWig tracks. Also, we will find that a SNP record aligns to the location where we found our potential variant.
Figure 13
If we click on the SNP record, a dialog box containing more information about this variant appears.
Figure 14
Note that the presence of an "I" in the subject genome represents an insertion (click on it to see more details). In the example in Figure 15, we have a T insertion.
Figure 15
One of the things that we can do with IGV is to color or group reads according to certain criteria. For example, in paired end sequencing, the orientation of alignment for the pairs can alert us to potential structural variations so one of the things we can do is to go into our alignment (let's use the HBR_1 Bowtie2 alignment) by right clicking on the track, then select "Group alignments by" and then choose "pair orientation" (Figure 16).
Figure 16
Grouping the alignments by pair orientation, we see that the alignment in the HBR_1_Bowtie2.bam track has been separted into two groups - a track labeled "RL" and one labeled "LR" (Figure 17). See https://software.broadinstitute.org/software/igv/interpreting_pair_orientations for the definitions of the read pair orientations. In our example
- The alignments in the LR track are normal
- The alignments in the RL track suggest duplication or translocation when compared to the reference gneome
Figure 17