Lesson 14: Visualizing alignment results
Before getting started, remember to be signed on to the DNAnexus GOLD environment.
Lesson 13 Review
Previously, we used the application HISAT2 to align the raw sequencing data from the Human Brain Reference (HBR) and Universal Human Reference (UHR) dataset. We created sorted and indexed alignment output in the form of BAM files that we could use to visualize results in the Integrative Genome Viewer (IGV). We also used the splice unaware Bowtie2 aligner to map the HBR and UHR reads to chromosome 22 and will see how these results differ from HISAT2 when visualizing in IGV.
Learning objectives
In this lesson, we will continue to use the HBR and UHR dataset and focus on learning how to visualize the alignment outputs (both HISAT2 and Bowtie2) in IGV.
Creating bigWig files
Due to time constraints, we will not be going over the creation of bigWig files in class. The information below is for your reference and you can view these during your spare time.
In Lesson 9, we got a short introduction on what IGV can do. It allows us to visualize genomic data such as reference genomes and how features such as genes and transcripts align to them. A common thing to do after aligning raw sequencing reads is to visually inspect the results in IGV. In this lesson, we will do exactly this. In Lesson 11, we generated the BAM files, here we will generate an additional file that is used to visualize sequencing coverage in IGV. This file format is known as bigWig or bw. bigWig files are binary so not human readable and make visualization faster because the computer only needs to store in memory the content that is needed to be displayed. We will use bedtools and bedGraphToBigWig to generate the bigWig files for the HISAT2 and Bowtie2 alignment of the HBR and UHR dataset.
Be sure to change into the ~/biostar_class/hbr_uhr folder for this portion of the class.
cd ~/biostar_class/hbr_uhr
Creating bigWig files - step 1, convert BAM to bedGraph
Step 1 for generating bigWig files is to convert the BAM alignment results to a bedGraph (with extension bg) file that contains coverage along genomic regions.
Enhancing your vocabulary:
- BED file - this is also known as Browser Extensible Format and contains three columns, which are chromosome, start coordinate and end coordinate -- see Ian's response in this Biostars forum
- bedGraph - this has the same first three columns as the BED file but also has a fourth column that could be anything such as sequencing coverage -- also see Ian's response in this Biostars forum
Example BED file below -- https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
$ cat A.bed
chr1 10 20
chr1 20 30
chr2 0 500
Example bedGraph below -- https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
$ bedtools genomecov -ibam NA18152.bam -bg | head
chr1 554304 554309 5
chr1 554309 554313 6
chr1 554313 554314 1
chr1 554315 554316 6
chr1 554316 554317 5
chr1 554317 554318 1
chr1 554318 554319 2
chr1 554319 554321 6
chr1 554321 554323 1
chr1 554323 554334 7
To generate a bedGraph file from BAM alignment outputs from the HBR and UHR dataset, we will use an application called bedtools, which can be used for a range of tasks including compiling information on genomic intervals. We will use it's genomecov, which calculates coverage over a genomic range with the following options:
- -ibam: prompts us to provide the name of the BAM input file
- -split: when applied to RNA sequencing data, do not count reads that map to introns see this post on why we get reads that map to introns in RNA sequencing
- -bg: reports sequencing depth along a genomic interval rather than at each nucleotide position (See Figure 1, shows the ways we can get bedtools to output coverage information - where some options report coverage along an interval, some report at each base position. Bedtools also gives us the ability to fine tune how we count coverage along splice junctions with the split option)
Figure 1: Source: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
Below, we take advantage of the parallel command to convert the BAM files from both HISAT2 and Bowtie2 alignments into bedGraph (bg) files for all of the samples in one go.
- cat reads/ids.txt: this will read the HBR and UHR sample IDs that are stored in a file called ids.txt in the reads folder of the ~/biostar_class/hbr_uhr directory
- |: will pipe or send the output of cat to the next command, which is parallel
- we enclose the command that we want parallel to operate on in double quotes; here we are using bedtools and its genomecov subcommand, where the parameters -ibam, -split, and -bg were explained above
- hbr_uhr_hisat2/{}.bam: this is the path to our BAM file generated from aligning the HBR and UHR FASTQ files to genome
- hbr_uhr_hisat2 is the output directory for the HISAT2 alignment (hbr_uhr_bowtie2 is the output directory for the Bowtie2 alignment)
- we use {} as a place holder to receive the HBR and UHR sample IDs from cat (ie. HBR_1, HBR_2, HBR_3, UHR_1, UHR_2, UHR3) and these IDs are appended by .bam to complete the full path of the input BAM file
- hbr_uhr_hisat2/{}_hisat2.bg:
- we write the bedGraph output to the hbr_uhr_hisat2 for the HISAT2 alignment (hbr_uhr_bowtie2 for the Bowtie2 alignment)
- again, {} acts as a place holder to receive the HBR and UHR sample IDs from cat, and the sample IDs are appended with _hisat2.bg for the HISAT2 alignment ouput (or _bowtie2.bg for Bowtie2 alignment output) to complete the full path of the output bedGraph (bg) file
First, create bg files from HISAT2 alignments.
cat reads/ids.txt | parallel "bedtools genomecov -ibam hbr_uhr_hisat2/{}.bam -split -bg > hbr_uhr_hisat2/{}_hisat2.bg"
Next, create bg files from Bowtie2 alignments.
cat reads/ids.txt | parallel "bedtools genomecov -ibam hbr_uhr_bowtie2/{}_bowtie2.bam -split -bg > hbr_uhr_bowtie2/{}_bowtie2.bg"
Below we will look at the content of HBR_1_hisat2.bg sorted by sequencing depth (column 4) from highest to lowest using the column command.
- column: prints out tabular data nicely in the terminal
- -t: figures out how many columns there are in the input, which is HBR_1_hisat2.bg in the folder hbr_uhr_hisat2 (full input path: hbr_uhr_hisat2/HBR_1_hisat2.bg)
- |: pipes or sends the output from column to sort, where
- -k: allows us to specify the column to sort by, in this case we want to sort by column 4 (the coverage) and we want to sort numerically (n) and from highest to lowest (r) with the final construct of 4nr
- |: pipes or sends the sort output to less where -S allows us to scroll horizontally in the terminal to see columns that did not fit
column -t hbr_uhr_hisat2/HBR_1_hisat2.bg | sort -k 4nr | less -S
The first column is the chromosome, followed by the genomic coordinates and the sequencing depth.
chr22 42565097 42565102 567
chr22 42565102 42565137 566
chr22 42565137 42565142 565
chr22 42565142 42565167 564
chr22 42565167 42565169 562
chr22 42565096 42565097 531
Creating bigWig files - step 2, create a genome index
To proceed with converting the bedGraph files to bigWig, we need to first create an index of our genome using SAMTOOLS and it's faidx feature. Where faidx will index/extract FASTA.
samtools faidx refs/22.fa
Listing the contents of our refs directory, we now see an index of the human chromosome 22 genome named 22.fa.fai.
ls refs
Creating bigWig files - step 3, actually creating the bigWig files
After the index file for the genome has been created, we will use a tool called bedGraphToBigWig to generate bigWig (bw) files from bedGraph (bg).
Again, we use cat and parallel where
- cat reads/ids.txt: this will read the HBR and UHR sample IDs that are stored in a file called ids.txt in the reads folder of the ~/biostar_class/hbr_uhr directory
- |: will pipe or send the output of cat to the next command, which is parallel
- we enclose the command that we want parallel to operate on in double quotes; here we are using bedGraphToBigWig
- hbr_uhr_hisat2/{}_hisat2.bg: this is the path to our bg (bedGraph) file generated using bedtools genomecov
- hbr_uhr_hisat2 is the output directory for the HISAT2 alignment (hbr_uhr_bowtie2 is the output directory for the Bowtie2 alignment)
- we use {} as a place holder to receive the HBR and UHR sample IDs from cat (ie. HBR_1, HBR_2, HBR_3, UHR_1, UHR_2, UHR3) and these IDs are appended by .bg to complete the full path of the input BAM file
- refs/22.fa.fai: path for index of the human chromosome 22 reference
- hbr_uhr_hisat2/{}_hisat2.bw:
- we write the bigWig output to the hbr_uhr_hisat2 for the HISAT2 alignment (hbr_uhr_bowtie2 for the Bowtie2 alignment)
- again, {} acts as a place holder to receive the HBR and UHR sample IDs from cat, and the sample IDs are appended with _hisat2.bw for the HISAT2 alignment ouput (or _bowtie2.bw for Bowtie2 alignment output) to complete the full path of the output bigWig (bw) file
Generate bigWig files for the HISAT2 alignment.
cat reads/ids.txt | parallel "bedGraphToBigWig hbr_uhr_hisat2/{}_hisat2.bg refs/22.fa.fai hbr_uhr_hisat2/{}_hisat2.bw"
Generate bigWig files for the Bowtie2 alignment.
cat reads/ids.txt | parallel "bedGraphToBigWig hbr_uhr_bowtie2/{}_bowtie2.bg refs/22.fa.fai hbr_uhr_bowtie2/{}_bowtie2.bw"
Visualizing alignment HBR and UHR alignment results with IGV
In this portion of the class, it is very important that you have IGV already opened on your computer. See Figure 1 and Figure 2 on how to load the relevant alignment outputs for HBR_1 and UHR_1 into IGV.
Figure 1: Click on the BAM_BW.html under All Projects -> BioStars to access the IGV launcher for the HBR and UHR dataset.
Figure 2: Click the launch button to view the alignments.
We can also select the reference genome to use in the genome selection drop down menu. But here, we will stick with hg38 (Figure 3).
Figure 3
We can also load local data to IGV by going to "File" and then "Load from File" (Figure 4). But the alignment output has already been pre-loaded for us, so we will not need to load data from local.
Figure 4
Unfortunately, the entire IGV browser is empty after we loaded the data, with the exception of some coverage at chromosome 22 on the bigWig tracks from hg38. This is because we aligned our HBR and UHR raw sequencing data only to chromosome 22.
Figure 5
Once we select chr22 (Figure 6), we will begin to see more information in IGV.
Figure 6
Let's now go over what the data tracks are
- On the top (Tracks 1) we have the bigWig or bw tracks for HBR_1 aligned with HISAT2 (HBR_1.bw) or aligned with Bowtie2 (HBR_1_bowtie2.bw).
- Tracks 2 - the BAM alignment for HBR_1 aligned with Bowtie2 (alignment and coverage tracks are separate)
- Tracks 3 - the BAM alignment for HBR_1 aligned with HISAT2 (alignment, splice junction, and coverage tracks are separate)
- On the bottom, we have the gene model track
Figure 7
When the alignment results have loaded, let's go to chr22:41,377,179-41,390,187 (enter this in the search box and select Go) (Figure 6). Things to note are
- The TEF gene occupies this region
- Note that in the HISAT2 alignment track, some of the reads have lines connecting them. These reads are mapping across exons, so HISAT2, being splice aware can connect these. We do not see this feature in the Bowtie2 alignment.
- The HISAT2 alignment output also has a splice junction track, where the arc height and thickness is proportional to read depth.
- The bigWig (bw) tracks show only coverage information.
Figure 8
We should also note the difference in the coverage histogram between the HBR_1 HISAT2 and Bowtie2 alignments on the bigWig track (chr22:41,377,179-41,390,187).
Figure 9
If we zoom in to chr22:41,387,655, we will see that the coverage track is color partly in blue and partly in red (Figure 10). This is showing a potential variant, where the reference is C (look at the sequences right above the TEF gene model) and some of the samples have T.
Figure 10
If we go back up to the "File" tab in the IGV menu bar, we can select to "Load from Server" (Figure 11) to bring in other information such as SNPs into IGV (Figure 12).
Figure 11
Figure 12
Once we click ok in the menu shown in Figure 12, we will see a SNP track beneath the bigWig tracks. Also, we will find that a SNP record aligns to the location where we found our potential variant.
Figure 13
If we click on the SNP record, a dialog box containing more information about this variant appears.
Figure 14
Note that the presence of an "I" in the subject genome represents an insertion (click on it to see more details). In the example in Figure 15, we have a T insertion.
Figure 15
One of the things that we can do with IGV is to color or group reads according to certain criteria. For example, in paired end sequencing, the orientation of alignment for the pairs can alert us to potential structural variations so one of the things we can do is to go into our alignment (let's use the HBR_1 Bowtie2 alignment) by right clicking on the track, then select "Group alignments by" and then choose "pair orientation" (Figure 16).
Figure 16
Grouping the alignments by pair orientation, we see that the alignment in the HBR_1_Bowtie2.bam track has been separated into two groups - a track labeled "RL" and one labeled "LR" (Figure 17). See https://software.broadinstitute.org/software/igv/interpreting_pair_orientations for the definitions of the read pair orientations. In our example
- The alignments in the LR track are normal
- The alignments in the RL track suggest duplication or translocation when compared to the reference geneome
Figure 17