Lesson 11: Visualizing Genomic Data: Part 1
Quick Review
The hcc1395 sequencing data were aligned to reference genome in Lesson 9. This enables the researcher to determine where in the genome each sequence from the Next Generation Sequencing experiment came from.
Learning objectives
Following this lesson, participants will know how to prepare sequencing alignment results for visual inspection using the Integrative Genomics Viewer.
Signing onto Biowulf
Before getting started, sign onto Biowulf.
Next, change into the /data/user/hcc1395_b4b
directory.
Finally, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.
Preparing Alignment Output for Visualization Using IGV
This exercise will create bigWig files from the hcc1395 BAM alignment output. bigWig files have pre-computed sequencing depth information and will enable researchers to see areas in the genome where there are sequences mapped to it and subsequently zoom in on these regions to further interrogate.
Step 1: Convert BAM to bedGraph
The first step for generating bigWig files is to convert the BAM alignment results to bedGraph (with extension bg
) files that contains sequencing depth along genomic regions (ie. how many sequences mapped to a genomic region).
Definitions
BED file: this is also known as Browser Extensible Format and contains three columns, which are chromosome, start coordinate and end coordinate -- https://genome.ucsc.edu/FAQ/FAQformat.html#format1
bedGraph: this has the same first three columns as the BED file but also has a fourth column that could be anything such as sequencing coverage -- https://genome.ucsc.edu/goldenPath/help/bedgraph.html
Example BED file is shown below -- source: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
Example bedGraph is shown below -- source: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html
Chromosome Start End Depth
chr1 554304 554309 5
chr1 554309 554313 6
chr1 554313 554314 1
chr1 554315 554316 6
chr1 554316 554317 5
chr1 554317 554318 1
chr1 554318 554319 2
chr1 554319 554321 6
chr1 554321 554323 1
chr1 554323 554334 7
The application bedtools
will be used to convert the hcc1395 BAM alignment outputs to bedGraph format. bedtools
can be used for a range of tasks including compiling information on genomic intervals. The genomecov
function of bedtools
, which calculates depth over a genomic range with the following options will be used:
-ibam
: prompts users to provide the name of the BAM input file(s).-split
: when applied to RNA sequencing data, does not count reads that map to introns (see this post on why we get reads that map to introns in RNA sequencing).-bg
: reports sequencing depth along a genomic interval rather than at each nucleotide position (Figure 1 shows the ways thatbedtools
can report coverage information - where some options report coverage along an interval, some report at each base position.bedtools
also enables the fine tuning of coverage along splice junctions are reported using thesplit
option).
To use bedtools
on Biowulf, do:
To generate bedGraph file for all of the hcc1395 BAM alignments in one go, the command construct below can be used.
cat
: used to print the modified sample IDs from the hcc1395 study to the terminal (recall that these new sample IDs not have "hcc1395" at the beginning). The output is sent toparallel
using|
or pipe.-j 6
in theparallel
command tells it to perform 6 tasks simultaneously.- The
bedtools genomecov
construct is enclosed in double quotes withparallel
: - The input BAM files are located in the folder
hcc1395_hisat2
and{}
is used as a place holder for the sample ID sent bycat
. Thus, the path to the BAM files can be written ashcc1395_hisat2/{}.bam
. >
will write the output to thehcc1395_hisat2
folder into a file with sample name sent bycat
and.bg
extension. Thus, the path to the output ishcc1395_hisat2/{}.bg
.
cat hcc1395_sample_ids.txt | parallel -j 6 "bedtools genomecov -ibam hcc1395_hisat2/{}.bam -split -bg > hcc1395_hisat2/{}.bg"
Change into the hcc1395_hisat2
folder after the bedgraph files have been made and use the ls
command to ensure that that .bg
files are there.
Next, use head
to view the first 10 lines of hcc1395_normal_rep1.bg
.
The last column in the .bg
file is the sequencing depth.
chr22 10564456 10564607 1
chr22 10617976 10618127 1
chr22 10684437 10684486 1
chr22 10684486 10684588 2
chr22 10684588 10684637 1
chr22 10712579 10712719 2
chr22 10712719 10712728 1
chr22 10839950 10840096 2
chr22 10865165 10865298 6
chr22 10940293 10940443 1
Change back to the /data/user/hcc1395_b4b
.
Step 2: Create an Index for the Genome
To view genomic alignment output using IGV, an index for the reference needs to be created. samtools
can be used to create the index. Recall that reference genome is located in the references
folder and saved as 22.fa
.
22.1.ht2
22.2.ht2
22.3.ht2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2
22.fa
22.fa.fai
22.gtf
illumina_multiplex.fa
Step 3: Creating bigWig Files
The final step is to create the bigWig files using the application bedGraphToBigWig
from UCSC tools.
First, load ucsc
tools into the Biowulf working environment.
All of the hcc1395 .bg
files can be converted to bigWig (.bw
extension) at once using the command construct below. The first argument in bedGraphToBigWig
is the path to the .bg
files, which reside in the folder hcc1395_hisat2
(thus the path in the command is hcc1395_hisat2/{}.bg
where {}
is a place holder for the sample ID sent by cat
). The second argument is the path to the reference index (ie. 22.fa.fai
residing in the folder references
, thus the path as constructed from the hcc1395_b4b
folder is references/22.fa.fai
). The final argument is the path to the bigWig file outputs, which will be written to the folder hcc1395_hisat2
and will have .bw
extension. The for the output from the hcc1395_b4b
folder will be hcc1395_hisat2/{}.bw
.
cat hcc1395_sample_ids.txt | parallel -j 6 "bedGraphToBigWig hcc1395_hisat2/{}.bg references/22.fa.fai hcc1395_hisat2/{}.bw"
List the contents of hcc1395_hisat2
to confirm that the .bw
files are written.