Lesson 11: Visualizing Genomic Data: Preparing Files

Quick Review

The hcc1395 sequencing data were aligned to reference genome in Lesson 9. This enables the researcher to determine where in the genome each sequence from the Next Generation Sequencing experiment came from.

Learning objectives

Following this lesson, participants will know how to prepare sequencing alignment results for visual inspection using the Integrative Genomics Viewer.

Signing onto Biowulf

Before getting started, sign onto Biowulf.

ssh user@biowulf.nih.gov

Next, change into the /data/user/hcc1395_b4b directory.

cd /data/user/hcc1395_b4b

Finally, request an interactive session with 6 CPUs, 12 gb of RAM or memory and 10 gb of local temporary storage.

sinteractive --cpus-per-task 6 --mem=12gb --gres=lscratch:10

Preparing Alignment Output for Visualization Using IGV

This exercise will create bigWig files from the hcc1395 BAM alignment output. bigWig files have pre-computed sequencing depth information and will enable researchers to see areas in the genome where there are sequences mapped to it and subsequently zoom in on these regions to further interrogate.

Step 1: Convert BAM to bedGraph

The first step for generating bigWig files is to convert the BAM alignment results to bedGraph (with extension bg) files that contains sequencing depth along genomic regions (ie. how many sequences mapped to a genomic region).

Definitions

BED file: this is also known as Browser Extensible Format and contains three columns, which are chromosome, start coordinate and end coordinate -- https://genome.ucsc.edu/FAQ/FAQformat.html#format1

bedGraph: this has the same first three columns as the BED file but also has a fourth column that could be anything such as sequencing coverage -- https://genome.ucsc.edu/goldenPath/help/bedgraph.html

Example BED file is shown below -- source: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

Chromosome  Start End
chr1        10    20
chr1        20    30
chr2        0     500

Example bedGraph is shown below -- source: https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html

Chromosome  Start   End     Depth
chr1        554304  554309  5
chr1        554309  554313  6
chr1        554313  554314  1
chr1        554315  554316  6
chr1        554316  554317  5
chr1        554317  554318  1
chr1        554318  554319  2
chr1        554319  554321  6
chr1        554321  554323  1
chr1        554323  554334  7

The application bedtools will be used to convert the hcc1395 BAM alignment outputs to bedGraph format. bedtools can be used for a range of tasks including compiling information on genomic intervals. The genomecov function of bedtools, which calculates depth over a genomic range with the following options will be used:

-ibam: prompts users to provide the name of the BAM input file(s).
-split: when applied to RNA sequencing data, does not count reads that map to introns (see this post on why we get reads that map to introns in RNA sequencing).
-bg: reports sequencing depth along a genomic interval rather than at each nucleotide position (Figure 1 shows the ways that bedtools can report coverage information - where some options report coverage along an interval, some report at each base position. bedtools also enables the fine tuning of coverage along splice junctions are reported using the split option).

To use bedtools on Biowulf, do:

module load bedtools

To generate bedGraph file for all of the hcc1395 BAM alignments in one go, the command construct below can be used.

cat: used to print the modified sample IDs from the hcc1395 study to the terminal (recall that these new sample IDs not have "hcc1395" at the beginning). The output is sent to parallel using | or pipe.
-j 6 in the parallel command tells it to perform 6 tasks simultaneously.
The bedtools genomecov construct is enclosed in double quotes with parallel:
The input BAM files are located in the folder hcc1395_hisat2 and {} is used as a place holder for the sample ID sent by cat. Thus, the path to the BAM files can be written as hcc1395_hisat2/{}.bam.
> will write the output to the hcc1395_hisat2 folder into a file with sample name sent by cat and .bg extension. Thus, the path to the output is hcc1395_hisat2/{}.bg.

cat hcc1395_sample_ids.txt | parallel -j 6 "bedtools genomecov -ibam hcc1395_hisat2/{}.bam -split -bg > hcc1395_hisat2/{}.bg"

Change into the hcc1395_hisat2 folder after the bedgraph files have been made and use the ls command to ensure that that .bg files are there.

cd hcc1395_hisat2

Next, use head to view the first 10 lines of hcc1395_normal_rep1.bg.

head hcc1395_normal_rep1.bg

The last column in the .bg file is the sequencing depth.

chr22   10564456    10564607    1
chr22   10617976    10618127    1
chr22   10684437    10684486    1
chr22   10684486    10684588    2
chr22   10684588    10684637    1
chr22   10712579    10712719    2
chr22   10712719    10712728    1
chr22   10839950    10840096    2
chr22   10865165    10865298    6
chr22   10940293    10940443    1

Change back to the /data/user/hcc1395_b4b.

cd /data/user/hcc1395_b4b

Step 2: Create an Index for the Genome

To view genomic alignment output using IGV, an index for the reference needs to be created. samtools can be used to create the index. Recall that reference genome is located in the references folder and saved as 22.fa.

module load samtools

samtools faidx references/22.fa

ls -1 references

22.1.ht2
22.2.ht2
22.3.ht2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2
22.fa
22.fa.fai
22.gtf
illumina_multiplex.fa

Step 3: Creating bigWig Files

The final step is to create the bigWig files using the application bedGraphToBigWig from UCSC tools.

First, load ucsc tools into the Biowulf working environment.

module load ucsc

All of the hcc1395 .bg files can be converted to bigWig (.bw extension) at once using the command construct below. The first argument in bedGraphToBigWig is the path to the .bg files, which reside in the folder hcc1395_hisat2 (thus the path in the command is hcc1395_hisat2/{}.bg where {} is a place holder for the sample ID sent by cat). The second argument is the path to the reference index (ie. 22.fa.fai residing in the folder references, thus the path as constructed from the hcc1395_b4b folder is references/22.fa.fai). The final argument is the path to the bigWig file outputs, which will be written to the folder hcc1395_hisat2 and will have .bw extension. The for the output from the hcc1395_b4b folder will be hcc1395_hisat2/{}.bw.

cat hcc1395_sample_ids.txt | parallel -j 6 "bedGraphToBigWig hcc1395_hisat2/{}.bg references/22.fa.fai hcc1395_hisat2/{}.bw"

List the contents of hcc1395_hisat2 to confirm that the .bw files are written.