05. Using bowtie2 to align sequence reads copy

This page uses content directly from the Biostar Handbook by Istvan Albert.

Always remember to activate the bioinfo environment when working on Biostar class material.

conda activate bioinfo

Retrieving a FASTA genome from NCBI/GenBank and creating an index file.

Go to your biostar_class directory.

cd biostar_class

Create a new directory named "refs" and go to that directory.

mkdir refs 
cd refs

Use the following command to retrieve data from NCBI.

efetch -db nuccore -format fasta -id AF086833 > AF0868333.fa

A few questions about this command line... 1. What database are we searching? 2. What format is the data we are retrieving? 3. What is the species of "AF086833"? How would you find out?

Now, check to see if the output has been created.

ls -h

samtools is a set of utilities for working with short sequencing reads in SAM/BAM/CRAM format. "faidx" creates an index file from FASTA (.fa) formatted data.

samtools faidx AF086833.fa

To view the output files...

ls
AF086833.fa  AF08633.fa.fai

Using bowtie2 to align DNA sequence files to genome.

The bowtie2-build indexer builds a Bowtie index from a set of DNA sequences ([ref]. "bowtie2-build" builds a Bowtie index from a set of DNA sequences. "bowtie2-build" outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.(http://gensoft.pasteur.fr/docs/bowtie2/2.0.0/#the-bowtie2-build-indexer).).

Command line usage:

bowtie2-build [options]* <reference_in> <bt2_base>

where is a comma-separated list of FASTA files containing the reference sequences to be aligned, or the sequences themselves (using -c option). and is the basename of the index files to write. These will be named $NAME.1.bt2, $NAME.2.bt2, $NAME.rev.1.bt2, $NAME.rev.2.bt2, etc. option (-f) specifies the reference input files. Bowtie2-build has many other options.

bowtie2-build --help

Specify the input and output PREFIX names on the command line (these might be the same). First is the reference file, second is the prefix name of the index (keep it the same or similar to the actual reference.)

bowtie2-build AF086833.fa AF086833

What did that do?

ls -lh

gives something like this (ignore default.profraw file it has nothing to do with bowtie or samtools)

total 18336
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.0M May  4 14:57 AF086833.1.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.6K May  4 14:57 AF086833.2.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users    17B May  4 14:57 AF086833.3.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.6K May  4 14:57 AF086833.4.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users    19K May  4 14:31 AF086833.fa
-rw-r--r--  1 stonelakeak  NIH\Domain Users    26B May  4 14:31 AF086833.fa.fai
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.0M May  4 14:57 AF086833.rev.1.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.6K May  4 14:57 AF086833.rev.2.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   905K May  4 14:57 default.profraw

Now that we have the reference indices, we can do the bowtie2 alignment. But first, we'll need some sequencing reads. Off to the SRA (Sequence Read Archive)! (NCBI)

fastq-dump -X 10000 --split-files SRR1972739

fastq-dump is part of the SRA toolkit, that allows programmatic access to data within SRA and converts it from SRA format (to FASTA, FASTQ, etc.). It connects to NCBI and downloads files via FTP (file transfer protocol).

In this command, the SRA file is "SRR1972739".

The option "-X 10000" prints the first 10000 spots (reads).

The option "--split files" produces "_1.fq" and "_2.fq" files representing paired-end data.

ls -l
total 28880
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.0M May  4 14:57 AF086833.1.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.6K May  4 14:57 AF086833.2.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users    17B May  4 14:57 AF086833.3.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.6K May  4 14:57 AF086833.4.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users    19K May  4 14:31 AF086833.fa
-rw-r--r--  1 stonelakeak  NIH\Domain Users    26B May  4 14:31 AF086833.fa.fai
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.0M May  4 14:57 AF086833.rev.1.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   4.6K May  4 14:57 AF086833.rev.2.bt2
-rw-r--r--  1 stonelakeak  NIH\Domain Users   2.6M May  4 15:02 SRR1972739_1.fastq
-rw-r--r--  1 stonelakeak  NIH\Domain Users   2.6M May  4 15:02 SRR1972739_2.fastq
-rw-r--r--  1 stonelakeak  NIH\Domain Users   905K May  4 14:57 default.profraw

Now we can do the bowtie alignment

bowtie2 -x AF086833 -1 SRR1972739_1.fastq -2 SRR1972739_2.fastq > sample.sam

What is this command line doing?

"-x option" specifies the index files (prefix)

"-1" calls the forward reads (_1.fastq)

"-2" calls the reverse reads (_2.fastq)

"> sample.sam" prints the output to a file named "sample.sam" in "sam" format

Prints the following output to the screen.

10000 reads; of these:
  10000 (100.00%) were paired; of these:
    5086 (50.86%) aligned concordantly 0 times
    4914 (49.14%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    5086 pairs aligned concordantly 0 times; of these:
      827 (16.26%) aligned discordantly 1 time
    ----
    4259 pairs aligned 0 times concordantly or discordantly; of these:
      8518 mates make up the pairs; of these:
        7463 (87.61%) aligned 0 times
        1055 (12.39%) aligned exactly 1 time
        0 (0.00%) aligned >1 times
62.69% overall alignment rate

Where "sample.sam" is the alignment file.

Convert sam to bam using samtools utilities. (bam is a binary format of sam and is much easier for the computers to work with)

"view" option views and converts SAM/BAM/CRAM files.

samtools view -S -b sample.sam > sample.bam

"-S" (input is in sam format)

"-b" (output to bam format)

"> sample.bam" prints the output to a file named "sample.bam"

The ".bam" file needs to be sorted so that the alignments occur in "genome order", ordered positionally based on their alignment coordinates on each chromosome.

We can use samtools "sort" to sort the sample.bam file and output to sample.sorted.bam

samtools sort sample.bam -o sample.sorted.bam

Here is another samtools utility, the "index" option, used to create the bam index file (.bai)

samtools index sample.sorted.bam
ls

Now that we have the .bam file, and the .bai index file, we can look at these reads in IGV. The .bai file does not need to be loaded into IGV, it just needs to be present in the same directory as the .bam file

Summary of what we did: 1. Downloaded the Ebola virus genome in FASTA format (.fa). 2. Created an index file (.fai) for the genome using samtools (faidx). 3. Used "bowtie2-build" to create bowtie index files for the Ebola virus genome. 4. Downloaded 10,000 SRA reads for the Ebola virus genome in FASTQ (.fq) format. 5. Use bowtie2 to align the genome FASTA (.fa) and SRA reads in FASTQ (.fq), output to SAM (.sam) file. 6. Convert SAM (.sam) to BAM (.bam), a binary version of SAM (.sam). 7. Sort the BAM (.bam) files into genome order. 8. Create a BAM (.bam) index file (.bai), using samtools index. 9. Load the BAM (.bam) file into IGV using the "Load File" option (sample.sorted.bam).

Which samtools commands did we use? * samtools faidx -> create index file from FASTA(.fa) file. * samtools view -> look at content of .bam file, and transform SAM (.sam) files into BAM (.bam) files. * samtools sort -> sort the BAM (.bam) files into genome order. * samtools index -> create index file from sorted BAM (.bam) file.

IGV -> Load File -> sample.sorted.bam

Screen Shot 2020-05-05 at 4 00 25 PM