05. Using bowtie2 to align sequence reads copy
This page uses content directly from the Biostar Handbook by Istvan Albert.
Always remember to activate the bioinfo environment when working on Biostar class material.
conda activate bioinfo
Retrieving a FASTA genome from NCBI/GenBank and creating an index file.
Go to your biostar_class directory.
cd biostar_class
Create a new directory named "refs" and go to that directory.
mkdir refs
cd refs
efetch -db nuccore -format fasta -id AF086833 > AF0868333.fa
Now, check to see if the output has been created.
ls -h
samtools is a set of utilities for working with short sequencing reads in SAM/BAM/CRAM format. "faidx" creates an index file from FASTA (.fa) formatted data.
samtools faidx AF086833.fa
ls
AF086833.fa AF08633.fa.fai
Using bowtie2 to align DNA sequence files to genome.
The bowtie2-build indexer builds a Bowtie index from a set of DNA sequences ([ref]. "bowtie2-build" builds a Bowtie index from a set of DNA sequences. "bowtie2-build" outputs a set of 6 files with suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files together constitute the index: they are all that is needed to align reads to that reference. The original sequence FASTA files are no longer used by Bowtie 2 once the index is built.(http://gensoft.pasteur.fr/docs/bowtie2/2.0.0/#the-bowtie2-build-indexer).).
Command line usage:
bowtie2-build [options]* <reference_in> <bt2_base>
bowtie2-build --help
bowtie2-build AF086833.fa AF086833
ls -lh
total 18336
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.0M May 4 14:57 AF086833.1.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.6K May 4 14:57 AF086833.2.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 17B May 4 14:57 AF086833.3.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.6K May 4 14:57 AF086833.4.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 19K May 4 14:31 AF086833.fa
-rw-r--r-- 1 stonelakeak NIH\Domain Users 26B May 4 14:31 AF086833.fa.fai
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.0M May 4 14:57 AF086833.rev.1.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.6K May 4 14:57 AF086833.rev.2.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 905K May 4 14:57 default.profraw
Now that we have the reference indices, we can do the bowtie2 alignment. But first, we'll need some sequencing reads. Off to the SRA (Sequence Read Archive)! (NCBI)
fastq-dump -X 10000 --split-files SRR1972739
In this command, the SRA file is "SRR1972739".
The option "-X 10000" prints the first 10000 spots (reads).
The option "--split files" produces "_1.fq" and "_2.fq" files representing paired-end data.
ls -l
total 28880
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.0M May 4 14:57 AF086833.1.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.6K May 4 14:57 AF086833.2.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 17B May 4 14:57 AF086833.3.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.6K May 4 14:57 AF086833.4.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 19K May 4 14:31 AF086833.fa
-rw-r--r-- 1 stonelakeak NIH\Domain Users 26B May 4 14:31 AF086833.fa.fai
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.0M May 4 14:57 AF086833.rev.1.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 4.6K May 4 14:57 AF086833.rev.2.bt2
-rw-r--r-- 1 stonelakeak NIH\Domain Users 2.6M May 4 15:02 SRR1972739_1.fastq
-rw-r--r-- 1 stonelakeak NIH\Domain Users 2.6M May 4 15:02 SRR1972739_2.fastq
-rw-r--r-- 1 stonelakeak NIH\Domain Users 905K May 4 14:57 default.profraw
bowtie2 -x AF086833 -1 SRR1972739_1.fastq -2 SRR1972739_2.fastq > sample.sam
"-x option" specifies the index files (prefix)
"-1" calls the forward reads (_1.fastq)
"-2" calls the reverse reads (_2.fastq)
"> sample.sam" prints the output to a file named "sample.sam" in "sam" format
Prints the following output to the screen.
10000 reads; of these:
10000 (100.00%) were paired; of these:
5086 (50.86%) aligned concordantly 0 times
4914 (49.14%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
5086 pairs aligned concordantly 0 times; of these:
827 (16.26%) aligned discordantly 1 time
----
4259 pairs aligned 0 times concordantly or discordantly; of these:
8518 mates make up the pairs; of these:
7463 (87.61%) aligned 0 times
1055 (12.39%) aligned exactly 1 time
0 (0.00%) aligned >1 times
62.69% overall alignment rate
Convert sam to bam using samtools utilities. (bam is a binary format of sam and is much easier for the computers to work with)
"view" option views and converts SAM/BAM/CRAM files.
samtools view -S -b sample.sam > sample.bam
"-b" (output to bam format)
"> sample.bam" prints the output to a file named "sample.bam"
The ".bam" file needs to be sorted so that the alignments occur in "genome order", ordered positionally based on their alignment coordinates on each chromosome.
We can use samtools "sort" to sort the sample.bam file and output to sample.sorted.bam
samtools sort sample.bam -o sample.sorted.bam
samtools index sample.sorted.bam
ls
Summary of what we did: 1. Downloaded the Ebola virus genome in FASTA format (.fa). 2. Created an index file (.fai) for the genome using samtools (faidx). 3. Used "bowtie2-build" to create bowtie index files for the Ebola virus genome. 4. Downloaded 10,000 SRA reads for the Ebola virus genome in FASTQ (.fq) format. 5. Use bowtie2 to align the genome FASTA (.fa) and SRA reads in FASTQ (.fq), output to SAM (.sam) file. 6. Convert SAM (.sam) to BAM (.bam), a binary version of SAM (.sam). 7. Sort the BAM (.bam) files into genome order. 8. Create a BAM (.bam) index file (.bai), using samtools index. 9. Load the BAM (.bam) file into IGV using the "Load File" option (sample.sorted.bam).
Which samtools commands did we use? * samtools faidx -> create index file from FASTA(.fa) file. * samtools view -> look at content of .bam file, and transform SAM (.sam) files into BAM (.bam) files. * samtools sort -> sort the BAM (.bam) files into genome order. * samtools index -> create index file from sorted BAM (.bam) file.
IGV -> Load File -> sample.sorted.bam