19. Short Read Aligners copy

This page uses material directly from the Biostar Handbook by Istvan Albert.

Always remember to activate the bioinformatics environment.

conda activate bioinfo

To align sequences to a genome, we need (1) genome sequence and the (2) sequence reads.

Let's start by creating a new directory for our work and downloading sequence data for the Ebola genome.

mkdir -p refs
efetch -db nuccore -format fasta -id AF086833 > refs/AF086833.fa

Now that we have the genome data, let's build an index with bwa.

bwa index refs/AF086833.fa

We'll need some sequence data files to align to this genome. (Don't put them in the refs directory).

First, we can find data using the "PRJNA", or project number.

esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv

View the "runinfo.csv" file on your machine using MS Excel.

We'll pick one run from this file and download just 10K reads.

Remember to use "--split-files" to get both the forward and reverse reads in separate files.

fastq-dump -X 10000 --split-files SRR1972739

Don't worry if you get an error message while downloading this data, the "fastq-dump" will keep working until it gets all the data.

The bwa program can be run in either single-end or paired-end mode by specifying at the command line.

For single-end mode:

bwa mem refs/AF086833.fa SRR1972739_1.fastq > SRR1972739.sam

and for paired-end mode:

bwa mem refs/AF086833.fa refs/SRR1972739_1.fastq refs/SRR1972739_2.fastq > bwa.sam

Convert sam to bam.

samtools view -S -b bwa.sam > bwa.bam

Sort the bam file

samtools sort bwa.bam -o sorted_bwa.bam

and create the bam index files.

samtools index sorted_bwa.bam

Need .bam and .bam.bai files to see in IGV.

Summary of what we did:

Downloaded Ebola genome in FASTA format (.fa). (AF086833.fa)
Created an index file (.fai) for the Ebola genome using bwa index.
Downloaded RNA-Seq reads in FASTQ (.fastq) format. (SRR1972739_1.fastq, SRR1972739_2.fastq)
Use bwa mem to align the Ebola genome (.fa) and RNA-Seq reads (.fastq), output to SAM (.sam) file.
Convert SAM (.sam) to BAM (.bam), a binary version of SAM (.sam).
Sort the BAM (.bam) files into genome order.
Create a BAM (.bam) index file (.bai), using samtools index.
Load the Ebola genome (AF086833.fa) into IGV (Genomes - Load Genome From File).
Load the BAM (.bam) file into IGV using the "Load File" option (sorted_output.bam).

To use bowtie2 for this analysis, please see Using bowtie2 to align sequence reads