19. Short Read Aligners copy

This page uses material directly from the Biostar Handbook by Istvan Albert.

Always remember to activate the bioinformatics environment.

conda activate bioinfo

To align sequences to a genome, we need (1) genome sequence and the (2) sequence reads.

Let's start by creating a new directory for our work and downloading sequence data for the Ebola genome.

mkdir -p refs
efetch -db nuccore -format fasta -id AF086833 > refs/AF086833.fa
Now that we have the genome data, let's build an index with bwa.
bwa index refs/AF086833.fa

We'll need some sequence data files to align to this genome. (Don't put them in the refs directory).

First, we can find data using the "PRJNA", or project number.

esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
View the "runinfo.csv" file on your machine using MS Excel.

We'll pick one run from this file and download just 10K reads.

Remember to use "--split-files" to get both the forward and reverse reads in separate files.

fastq-dump -X 10000 --split-files SRR1972739
Don't worry if you get an error message while downloading this data, the "fastq-dump" will keep working until it gets all the data.

The bwa program can be run in either single-end or paired-end mode by specifying at the command line.

For single-end mode:

bwa mem refs/AF086833.fa SRR1972739_1.fastq > SRR1972739.sam
and for paired-end mode:
bwa mem refs/AF086833.fa refs/SRR1972739_1.fastq refs/SRR1972739_2.fastq > bwa.sam
Convert sam to bam.

samtools view -S -b bwa.sam > bwa.bam
Sort the bam file
samtools sort bwa.bam -o sorted_bwa.bam
and create the bam index files.
samtools index sorted_bwa.bam
Need .bam and .bam.bai files to see in IGV.

Summary of what we did:

  1. Downloaded Ebola genome in FASTA format (.fa). (AF086833.fa)
  2. Created an index file (.fai) for the Ebola genome using bwa index.
  3. Downloaded RNA-Seq reads in FASTQ (.fastq) format. (SRR1972739_1.fastq, SRR1972739_2.fastq)
  4. Use bwa mem to align the Ebola genome (.fa) and RNA-Seq reads (.fastq), output to SAM (.sam) file.
  5. Convert SAM (.sam) to BAM (.bam), a binary version of SAM (.sam).
  6. Sort the BAM (.bam) files into genome order.
  7. Create a BAM (.bam) index file (.bai), using samtools index.
  8. Load the Ebola genome (AF086833.fa) into IGV (Genomes - Load Genome From File).
  9. Load the BAM (.bam) file into IGV using the "Load File" option (sorted_output.bam).

To use bowtie2 for this analysis, please see Using bowtie2 to align sequence reads