Lesson 13 Practice

Objectives

In this lesson we learned how to align raw sequencing reads to reference and to process alignment results for downstream analysis. Here, we will test our knowledge by continuing with the Golden Snidget dataset.

Review of what we have done so far for the Golden Snidget dataset

Before we start, remember to change into the ~/biostar_class/snidget folder and then list the content to review what is available.

cd ~/biostar_class/snidget

ls -l QC

total 117384
drwxrwxr-x 1 joe joe      1188 Oct 31 23:29 QC
-rw-r--r-- 1 joe joe     57462 Oct 27 00:30 golden.genome.tar.gz
-rw-r--r-- 1 joe joe 120138017 Oct 27 00:30 golden.reads.tar.gz
drwxr-xr-x 1 joe joe       336 Oct 27 00:30 reads
drwxr-xr-x 1 joe joe        70 Oct 27 00:30 refs

In the ~/biostar_class/snidget folder, we have the tar.gz files for the Golden Snidget reference genome and the sequencing data. These were unpacked previously to give the refs and reads folder. The QC folder contains our FASTQC and MultiQC reports for the unaligned sequencing data.

ls -1

BORED_1_R1_fastqc.html
BORED_1_R1_fastqc.zip
BORED_1_R2_fastqc.html
BORED_1_R2_fastqc.zip
BORED_2_R1_fastqc.html
BORED_2_R1_fastqc.zip
BORED_2_R2_fastqc.html
BORED_2_R2_fastqc.zip
BORED_3_R1_fastqc.html
BORED_3_R1_fastqc.zip
BORED_3_R2_fastqc.html
BORED_3_R2_fastqc.zip
EXCITED_1_R1_fastqc.html
EXCITED_1_R1_fastqc.zip
EXCITED_1_R2_fastqc.html
EXCITED_1_R2_fastqc.zip
EXCITED_2_R1_fastqc.html
EXCITED_2_R1_fastqc.zip
EXCITED_2_R2_fastqc.html
EXCITED_2_R2_fastqc.zip
EXCITED_3_R1_fastqc.html
EXCITED_3_R1_fastqc.zip
EXCITED_3_R2_fastqc.html
EXCITED_3_R2_fastqc.zip
multiqc_report_snidget.html
multiqc_report_snidget_data

Aligning Golden Snidget sequencing - constructing genome index

Here, we are going to align the Golden Snidget sequencing files to it's genome. Recall that we are working with RNA sequencing data. Given HISAT2 and Bowtie2 as the options for aligners, which is more ideal? One of the aligners is splice aware the other is not.

Solution

We should use HISAT2 (the splice aware aligner)

For today, because we are working with alignment, let's create a folder called snidget_hisat2 to store the results for the HISAT2 alignment. Take a moment to create this folder.

Solution

mkdir snidget_hisat2

Next, we need to change into the refs directory. How do we do this?

Solution

cd refs

Besides the raw sequencing files and reference genome, what other information do we need to complete the alignment (hint: it has something to do with the reference genome) and how do we do this?

Solution

We need to index the reference genome

hisat2-build genome.fa genome

Constructing a text file with the Golden Snidget sample IDs

To help us align all of the FASTQ files in one go, we should create in the reads directory a file with the sample IDs names for the Golden Snidget.

First, change into the ~/biostar_class/snidget/reads directory.

cd ~/biostar_class/snidget/reads

Then, how do we use the parallel command to construct the text file with the Golden Snidget sample IDs? Also, save this file as ids.txt.

Solutions

parallel echo {1}_{2} ::: BORED EXCITED ::: 1 2 3 > ids.txt

Listing the contents of the reads directory, we see that we have the file ids.txt.

ls

BORED_1_R1.fq  BORED_2_R1.fq  BORED_3_R1.fq  EXCITED_1_R1.fq  EXCITED_2_R1.fq  EXCITED_3_R1.fq  ids.txt
BORED_1_R2.fq  BORED_2_R2.fq  BORED_3_R2.fq  EXCITED_1_R2.fq  EXCITED_2_R2.fq  EXCITED_3_R2.fq

Now, if we print out the contents of ids.txt using cat, then we see the sample IDs for the Golden Snidget dataset.

cat ids.txt

BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3

Alignment of Golden Snidget FASTQ files - constructing the HISAT2 command

Now that our HISAT2 indices have been built and the text file with the sample IDs has been generated, we can do the actual alignment.

First, change into the ~/biostar_class/snidget/snidget_hisat2 folder

cd ~/biostar_class/snidget/snidget_hisat2

We are going to align all off the FASTQ files for the Golden Snidget in one go. Also, remember as we construct our command to align the FASTQ files that

our reference genome is in the refs folder
the FASTQ files are in the reads folder
the text file (ids.txt) containing the sample IDs are in the reads folder

Remember to save the alignment status as a text file so we can view later (save this as sample-name_hisat2_summary.txt).

Solution

cat ../reads/ids.txt | parallel "hisat2 -x ../refs/genome -1 ../reads/{}_R1.fq -2 ../reads/{}_R2.fq -S {}.sam --summary-file {}_hisat2_summary.txt"

Now that the alignment is done, what are the overall alignment rates? Were there a lot of discordant alignments?

Solution

Overall alignment rate for each sample is 100%. For the most part the reads aligned concordantly.

Can we include the HISAT2 alignment statistics in a MultiQC report? Hint, change back into the ~/biostar_class/snidget directory and save this MultiQC report to the QC folder.

cd ~/biostar_class/snidget

Solution

Yes

multiqc --filename QC/multiqc_report_snidget_post_alignment .

Now listing, the contents of the QC folder we will see the MultiQC report that includes the post alignment statistics (multiqc_report_snidget_post_alignment.html).

ls QC

BORED_1_R1_fastqc.html  BORED_2_R1_fastqc.zip   BORED_3_R2_fastqc.html    EXCITED_1_R2_fastqc.zip   EXCITED_3_R1_fastqc.html     multiqc_report_snidget_data
BORED_1_R1_fastqc.zip   BORED_2_R2_fastqc.html  BORED_3_R2_fastqc.zip     EXCITED_2_R1_fastqc.html  EXCITED_3_R1_fastqc.zip      multiqc_report_snidget_post_alignment.html
BORED_1_R2_fastqc.html  BORED_2_R2_fastqc.zip   EXCITED_1_R1_fastqc.html  EXCITED_2_R1_fastqc.zip   EXCITED_3_R2_fastqc.html     multiqc_report_snidget_post_alignment_data
BORED_1_R2_fastqc.zip   BORED_3_R1_fastqc.html  EXCITED_1_R1_fastqc.zip   EXCITED_2_R2_fastqc.html  EXCITED_3_R2_fastqc.zip
BORED_2_R1_fastqc.html  BORED_3_R1_fastqc.zip   EXCITED_1_R2_fastqc.html  EXCITED_2_R2_fastqc.zip   multiqc_report_snidget.html

Copy multiqc_report_snidget_post_alignment.html to the ~/public directory to take a look.

cp QC/multiqc_report_snidget_post_alignment.html ~/public

Alignment of Golden Snidget FASTQ files - converting SAM files to BAM

Recall in the lesson that SAM files are human readable so we will need to convert these to sorted and indexed BAM files, which are machine readable for downstream steps in our analysis. How do we do this? Remember as we are working, we need to go back to the ~/biostar_class/snidget/snidget_hisat2 directory so change into this before starting this next excersie.

cd ~/biostar_class/snidget/snidget_hisat2

Solution

First, we sort the SAM file and convert it to BAM.

cat ../reads/ids.txt | parallel "samtools sort -o {}.bam {}.sam"

Second, we index the BAM file

cat ../reads/ids.txt | parallel "samtools index -b {}.bam {}.bam.bai"