Lesson 13 Practice
Objectives
In this lesson we learned how to align raw sequencing reads to reference and to process alignment results for downstream analysis. Here, we will test our knowledge by continuing with the Golden Snidget dataset.
Review of what we have done so far for the Golden Snidget dataset
Before we start, remember to change into the ~/biostar_class/snidget folder and then list the content to review what is available.
cd ~/biostar_class/snidget
ls -l QC
total 117384
drwxrwxr-x 1 joe joe 1188 Oct 31 23:29 QC
-rw-r--r-- 1 joe joe 57462 Oct 27 00:30 golden.genome.tar.gz
-rw-r--r-- 1 joe joe 120138017 Oct 27 00:30 golden.reads.tar.gz
drwxr-xr-x 1 joe joe 336 Oct 27 00:30 reads
drwxr-xr-x 1 joe joe 70 Oct 27 00:30 refs
In the ~/biostar_class/snidget folder, we have the tar.gz files for the Golden Snidget reference genome and the sequencing data. These were unpacked previously to give the refs and reads folder. The QC folder contains our FASTQC and MultiQC reports for the unaligned sequencing data.
ls -1
BORED_1_R1_fastqc.html
BORED_1_R1_fastqc.zip
BORED_1_R2_fastqc.html
BORED_1_R2_fastqc.zip
BORED_2_R1_fastqc.html
BORED_2_R1_fastqc.zip
BORED_2_R2_fastqc.html
BORED_2_R2_fastqc.zip
BORED_3_R1_fastqc.html
BORED_3_R1_fastqc.zip
BORED_3_R2_fastqc.html
BORED_3_R2_fastqc.zip
EXCITED_1_R1_fastqc.html
EXCITED_1_R1_fastqc.zip
EXCITED_1_R2_fastqc.html
EXCITED_1_R2_fastqc.zip
EXCITED_2_R1_fastqc.html
EXCITED_2_R1_fastqc.zip
EXCITED_2_R2_fastqc.html
EXCITED_2_R2_fastqc.zip
EXCITED_3_R1_fastqc.html
EXCITED_3_R1_fastqc.zip
EXCITED_3_R2_fastqc.html
EXCITED_3_R2_fastqc.zip
multiqc_report_snidget.html
multiqc_report_snidget_data
Aligning Golden Snidget sequencing - constructing genome index
Here, we are going to align the Golden Snidget sequencing files to it's genome. Recall that we are working with RNA sequencing data. Given HISAT2 and Bowtie2 as the options for aligners, which is more ideal? One of the aligners is splice aware the other is not.
Solution
We should use HISAT2 (the splice aware aligner)
For today, because we are working with alignment, let's create a folder called snidget_hisat2 to store the results for the HISAT2 alignment. Take a moment to create this folder.
Solution
mkdir snidget_hisat2
Next, we need to change into the refs directory. How do we do this?
Solution
cd refs
Besides the raw sequencing files and reference genome, what other information do we need to complete the alignment (hint: it has something to do with the reference genome) and how do we do this?
Solution
We need to index the reference genome
hisat2-build genome.fa genome
Constructing a text file with the Golden Snidget sample IDs
To help us align all of the FASTQ files in one go, we should create in the reads directory a file with the sample IDs names for the Golden Snidget.
First, change into the ~/biostar_class/snidget/reads directory.
cd ~/biostar_class/snidget/reads
Then, how do we use the parallel command to construct the text file with the Golden Snidget sample IDs? Also, save this file as ids.txt.
Solutions
parallel echo {1}_{2} ::: BORED EXCITED ::: 1 2 3 > ids.txt
Listing the contents of the reads directory, we see that we have the file ids.txt.
ls
BORED_1_R1.fq BORED_2_R1.fq BORED_3_R1.fq EXCITED_1_R1.fq EXCITED_2_R1.fq EXCITED_3_R1.fq ids.txt
BORED_1_R2.fq BORED_2_R2.fq BORED_3_R2.fq EXCITED_1_R2.fq EXCITED_2_R2.fq EXCITED_3_R2.fq
Now, if we print out the contents of ids.txt using cat, then we see the sample IDs for the Golden Snidget dataset.
cat ids.txt
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
Alignment of Golden Snidget FASTQ files - constructing the HISAT2 command
Now that our HISAT2 indices have been built and the text file with the sample IDs has been generated, we can do the actual alignment.
First, change into the ~/biostar_class/snidget/snidget_hisat2 folder
cd ~/biostar_class/snidget/snidget_hisat2
We are going to align all off the FASTQ files for the Golden Snidget in one go. Also, remember as we construct our command to align the FASTQ files that
- our reference genome is in the refs folder
- the FASTQ files are in the reads folder
- the text file (ids.txt) containing the sample IDs are in the reads folder
Remember to save the alignment status as a text file so we can view later (save this as sample-name_hisat2_summary.txt).
Solution
cat ../reads/ids.txt | parallel "hisat2 -x ../refs/genome -1 ../reads/{}_R1.fq -2 ../reads/{}_R2.fq -S {}.sam --summary-file {}_hisat2_summary.txt"
Now that the alignment is done, what are the overall alignment rates? Were there a lot of discordant alignments?
Solution
Overall alignment rate for each sample is 100%. For the most part the reads aligned concordantly.
Can we include the HISAT2 alignment statistics in a MultiQC report? Hint, change back into the ~/biostar_class/snidget directory and save this MultiQC report to the QC folder.
cd ~/biostar_class/snidget
Solution
Yes
multiqc --filename QC/multiqc_report_snidget_post_alignment .
Now listing, the contents of the QC folder we will see the MultiQC report that includes the post alignment statistics (multiqc_report_snidget_post_alignment.html).
ls QC
BORED_1_R1_fastqc.html BORED_2_R1_fastqc.zip BORED_3_R2_fastqc.html EXCITED_1_R2_fastqc.zip EXCITED_3_R1_fastqc.html multiqc_report_snidget_data
BORED_1_R1_fastqc.zip BORED_2_R2_fastqc.html BORED_3_R2_fastqc.zip EXCITED_2_R1_fastqc.html EXCITED_3_R1_fastqc.zip multiqc_report_snidget_post_alignment.html
BORED_1_R2_fastqc.html BORED_2_R2_fastqc.zip EXCITED_1_R1_fastqc.html EXCITED_2_R1_fastqc.zip EXCITED_3_R2_fastqc.html multiqc_report_snidget_post_alignment_data
BORED_1_R2_fastqc.zip BORED_3_R1_fastqc.html EXCITED_1_R1_fastqc.zip EXCITED_2_R2_fastqc.html EXCITED_3_R2_fastqc.zip
BORED_2_R1_fastqc.html BORED_3_R1_fastqc.zip EXCITED_1_R2_fastqc.html EXCITED_2_R2_fastqc.zip multiqc_report_snidget.html
Copy multiqc_report_snidget_post_alignment.html to the ~/public directory to take a look.
cp QC/multiqc_report_snidget_post_alignment.html ~/public
Alignment of Golden Snidget FASTQ files - converting SAM files to BAM
Recall in the lesson that SAM files are human readable so we will need to convert these to sorted and indexed BAM files, which are machine readable for downstream steps in our analysis. How do we do this? Remember as we are working, we need to go back to the ~/biostar_class/snidget/snidget_hisat2 directory so change into this before starting this next excersie.
cd ~/biostar_class/snidget/snidget_hisat2
Solution
First, we sort the SAM file and convert it to BAM.
cat ../reads/ids.txt | parallel "samtools sort -o {}.bam {}.sam"
Second, we index the BAM file
cat ../reads/ids.txt | parallel "samtools index -b {}.bam {}.bam.bai"