Lesson 10 Practice

Objectives

In this lesson, we introduced the structure of the FASTQ file and learned to assess the quality of raw sequencing data using FASTQC. Here, we will practice what we learned using the Golden Snidget dataset.

Where is my data?

Recall that the Golden Snidget data resides in the ~/biostar_class/snidget folder. Can you change into that folder and find which subfolder holds the sequencing reads?

Solution

cd ~/biostar_class/snidget

The sequencing reads are in the reads folder.

cd reads

How many sequencing (fq) files do we have?

Solution

ls
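Rather than counting the names by eye, the listing can be piped to wc -l. The sketch below is only an illustration: it builds a scratch folder with made-up file names (SAMPLE_1_R1.fq and so on are hypothetical, not part of the Golden Snidget data) so that it runs anywhere.

```shell
# Toy setup: a scratch folder with made-up, empty files standing in for the reads folder.
demo_dir=$(mktemp -d)
touch "$demo_dir"/SAMPLE_1_R1.fq "$demo_dir"/SAMPLE_1_R2.fq "$demo_dir"/notes.txt

# List only the names ending in .fq and count them (one name per line).
n_fq=$(ls "$demo_dir"/*.fq | wc -l | tr -d ' ')
echo "Number of fq files: $n_fq"

# Remove the scratch folder.
rm -rf "$demo_dir"
```

Inside the reads folder, the equivalent command would be ls *.fq | wc -l.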

From the names of the FASTQ (fq) files, are these from paired or single end sequencing?

Solution

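Paired. Each sample has both an R1 file and an R2 file (for example, BORED_1_R1.fq and BORED_1_R2.fq), which hold read 1 and read 2 of each pair.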
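The naming convention can also be checked with a short loop that looks for the R2 mate of every R1 file. This is a minimal sketch on made-up, empty files (the SAMPLE_* names are hypothetical), and it assumes the _R1.fq / _R2.fq suffix pattern seen in this dataset.

```shell
# Toy setup: SAMPLE_2 deliberately lacks its R2 mate.
demo_dir=$(mktemp -d)
touch "$demo_dir"/SAMPLE_1_R1.fq "$demo_dir"/SAMPLE_1_R2.fq "$demo_dir"/SAMPLE_2_R1.fq

missing=0
for r1 in "$demo_dir"/*_R1.fq; do
    # Derive the expected mate name by swapping the _R1.fq suffix for _R2.fq.
    r2="${r1%_R1.fq}_R2.fq"
    if [ -e "$r2" ]; then
        echo "paired:  $(basename "$r1") <-> $(basename "$r2")"
    else
        echo "no mate: $(basename "$r1")"
        missing=$(( missing + 1 ))
    fi
done
echo "R1 files without a mate: $missing"

# Remove the scratch folder.
rm -rf "$demo_dir"
```

In the reads folder, replacing "$demo_dir"/*_R1.fq with *_R1.fq would run the same check on the real files.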

Can you find the first sequencing read in the file BORED_1_R1.fq? If you can, can you identify the sequencing header line and the quality score line?

Solution

head -4 BORED_1_R1.fq
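Each FASTQ record spans four lines: the header (starting with @), the sequence, a separator line starting with +, and the quality scores. The sketch below labels those four lines for a made-up record (the file content is hypothetical, not taken from BORED_1_R1.fq).

```shell
# Toy setup: a made-up single-record FASTQ file.
demo_fq=$(mktemp)
printf '%s\n' '@read1 toy header' 'ACGTACGTAC' '+' 'IIIIIHHHGG' > "$demo_fq"

# Print the first record with a label in front of each of its four lines.
labeled=$(head -4 "$demo_fq" | awk '
    NR == 1 { print "header:   " $0 }
    NR == 2 { print "sequence: " $0 }
    NR == 3 { print "plus:     " $0 }
    NR == 4 { print "quality:  " $0 }')
echo "$labeled"

# Remove the scratch file.
rm -f "$demo_fq"
```

The same awk labels can be applied to the output of head -4 BORED_1_R1.fq.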

From what you know about the structure of FASTQ files, how many reads are in BORED_1_R1.fq? There are two ways you can find out.

Solution

grep "^@HWI" BORED_1_R1.fq | wc -l

We can also use seqkit stats:

seqkit stats BORED_1_R1.fq
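A third check follows directly from the file structure: if every record occupies exactly four lines (an assumption that holds for unwrapped FASTQ files), the read count is the line count divided by four. The sketch below uses a made-up two-read file to show why this can be safer than counting lines that begin with @, since @ is also a valid quality character and can open a quality line.

```shell
# Toy setup: two made-up reads; the second quality line happens to start with '@'.
demo_fq=$(mktemp)
printf '%s\n' \
    '@read1' 'ACGT' '+' 'IIII' \
    '@read2' 'TTGA' '+' '@III' > "$demo_fq"

# Count reads as (number of lines) / 4.
n_lines=$(wc -l < "$demo_fq" | tr -d ' ')
n_reads=$(( n_lines / 4 ))

# Count lines starting with '@'; this also catches the second quality line.
n_at=$(grep -c '^@' "$demo_fq")

echo "reads by line count:   $n_reads"
echo "lines starting with @: $n_at"

# Remove the scratch file.
rm -f "$demo_fq"
```

Matching on a longer header prefix such as @HWI, as in the grep solution, makes a false match on a quality line much less likely.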

What can we do to get stats such as the number of reads and read length for all of the Golden Snidget FASTQ files?

Solution

seqkit stats BORED_*.fq EXCITED_*.fq

file             format  type  num_seqs     sum_len  min_len  avg_len  max_len
BORED_1_R1.fq    FASTQ   DNA    112,193  11,219,300      100      100      100
BORED_1_R2.fq    FASTQ   DNA    112,193  11,219,300      100      100      100
BORED_2_R1.fq    FASTQ   DNA    137,581  13,758,100      100      100      100
BORED_2_R2.fq    FASTQ   DNA    137,581  13,758,100      100      100      100
BORED_3_R1.fq    FASTQ   DNA    123,093  12,309,300      100      100      100
BORED_3_R2.fq    FASTQ   DNA    123,093  12,309,300      100      100      100
EXCITED_1_R1.fq  FASTQ   DNA    237,018  23,701,800      100      100      100
EXCITED_1_R2.fq  FASTQ   DNA    237,018  23,701,800      100      100      100
EXCITED_2_R1.fq  FASTQ   DNA    158,009  15,800,900      100      100      100
EXCITED_2_R2.fq  FASTQ   DNA    158,009  15,800,900      100      100      100
EXCITED_3_R1.fq  FASTQ   DNA    196,673  19,667,300      100      100      100
EXCITED_3_R2.fq  FASTQ   DNA    196,673  19,667,300      100      100      100
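The table can be sanity-checked with a little arithmetic: because min_len, avg_len and max_len are all 100, sum_len should equal num_seqs multiplied by 100. A quick sketch for the BORED_1_R1.fq row, with the values typed in by hand from the table:

```shell
# Values copied from the BORED_1_R1.fq row of the seqkit stats table.
num_seqs=112193
read_len=100

# Expected total number of bases if every read has the same length.
sum_len=$(( num_seqs * read_len ))
echo "expected sum_len: $sum_len"
```

The result can be compared with the sum_len column for that row. Note also that each R1 file reports the same num_seqs as its R2 mate, as expected for paired-end data.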

How do we visualize quality metrics for the Golden Snidget sequencing reads?

Solution

fastqc BORED_*.fq EXCITED_*.fq

Look at the quality check report for BORED_1_R1.fq, but first copy it to the ~/public folder.

Solution

cp BORED_1_R1_fastqc.html ~/public

How many modules passed QC?
Do the total number of sequences and the sequence length match those obtained from seqkit stats?
How is the quality of the sequencing reads?
What are the overrepresented sequences? BLAST one of the sequences to find out.
Is there adapter contamination?

Solution

FASTQC results for BORED_1_R1.fq