Lesson 10 Practice
Objectives
In this lesson, we introduced the structure of FASTQ files and learned to assess the quality of raw sequencing data using FASTQC. Here, we will practice what we learned using the Golden Snidget dataset.
Where is my data?
Recall that the Golden Snidget data resides in the ~/biostar_class/snidget folder. Can you change into that folder and find where the sequencing reads are (i.e., in which folder they are located)?
Solution
cd ~/biostar_class/snidget
The sequencing reads are in the reads folder.
cd reads
How many sequencing (fq) files do we have?
Solution
ls
12
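Rather than counting the listing by eye, the count can be computed. The following is a minimal sketch, assuming a hypothetical scratch directory with made-up file names so that it runs anywhere; demo_dir and n_fq are illustrative names, not part of the lesson.

```shell
# Hypothetical scratch directory standing in for the reads folder
demo_dir=$(mktemp -d)
touch "$demo_dir"/TOY_1_R1.fq "$demo_dir"/TOY_1_R2.fq "$demo_dir"/README.txt

# ls prints one name per line when piped, so wc -l counts the matching files.
# The *.fq pattern leaves out README.txt.
n_fq=$(ls "$demo_dir"/*.fq | wc -l)
echo "Number of fq files: $n_fq"

# Clean up the scratch directory
rm -r "$demo_dir"
```

Inside the reads folder itself, the equivalent one-liner would be ls *.fq | wc -l.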
From the names of the FASTQ (fq) files, are these from paired or single end sequencing?
Solution
Paired
Can you find the first sequencing read in the file BORED_1_R1.fq? If you can, can you identify the sequencing header line and the quality score line?
Solution
head -4 BORED_1_R1.fq
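To make the roles of the four lines explicit, the sketch below labels each line of the first record. It assumes the standard four-line FASTQ layout and uses a hypothetical one-read file (made-up read name, bases, and qualities) in place of BORED_1_R1.fq.

```shell
# Hypothetical one-read FASTQ file so the sketch runs without the course data
demo_fq=$(mktemp)
printf '%s\n' '@READ_1' 'GATTACA' '+' 'IIIIHHG' > "$demo_fq"

# Label each of the first four lines by its role in the record
labeled=$(head -4 "$demo_fq" | awk '
  NR == 1 { print "header:   " $0 }
  NR == 2 { print "sequence: " $0 }
  NR == 3 { print "plus:     " $0 }
  NR == 4 { print "quality:  " $0 }')
echo "$labeled"

# Clean up the temporary file
rm "$demo_fq"
```

The header is the line that starts with @, and the quality score line is the fourth line, which has one character per base of the sequence line.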
From what you know about the structure of FASTQ files, how many reads are in BORED_1_R1.fq? There are two ways you can find out.
Solution
grep @HWI BORED_1_R1.fq | wc -l
112193
We can also use seqkit stats:
seqkit stats BORED_1_R1.fq
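A further cross-check, not part of the solution above, is to divide the line count by four. This is a sketch that assumes every record occupies exactly four lines (no wrapped sequences); the toy file and the variable names are hypothetical. It also shows why the grep pattern includes the instrument prefix (@HWI) and not just @: a quality line may itself begin with @.

```shell
# Hypothetical two-read FASTQ file; the second quality line starts with '@'
demo_fq=$(mktemp)
printf '%s\n' \
  '@READ_1' 'GATTACA' '+' 'IIIIHHG' \
  '@READ_2' 'CCGGTTA' '+' '@HHGGFF' > "$demo_fq"

# Reads = total lines / 4 (valid only for four-line records)
n_lines=$(wc -l < "$demo_fq")
n_reads=$(( n_lines / 4 ))
echo "reads by line count: $n_reads"

# Counting every line that starts with '@' overcounts here,
# because the quality line '@HHGGFF' also matches
n_at=$(grep -c '^@' "$demo_fq")
echo "lines starting with @: $n_at"

# Clean up the temporary file
rm "$demo_fq"
```

On the course data, the same idea would be echo $(( $(wc -l < BORED_1_R1.fq) / 4 )).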
What can we do to get stats such as the number of reads and read length for all of the Golden Snidget FASTQ files?
Solution
seqkit stats BORED_*.fq EXCITED_*.fq
file             format  type  num_seqs     sum_len  min_len  avg_len  max_len
BORED_1_R1.fq    FASTQ   DNA    112,193  11,219,300      100      100      100
BORED_1_R2.fq    FASTQ   DNA    112,193  11,219,300      100      100      100
BORED_2_R1.fq    FASTQ   DNA    137,581  13,758,100      100      100      100
BORED_2_R2.fq    FASTQ   DNA    137,581  13,758,100      100      100      100
BORED_3_R1.fq    FASTQ   DNA    123,093  12,309,300      100      100      100
BORED_3_R2.fq    FASTQ   DNA    123,093  12,309,300      100      100      100
EXCITED_1_R1.fq  FASTQ   DNA    237,018  23,701,800      100      100      100
EXCITED_1_R2.fq  FASTQ   DNA    237,018  23,701,800      100      100      100
EXCITED_2_R1.fq  FASTQ   DNA    158,009  15,800,900      100      100      100
EXCITED_2_R2.fq  FASTQ   DNA    158,009  15,800,900      100      100      100
EXCITED_3_R1.fq  FASTQ   DNA    196,673  19,667,300      100      100      100
EXCITED_3_R2.fq  FASTQ   DNA    196,673  19,667,300      100      100      100
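In the table, each R1 file has the same number of reads as its R2 mate, as expected for paired-end data. The sketch below shows one way to check this automatically. It assumes four-line records and the _R1.fq / _R2.fq naming used here, and it runs on a hypothetical toy pair (TOY_1) so that it does not depend on the course data; mismatches is an illustrative counter.

```shell
# Hypothetical scratch directory with one toy read pair (made-up names and bases)
demo_dir=$(mktemp -d)
printf '%s\n' '@READ_1/1' 'GATTACA' '+' 'IIIIHHG' > "$demo_dir/TOY_1_R1.fq"
printf '%s\n' '@READ_1/2' 'TGTAATC' '+' 'IIIIHHG' > "$demo_dir/TOY_1_R2.fq"

mismatches=0
for r1 in "$demo_dir"/*_R1.fq; do
  r2="${r1%_R1.fq}_R2.fq"            # mate file: same name, R2 instead of R1
  n1=$(( $(wc -l < "$r1") / 4 ))     # reads = lines / 4 for four-line records
  n2=$(( $(wc -l < "$r2") / 4 ))
  if [ "$n1" -eq "$n2" ]; then
    status=OK
  else
    status=MISMATCH
    mismatches=$(( mismatches + 1 ))
  fi
  echo "$(basename "$r1" _R1.fq): R1=$n1 R2=$n2 $status"
done

# Clean up the scratch directory
rm -r "$demo_dir"
```

Pointing the loop at the reads folder (for r1 in *_R1.fq) would apply the same check to every BORED and EXCITED sample.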
How do we visualize quality metrics for the Golden Snidget sequencing reads?
Solution
fastqc BORED_*.fq EXCITED_*.fq
Look at the quality check report for BORED_1_R1.fq, but first copy it to the ~/public folder.
Solution
cp BORED_1_R1_fastqc.html ~/public
- How many modules passed QC?
- Do the total number of sequences and the sequence length match those obtained from seqkit stats?
- How is the quality of the sequencing reads?
- What are the overrepresented sequences? BLAST one of the sequences to find out.
- Is there adapter contamination?
Solution
FASTQC results for BORED_1_R1.fq