08. Assaying sequence quality with FASTQC
This page uses content directly from the Biostar Handbook by Istvan Albert.
Learn:
- FASTQC for assaying quality of sequence reads
- MultiQC for combining multiple FASTQC reports into one report
- Trimmomatic for removing sequence data based on low quality scores
Always start by activating the bioinformatics environment.
conda activate bioinfo
cd biostar_class
mkdir sequence
curl http://data.biostarhandbook.com/data/sequencing-platform-data.tar.gz --output sequencing-platform-data.tar.gz
tar xzvf sequencing-platform-data.tar.gz
The v flag makes tar print each file name as it is extracted:
x illumina.fq
x iontorrent.fq
x pacbio.fq
x minion.fq
- FASTQC is a quality control tool for high throughput sequence data.
- Developed by Babraham Institute Bioinformatics.
- See the tutorial here.
Now we can run the FASTQC tool:
fastqc illumina.fq
Started analysis of illumina.fq
Approx 20% complete for illumina.fq
Approx 40% complete for illumina.fq
Approx 60% complete for illumina.fq
Approx 80% complete for illumina.fq
Approx 100% complete for illumina.fq
Analysis complete for illumina.fq
First we see the summary info.
Next, we can see the "Per base sequence quality".
This chart summarizes the base quality at each read position across all reads.
The vertical axis shows the FASTQ quality scores, which represent error probabilities:
- 10 corresponds to 10% error (1/10),
- 20 corresponds to 1% error (1/100),
- 30 corresponds to 0.1% error (1/1000), and
- 40 corresponds to 0.01% error (1/10,000).
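The list above follows the relation P = 10^(-Q/10) between a quality score Q and an error probability P. A minimal sketch of that conversion, where phred_to_error is a hypothetical helper name and not part of FASTQC:

```shell
# Convert a quality score Q into an error probability P = 10^(-Q/10).
# phred_to_error is a hypothetical helper name used for illustration only.
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Print the probabilities for the four scores listed above.
for q in 10 20 30 40; do
    printf 'Q%s -> %s\n' "$q" "$(phred_to_error "$q")"
done
```

The loop should report 0.1, 0.01, 0.001 and 0.0001, matching the four bullet points.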
The three colored bands (green, yellow, red) illustrate the typical labels assigned to these scores:
- reliable (30-40, green)
- less reliable (20-30, yellow)
- error-prone (1-20, red)
The yellow boxes span the inter-quartile range (25th to 75th percentile), so they contain the middle 50% of the data. The whiskers mark the 10th and 90th percentiles.
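To get a feel for what the chart summarizes, the mean quality at each read position can be computed directly from the FASTQ file. This is only a rough sketch: it reports the mean rather than the quartiles, it assumes four-line FASTQ records with Phred+33 encoded qualities, and per_position_mean_quality is a hypothetical helper name.

```shell
# Mean quality score at each read position.
# Assumptions: 4-line FASTQ records, Phred+33 encoding.
# per_position_mean_quality is a hypothetical helper name.
per_position_mean_quality() {
    awk '
        # Build a character -> ASCII code lookup table.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line holds the quality string.
        NR % 4 == 0 {
            n = length($0)
            if (n > maxlen) maxlen = n
            for (i = 1; i <= n; i++) {
                sum[i] += ord[substr($0, i, 1)] - 33
                count[i]++
            }
        }
        END {
            for (i = 1; i <= maxlen; i++)
                printf "%d\t%.1f\n", i, sum[i] / count[i]
        }
    ' "$1"
}

# Show the first ten positions if the example file is present.
if [ -f illumina.fq ]; then
    per_position_mean_quality illumina.fq | head
fi
```

Each output line holds a position and the mean score at that position, separated by a tab.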
"Per tile sequence quality"
This graph appears only for Illumina libraries that retain their original sequence identifiers. It shows whether a loss of quality is associated with a particular part of the flow cell. A plot that is blue all over is a good result.
Bad data
"Per sequence quality scores"
This report allows you to see whether a subset of your sequences has low quality overall.
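A similar per-read summary can be sketched with awk by averaging the quality scores of each read. The sketch assumes four-line FASTQ records with Phred+33 encoded qualities; mean_read_quality is a hypothetical helper name.

```shell
# Mean quality score of each read, one read per output line.
# Assumptions: 4-line FASTQ records, Phred+33 encoding.
# mean_read_quality is a hypothetical helper name.
mean_read_quality() {
    awk '
        # Build a character -> ASCII code lookup table.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Header line: keep the read name without the leading "@".
        NR % 4 == 1 { name = substr($1, 2) }
        # Quality line: average the decoded scores.
        NR % 4 == 0 {
            n = length($0); total = 0
            for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - 33
            if (n > 0) printf "%s\t%.1f\n", name, total / n
        }
    ' "$1"
}

# List the five lowest-scoring reads if the example file is present.
if [ -f illumina.fq ]; then
    mean_read_quality illumina.fq | sort -t "$(printf '\t')" -k2,2n | head -n 5
fi
```

Sorting by the second column brings the lowest-scoring reads to the top, which is one way to spot a low-quality subset.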
"Per base sequence content"
This graph shows the proportion of each of the four nucleotide bases in the sequences.
"Per sequence GC content"
This graph shows the distribution of GC content across all sequences, compared to a modelled normal distribution.
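The underlying per-read value is simply the percentage of G and C bases in each sequence. A minimal sketch, assuming four-line FASTQ records; gc_per_read is a hypothetical helper name.

```shell
# GC percentage of each read, one read per output line.
# Assumption: 4-line FASTQ records.
# gc_per_read is a hypothetical helper name.
gc_per_read() {
    awk '
        NR % 4 == 1 { name = substr($1, 2) }
        NR % 4 == 2 {
            seq = toupper($0)
            n = length(seq)
            # gsub returns the number of G/C characters it removed.
            gc = gsub(/[GC]/, "", seq)
            if (n > 0) printf "%s\t%.1f\n", name, 100 * gc / n
        }
    ' "$1"
}

# Show the first few reads if the example file is present.
if [ -f illumina.fq ]; then
    gc_per_read illumina.fq | head
fi
```

A histogram of the second column gives the kind of curve that the report compares against the normal distribution.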
"Per base N content"
Bases that the sequencer could not call are reported as "N". This graph shows the percentage of N calls at each read position.
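The same per-position percentage can be sketched with awk. The sketch assumes four-line FASTQ records; n_content_per_position is a hypothetical helper name.

```shell
# Percentage of N calls at each read position.
# Assumption: 4-line FASTQ records.
# n_content_per_position is a hypothetical helper name.
n_content_per_position() {
    awk '
        NR % 4 == 2 {
            seq = toupper($0)
            n = length(seq)
            if (n > maxlen) maxlen = n
            for (i = 1; i <= n; i++) {
                total[i]++
                if (substr(seq, i, 1) == "N") ncount[i]++
            }
        }
        END {
            for (i = 1; i <= maxlen; i++)
                printf "%d\t%.1f\n", i, 100 * ncount[i] / total[i]
        }
    ' "$1"
}

# Show the first ten positions if the example file is present.
if [ -f illumina.fq ]; then
    n_content_per_position illumina.fq | head
fi
```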
"Sequence Length Distribution"
Here we see a graph showing the distribution of sequence lengths in the data.
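The same distribution can be tabulated on the command line. A minimal sketch, assuming four-line FASTQ records; length_distribution is a hypothetical helper name.

```shell
# Tabulate read lengths as "length<TAB>count", shortest first.
# Assumption: 4-line FASTQ records.
# length_distribution is a hypothetical helper name.
length_distribution() {
    awk 'NR % 4 == 2 { print length($0) }' "$1" |
        sort -n |
        uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Compare two platforms if the example files are present.
for f in illumina.fq pacbio.fq; do
    if [ -f "$f" ]; then
        echo "== $f"
        length_distribution "$f" | head
    fi
done
```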
"Sequence Duplication Levels"
High levels of duplication may indicate an enrichment bias, such as over-amplification in the PCR step. In a diverse library, most sequences occur only once.
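A rough duplication check counts how many reads are exact copies of another read. This is a simplified sketch and not the estimate that FASTQC itself reports; it assumes four-line FASTQ records, and duplication_summary is a hypothetical helper name.

```shell
# Count total and distinct read sequences and report the duplicated share.
# Assumption: 4-line FASTQ records; only exact matches count as duplicates.
# duplication_summary is a hypothetical helper name.
duplication_summary() {
    awk '
        NR % 4 == 2 { seen[$0]++; total++ }
        END {
            for (s in seen) distinct++
            if (total > 0)
                printf "total=%d distinct=%d duplicated=%.1f%%\n",
                       total, distinct, 100 * (total - distinct) / total
        }
    ' "$1"
}

# Summarize the example file if it is present.
if [ -f illumina.fq ]; then
    duplication_summary illumina.fq
fi
```

Because every distinct sequence is held in memory, this approach suits small example files better than full sequencing runs.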
"Overrepresented sequences and Adapter Content"
Sequences should be diverse, without any individual sequence being present at a high level. If one is, it may be an enrichment artifact, or it may be biologically significant.