08. Assaying sequence quality with FASTQC

This page uses content directly from the Biostar Handbook by Istvan Albert.

Learn:

FASTQC for assaying quality of sequence reads
MultiQC for combining multiple FASTQC reports into one report
Trimmomatic for removing sequence data based on low quality scores

Always start by activating the bioinformatics environment.

conda activate bioinfo

Let's go to our biostar_class directory, create a new directory named "sequence", and move into it so the downloaded data ends up there.

cd biostar_class
mkdir sequence
cd sequence

Now we need to download the sequence data from the Biostar Handbook web page:

curl http://data.biostarhandbook.com/data/sequencing-platform-data.tar.gz --output sequencing-platform-data.tar.gz

Then decompress the archive:

tar xzvf sequencing-platform-data.tar.gz

which lists the file names as they are extracted:

x illumina.fq
x iontorrent.fq
x pacbio.fq
x minion.fq
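
Before running any quality checks, it can help to confirm how many reads each file holds. The sketch below assumes every FASTQ record spans exactly four lines (no wrapped sequence lines). The function name count_reads and the file demo_count.fq are made up for this example; the demo file only exists so the snippet runs on its own.

```shell
# Tiny stand-in file so the snippet runs anywhere (hypothetical data).
cat > demo_count.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
II55
EOF

# Assumption: one record = four lines (header, sequence, "+", qualities).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

count_reads demo_count.fq
```

With the real data you would call it as count_reads illumina.fq, and likewise for the other three files.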

FASTQC is a quality control tool for high-throughput sequence data, developed by the Bioinformatics group at the Babraham Institute.
See the tutorial here.

Now we can run the FASTQC tool:

fastqc illumina.fq

which prints the following progress output:

Started analysis of illumina.fq
Approx 20% complete for illumina.fq
Approx 40% complete for illumina.fq
Approx 60% complete for illumina.fq
Approx 80% complete for illumina.fq
Approx 100% complete for illumina.fq
Analysis complete for illumina.fq
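
The same command can be repeated for the other three files, and the resulting reports can then be combined with MultiQC. The following is a minimal sketch, not a prescribed workflow: the directory name fastqc_reports is an assumption, and the checks around each command are only there so the snippet does nothing harmful when a file or tool is missing.

```shell
# Hypothetical output directory for all reports.
outdir=fastqc_reports
mkdir -p "$outdir"

# Run FASTQC on each platform's file, writing reports into $outdir.
for fq in illumina.fq iontorrent.fq pacbio.fq minion.fq; do
    if [ -f "$fq" ] && command -v fastqc >/dev/null 2>&1; then
        fastqc --outdir "$outdir" "$fq"
    else
        echo "skipping $fq (file or fastqc not found)" >&2
    fi
done

# Combine the individual reports into one MultiQC report, if any exist.
if command -v multiqc >/dev/null 2>&1 && ls "$outdir"/*_fastqc.zip >/dev/null 2>&1; then
    multiqc "$outdir" --outdir "$outdir"
fi
```

Keeping the reports in their own directory avoids mixing the generated .html and .zip files with the sequence data.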

Now, open a web browser, choose "File -> Open File" and follow the path to the illumina_fastqc.html file that was generated by FASTQC.

First we see the summary info.
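
The pass, warn and fail flags shown in the browser are also stored as a tab-separated summary.txt inside the zipped report that FASTQC writes next to the HTML file. As a rough sketch, the flagged modules can be listed from the command line. The function name flagged_modules and the file demo_summary.txt are hypothetical, and the three lines of demo data are invented to mirror the layout (status, module name, file name), not taken from a real run.

```shell
# Made-up example in the summary.txt layout: status <TAB> module <TAB> file.
printf 'PASS\tBasic Statistics\tdemo.fq\n'           >  demo_summary.txt
printf 'WARN\tPer base sequence content\tdemo.fq\n'  >> demo_summary.txt
printf 'FAIL\tPer sequence GC content\tdemo.fq\n'    >> demo_summary.txt

# Print every module whose status is not PASS.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

flagged_modules demo_summary.txt
```

For a real report, assuming the default zipped output, the summary could first be extracted with unzip -p illumina_fastqc.zip illumina_fastqc/summary.txt > summary.txt.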

Next, we can see the "Per base sequence quality".

This chart plots the error likelihood at each position averaged over all measurements.

The vertical axis shows the FASTQ quality scores, which represent error probabilities:

10 corresponds to 10% error (1/10),
20 corresponds to 1% error (1/100),
30 corresponds to 0.1% error (1/1000) and
40 corresponds to 0.01% error (1/10,000), that is, one error every 10,000 measurements.
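
The scores listed above follow the Phred relation P = 10^(-Q/10), where Q is the score and P the error probability. A small sketch of that conversion is shown here. The function names phred_to_prob and char_to_phred are made up for illustration, and decoding a quality character assumes the Phred+33 (Sanger) encoding, which may not hold for older data.

```shell
# Error probability for a Phred score Q: P = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Phred score of a single quality character (assumes Phred+33 encoding).
char_to_phred() {
    echo $(( $(printf '%d' "'$1") - 33 ))
}

for q in 10 20 30 40; do
    echo "Q$q -> $(phred_to_prob "$q")"
done

# The character "I" is ASCII 73, so it encodes a score of 40.
char_to_phred I
```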

The three colored bands (green, yellow, red) illustrate the typical labels assigned to these measures:

reliable (30-40, green)
less reliable (20-30, yellow)
and error prone (1-20, red)

The yellow boxes contain the middle 50% of the data (25th to 75th percentile); the whiskers mark the 10th and 90th percentiles.
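
The idea of summarizing quality at each read position can be sketched with awk. This is a simplified illustration, not FASTQC's own method: it reports only the mean score per position (no quartiles or whiskers) and assumes four-line FASTQ records with Phred+33 encoding. The names mean_quality_per_position and demo_qual.fq are made up for this example.

```shell
# Tiny stand-in file (hypothetical data): "I" encodes Q40, "5" encodes Q20.
cat > demo_qual.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
ACGT
+
II55
EOF

mean_quality_per_position() {
    awk '
        # Build a character -> ASCII code lookup table.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line of a record holds the quality string.
        NR % 4 == 0 {
            n = length($0)
            if (n > maxlen) maxlen = n
            for (i = 1; i <= n; i++) {
                sum[i] += ord[substr($0, i, 1)] - 33
                cnt[i]++
            }
        }
        # Print position and mean score, tab-separated.
        END { for (i = 1; i <= maxlen; i++) printf "%d\t%.1f\n", i, sum[i] / cnt[i] }
    ' "$1"
}

mean_quality_per_position demo_qual.fq
```

In the demo file the first two positions average 40.0 and the last two average 30.0, which mirrors the drop in quality toward the end of reads that this chart often reveals.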

"Per tile sequence quality"

This graph appears only for Illumina libraries whose read names retain the flow cell tile information. It shows whether a loss of quality is associated with one part of the flow cell. A plot that is blue all over is a good result.

Bad data

"Per sequence quality scores"

This report allows you to see whether a group of your sequences are of low quality.

"Per base sequence content"

This graph shows the proportion of each of the four nucleotide bases at each position in the reads.

"Per sequence GC content"

This graph shows the distribution of GC content across all sequences compared to a modeled normal distribution.
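
The quantity behind this plot is the GC percentage of each individual read. A minimal sketch of that calculation follows, assuming four-line FASTQ records. The names gc_per_read and demo_gc.fq are hypothetical, and fitting the normal curve that FASTQC overlays is not attempted here.

```shell
# Tiny stand-in file (hypothetical data).
cat > demo_gc.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
IIII
@read3
ATAT
+
IIII
EOF

gc_per_read() {
    awk 'NR % 4 == 2 && length($0) > 0 {
        seq = toupper($0)
        gc = gsub(/[GC]/, "", seq)   # gsub returns the number of matches replaced
        printf "%.1f\n", 100 * gc / length($0)
    }' "$1"
}

gc_per_read demo_gc.fq
```

Piping the output through sort -n | uniq -c gives a rough text histogram of the distribution.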

"Per base N content"

When the sequencer cannot call a base with confidence, it reports "N". This graph shows the percentage of N calls at each position in the reads.

"Sequence Length Distribution"

Here we see a graph showing the distribution of sequence lengths in the data.
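
The same distribution can be tabulated on the command line. The sketch below assumes four-line FASTQ records; length_distribution and demo_len.fq are names made up for this example.

```shell
# Tiny stand-in file (hypothetical data): two reads of length 4, one of length 8.
cat > demo_len.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
GGCC
+
IIII
@read3
ACGTACGT
+
IIIIIIII
EOF

# Print each read length and how many reads have it, sorted by length.
length_distribution() {
    awk 'NR % 4 == 2 { count[length($0)]++ }
         END { for (len in count) print len "\t" count[len] }' "$1" | sort -n
}

length_distribution demo_len.fq
```

Short-read platforms often produce a single length, while long-read data such as pacbio.fq or minion.fq would be expected to show a much wider spread.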

"Sequence Duplication Levels"

High levels of duplication may indicate an enrichment bias such as over-amplification in the PCR step. In a diverse library, most sequences occur only once.
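
A rough way to see duplication is to count how often each exact sequence occurs. The sketch below is a simplified illustration, not FASTQC's own procedure: it compares whole reads for exact matches and assumes four-line FASTQ records. The names duplication_levels and demo_dup.fq are hypothetical.

```shell
# Tiny stand-in file (hypothetical data): ACGT occurs three times, two reads are unique.
cat > demo_dup.fq <<'EOF'
@read1
ACGT
+
IIII
@read2
ACGT
+
IIII
@read3
ACGT
+
IIII
@read4
GGCC
+
IIII
@read5
TTAA
+
IIII
EOF

# Column 1: duplication level (times a sequence occurs).
# Column 2: number of distinct sequences at that level.
duplication_levels() {
    awk 'NR % 4 == 2 { seen[$0]++ }
         END {
             for (s in seen) level[seen[s]]++
             for (l in level) print l "\t" level[l]
         }' "$1" | sort -n
}

duplication_levels demo_dup.fq
```

A library where nearly all distinct sequences sit at level 1 is diverse; many sequences at high levels would suggest the enrichment bias described above.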

"Overrepresented sequences and Adapter Content"

Sequences should be diverse, without any individual sequence being present at a high level. When a sequence is overrepresented, it may be an enrichment artifact or it may be biologically significant.