Pre-alignment QC

The first step in analyzing RNA sequencing data is to assess the quality of the FASTQ files. This step confirms that the data quality is good and that there are no issues with contamination, such as that arising from adapter read-through.

To run pre-alignment QC, just click on the FASTQ data node and select QA/QC in the menu panel on the right of the analysis page. From there, select "Pre-alignment QA/QC" and make sure "All reads" is checked so that QC is performed on all reads in the FASTQ files. Then, run QC with the defaults.

Note

The K-mer length option, when checked, can be used to determine whether there is contamination, such as adapters, in the sequencing data. However, because the adapter sequences are available for use during the trimming procedure, this option will not be used.

When the "Pre-alignment QA/QC" step completes, double click on the task node to view the results.

The first item in the "Pre-alignment QA/QC" report is a summary table and the columns in the table can be interpreted as follows:

Sample name: This indicates the name of the sample. Partek Flow gives QC results on a per-sample basis.
Total reads: This is the total number of reads (or sequences) in the sample. For paired end sequencing, this refers to the total number of read pairs in the sample.
Average read length: This reports the average length of the reads in the sample.
Average read quality: This column provides the average quality score across all reads in a sample. Higher numbers indicate a lower likelihood of sequencing error. The samples in this dataset have high-quality sequences. For instance, a quality score of 38 indicates a 0.0158% error likelihood.
% N: This is the percentage of unknown (N) bases in the reads for a sample.
% GC: This column shows the percent of the bases in the reads for a sample that are either G or C.
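To make the columns above concrete, here is a minimal Python sketch that computes the same kind of summary from FASTQ records. It is not Partek Flow's implementation. The function names (read_fastq, summarize_reads, phred_to_error_percent) are hypothetical, and the sketch assumes four-line FASTQ records, Phred+33 quality encoding, and a per-base average for the read quality column. It also counts records in a single file, whereas the report counts read pairs for paired-end data.

```python
import io
from typing import Dict, Iterable, Iterator, TextIO, Tuple


def phred_to_error_percent(q: float) -> float:
    """Convert a Phred quality score to an error likelihood in percent."""
    return 100.0 * 10 ** (-q / 10.0)


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (sequence, quality) pairs, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # skip stray blank lines
            continue
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def summarize_reads(records: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Summarize reads; quality strings are assumed to be Phred+33."""
    total_reads = total_bases = n_bases = gc_bases = quality_sum = 0
    for seq, qual in records:
        upper = seq.upper()
        total_reads += 1
        total_bases += len(upper)
        n_bases += upper.count("N")
        gc_bases += upper.count("G") + upper.count("C")
        quality_sum += sum(ord(ch) - 33 for ch in qual)
    if total_bases == 0:
        raise ValueError("no bases found in input")
    return {
        "total_reads": total_reads,
        "average_read_length": total_bases / total_reads,
        # Assumption: average over all bases, not an average of read means.
        "average_read_quality": quality_sum / total_bases,
        "percent_n": 100.0 * n_bases / total_bases,
        "percent_gc": 100.0 * gc_bases / total_bases,
    }


# Tiny in-memory example: 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGNN\n+\n5555\n")
summary = summarize_reads(read_fastq(example))
print(summary)
print(f"Q38 error likelihood: {phred_to_error_percent(38):.4f}%")
```

The conversion function applies the standard Phred relationship, in which the error probability is 10 raised to the power of minus Q divided by 10. This is where the error likelihood quoted for a quality score of 38 comes from.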

The Pre-alignment QA/QC module averages the quality scores at each base position across all sequences in a sample, and the results are shown in the "Average base quality score per position" plot. The figure below shows that each base position has an average quality of 30 or above, which indicates an error likelihood of 0.1% or less.
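The per-position averaging described above can be sketched in a few lines of Python. This is an illustration only, not the module's actual code. The function name mean_quality_per_position is hypothetical, and Phred+33 encoding is assumed. Reads of different lengths contribute only to the positions they cover.

```python
from typing import Iterable, List


def mean_quality_per_position(quality_strings: Iterable[str],
                              offset: int = 33) -> List[float]:
    """Average the quality score at each base position across reads.

    Assumes Phred+33 encoding by default (offset=33).
    """
    sums: List[int] = []
    counts: List[int] = []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(sums):  # first read long enough to reach position i
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


# 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
per_position = mean_quality_per_position(["II5", "5I"])
print(per_position)
```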

Next, the average quality score for each read in a sample is calculated and the distribution of the percentage of reads with a given average quality is generated. The image below shows that most of the sequences in the study samples have an average quality of 30 or above.
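As a rough illustration of how such a distribution could be built, the Python sketch below computes the mean quality of each read, rounds it to the nearest integer score, and reports the percentage of reads at each score. The function name read_quality_distribution is hypothetical, and both the rounding rule and the Phred+33 encoding are assumptions rather than documented behavior of the module.

```python
from collections import Counter
from typing import Dict, Iterable


def read_quality_distribution(quality_strings: Iterable[str],
                              offset: int = 33) -> Dict[int, float]:
    """Map each rounded mean read quality to the percentage of reads."""
    counts: Counter = Counter()
    total = 0
    for qual in quality_strings:
        if not qual:  # skip empty reads
            continue
        mean_q = sum(ord(ch) - offset for ch in qual) / len(qual)
        counts[int(mean_q + 0.5)] += 1  # assumption: round half up
        total += 1
    return {q: 100.0 * n / total for q, n in sorted(counts.items())}


# 'I' encodes Q40 and '5' encodes Q20 in Phred+33.
distribution = read_quality_distribution(["IIII", "II55", "5555", "IIII"])
print(distribution)
```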

Click on any of the samples to view the sample-level QC results. This report contains a plot showing base composition for the reads in a sample.

Tip

"In a random library you would expect that there would be little to no difference between the different bases of a sequence run, so the lines in this plot should run parallel with each other. The relative amount of each base should reflect the overall amount of these bases in your genome, but in any case they should not be hugely imbalanced from each other.

It's worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. Libraries produced by priming using random hexamers (including nearly all RNA-Seq libraries) and those which were fragmented using transposases inherit an intrinsic bias in the positions at which reads start. This bias does not concern an absolute sequence, but instead provides enrichment of a number of different K-mers at the 5' end of the reads. Whilst this is a true technical bias, it isn't something which can be corrected by trimming and in most cases doesn't seem to adversely affect the downstream analysis. It will however produce a warning or error in this module." -- FastQC manual
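The base composition plot discussed above can be approximated with the following Python sketch, which tallies the percentage of each base at every read position. The function name base_composition_per_position is hypothetical, and this is not the code used by Partek Flow or FastQC.

```python
from collections import Counter
from typing import Dict, Iterable, List


def base_composition_per_position(
        sequences: Iterable[str]) -> List[Dict[str, float]]:
    """Return the percentage of A, C, G, T and N at each read position."""
    counts: List[Counter] = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):  # first read long enough to reach position i
                counts.append(Counter())
            counts[i][base] += 1
    composition = []
    for position_counts in counts:
        total = sum(position_counts.values())
        # Any other symbols count toward the total but are not reported.
        composition.append(
            {b: 100.0 * position_counts[b] / total for b in "ACGTN"}
        )
    return composition


composition = base_composition_per_position(["ACGT", "AGGT"])
for position, percentages in enumerate(composition, start=1):
    print(position, percentages)
```

In a library without composition bias, the percentages for each base would stay roughly constant from one position to the next. The random-hexamer bias described in the quote would show up as uneven values in the first few positions.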

The sample-level report includes a plot showing the average and range of quality scores at each base position across all reads.

The sample-level read quality distribution is also provided.

A sample-level read length distribution plot is available as well. Note that prior to either quality or adapter trimming, all of the reads have the same number of bases (151 in this example), as shown by the read length distribution plot.
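A read length distribution is simply a count of reads at each length. The Python sketch below, with the hypothetical function name read_length_distribution, shows the idea on made-up sequences: untrimmed reads all fall at a single length, while reads of mixed lengths spread across several.

```python
from collections import Counter
from typing import Dict, Iterable


def read_length_distribution(sequences: Iterable[str]) -> Dict[int, int]:
    """Count how many reads have each length, sorted by length."""
    lengths = Counter(len(seq) for seq in sequences)
    return dict(sorted(lengths.items()))


# Made-up example data: three untrimmed reads of 151 bases each.
untrimmed = read_length_distribution(["A" * 151] * 3)
# Made-up example data: reads of mixed length, as might be seen after trimming.
mixed = read_length_distribution(["A" * 151, "A" * 100, "A" * 100])
print(untrimmed)
print(mixed)
```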