Lesson 7: Introduction to Next Generation Sequencing (NGS) Data and Quality Control
Lesson 6 Review
Lesson 6 introduced the basics of RNA sequencing, including experimental considerations and basic ideas behind data analysis. In lessons 7 through 14 participants will get hands-on experience with RNA sequencing analysis.
Learning Objectives
After this class, participants will be able to:
- Describe the format in which NGS data is stored.
- Know why sequence quality score is important and how this is calculated.
- Run quality assessment on NGS data and interpret results.
Basic RNA Sequencing Analysis Workflow
The diagram below shows a basic workflow for RNA sequencing analysis, which starts with FASTQ files as input. Quality check is performed on the FASTQ files ensure that sequencing quality is good and that there are no contaminations such as those arising from adapter read through. If there are no issues arising from quality check, perform alignment to determine where in the genome each sequence comes from. If issue(s) arise from quality check, then perform trimming (either to remove low quality reads or contamination such as adapters) and perform QC to ensure the trimmed sequences are good prior to alignment. Quantification of gene expression can be performed on the aligned data. From the expression quantification, the researcher can construct visualizations (ie. PCA and heatmap) as well as perform differential expression analysis and biological interpretation.
Note
A reference genome in the form of a fasta
(or .fa
) file is needed for the alignment step while a gtf
file describing location of genomic features (ie. genes, transcripts, exons, and introns) is needed for the quantification step.
Example Data
The data used are located in /data/classes/BTEP/hcc1395_b4b
on Biowulf. This data was obtained from rnabio.org and essentially a transcriptomic profiling study comparing between tumor and normal HCC1395 breast cancer cells. There a 3 normal and 3 tumor samples for each condition.
Sign onto Biowulf
Before following along in these exercises, please sign onto Biowulf using the ssh
construct below where "user" is the class participant's assigned Biowulf student account ID.
Request an Interactive Session
Request an interactive session to perform compute intensive tasks on Biowulf by issuing the sinteractive
command construct below where:
--mem
: enables the user to request the desired amount of RAM for analysis (12gb requested in this example).--gres
: enables the user to request generic compute resources such as local temporary storage space orlscratch
(5gb requested in this example).
Copy the Data to User's /data
folder
The data is located in the hcc1395_b4b
folder of /data/classes/BTEP
. Copy this to the participants' own /data
directory in Biowulf prior to proceeding using the cp
command with the -r
option for copying folders. The directory path where the data resides is /data/classes/BTEP/hcc1395_b4b
and the destination to which it needs to copied is /data/user/
and name the copied folder hcc1395_b4b
. Be sure to change "user" to the participant's assigned Biowulf student ID.
Use the ls
command to list the content in the participant's Biowulf /data
folder to make sure the copy was successful.
FASTQ Files for hcc1395_b4b
Change into the hcc1395_b4b folder
in the participant's Biowulf /data
directory.
Next, use ls
with the -1
option to list the contents in the reads
directory one item per line.
The reads
folder contains the FASTQ or .fq
files for the hcc1395 data. There are 12 of these, two for each sample as this study generated paired end sequences.
hcc1395_normal_rep1_R1.fq
hcc1395_normal_rep1_R2.fq
hcc1395_normal_rep2_R1.fq
hcc1395_normal_rep2_R2.fq
hcc1395_normal_rep3_R1.fq
hcc1395_normal_rep3_R2.fq
hcc1395_tumor_rep1_R1.fq
hcc1395_tumor_rep1_R2.fq
hcc1395_tumor_rep2_R1.fq
hcc1395_tumor_rep2_R2.fq
hcc1395_tumor_rep3_R1.fq
hcc1395_tumor_rep3_R2.fq
Definition
In paired end sequencing, a template or unknown nucleotide fragment is sequenced from both ends, thus generating two FASTQ files per sample. A gap between the two reads is usually present in this sequencing protocol.
Image source: https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/design-considerations/.
Overview of a FASTQ File
Each FASTQ file is composed of many sequences. The tool seqkit
and its stats
function can be used to get statistics the hcc1395 FASTQ files.
Change in the reads
folder for this.
Load seqkit
.
Run seqkit stats
for all of the hcc1395 FASTQ files in one go by using *
before the .fq
extention. *
acts as a wildcard.
Query of the stats for the FASTQ files generates the results below, which informs of things such as the number of sequences (or reads) in a FASTQ file. For the hcc1395_normal_rep1 biological replicates, both files in the pair (R1 and R2) have 331,958 sequences. The average length of the sequences in these files is 151 bases. Pairs in paired end sequencing should have the same number of sequences because the reads were generated from the same template.
file format type num_seqs sum_len min_len avg_len max_len
hcc1395_normal_rep1_R1.fq FASTQ DNA 331,958 50,125,658 151 151 151
hcc1395_normal_rep1_R2.fq FASTQ DNA 331,958 50,125,658 151 151 151
hcc1395_normal_rep2_R1.fq FASTQ DNA 331,958 50,125,658 151 151 151
hcc1395_normal_rep2_R2.fq FASTQ DNA 331,958 50,125,658 151 151 151
hcc1395_normal_rep3_R1.fq FASTQ DNA 331,956 50,125,356 151 151 151
hcc1395_normal_rep3_R2.fq FASTQ DNA 331,956 50,125,356 151 151 151
hcc1395_tumor_rep1_R1.fq FASTQ DNA 390,607 58,981,657 151 151 151
hcc1395_tumor_rep1_R2.fq FASTQ DNA 390,607 58,981,657 151 151 151
hcc1395_tumor_rep2_R1.fq FASTQ DNA 390,607 58,981,657 151 151 151
hcc1395_tumor_rep2_R2.fq FASTQ DNA 390,607 58,981,657 151 151 151
hcc1395_tumor_rep3_R1.fq FASTQ DNA 390,607 58,981,657 151 151 151
hcc1395_tumor_rep3_R2.fq FASTQ DNA 390,607 58,981,657 151 151 151
Recall from the seqkit stat
results shown above that each FASTQ file contains many sequencing reads. The record for each sequencing read is composed of four lines and these include the following.
- The first line is the sequencing read metadata (ie. instrument used, location on the flow cell where the read was derived, and whether the read is the first or second of a pair in paired end sequencing) - in this example,
- "/1" at the end of the metadata line in the reads from
hcc1395_normal_rep1_R1.fq
denotes the first read of a pair - "/2" at the end of the metadata line in the reads from
hcc1395_normal_rep1_R2.fq
denotes the second read of a pair - The second line is the biological sequence
- The third line is a "+"
- The fourth line contains the quality score of each of the bases in the sequencing read. The quality score reflects the likelihood that a base at a particular location along the read is wrong.
Below, head -4
is used to show the first sequencing reads of hcc1395_normal_rep1_R1.fq
and hcc1395_normal_rep1_R2.fq
. -4
in the head
command tells it to only retrieve the first 4 lines in a file.
@K00193:38:H3MYFBBXX:4:1101:10003:44458/1
TTCCTTATGAAACAGGAAGAGTCCCTGGGCCCAGGCCTGGCCCACGGTTGTCAAGGCACATCATTGCCAGCAAGCTGAAGCATACCAGCAGCCACAACCTAGATCTCATTCCCAACCCAAAGTTCTGACTTCTGTACAAACTCGTTTCCAG
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKKFKKFKKKKF<AAKKKKKKKKKKKKKKKKFKKKFKKKKKKKKKKKFKAFKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFKFFKKKKKKKKKKKKFKKKK
@K00193:38:H3MYFBBXX:4:1101:10003:44458/2
ATGGAACTCCCTACTGACCTCTGAGAACTGGAAACGAGTTTGTACAGAAGTCAGAACTTTGGGTTGGGAATGAGATCTAGGTTGTGGCTGCTGGTATGCTTCAGCTTGCTGGCAATGATGTGCCTTGACAACCGTGGGCCAGGCCTGGGCC
+
A<FFFKKKKKKKKKKKKKKKKKKKKKKKFKAKKKKKKK7FFKKKKKKKKK7FKKFKKKKKKKKKKKKKKKKFKKFKKKKKKKAFFFKKKKFK<FKKKKKKKKKKKKFKKKKKKKKFF7AF<FK<A<<AAAAA,AA,<<AFFFFF<F<(7<A
Click here and here to learn more about the quality scores. The image below shows a table that translates the characters you see in the quality score lines to a numeric score.
Note
$$quality\ score = -10\times log10(error\ probability)$$ $$error\ probability=10^{\frac{quality\ score}{-10}}$$
Quality Assessment on the hcc1395
FASTQ Files
Go back to the hcc1395_b4b
folder.
Make a folder called pre_alignment_qc_raw
.
Load fastqc
, which is a tool used to perform quality check on Illumina sequencing data.
Change into the pre_alignment_qc_raw
folder.
Next, run fastqc
using the following command construct where:
../reads/*.fq
tellsfastqc
to look one directory back../
and then in thereads
folder and perform QC on all of the.fq
files.*
is used as a wildcard.-o
is an option infastqc
that allows for specification of output folder. In this case, the output folder should bepre_alignment_qc_raw
. Since the current directory ispre_alignment_qc_raw
, the '.' can be used to denote here, in the current folder.
List the contents for pre_alignment_qc_raw
after QC completes. FASTQC generates a html report and provides the results in a zip file for each file. The html report can be viewed on a local web browser.
hcc1395_normal_rep1_R1_fastqc.html hcc1395_tumor_rep1_R1_fastqc.html
hcc1395_normal_rep1_R1_fastqc.zip hcc1395_tumor_rep1_R1_fastqc.zip
hcc1395_normal_rep1_R2_fastqc.html hcc1395_tumor_rep1_R2_fastqc.html
hcc1395_normal_rep1_R2_fastqc.zip hcc1395_tumor_rep1_R2_fastqc.zip
hcc1395_normal_rep2_R1_fastqc.html hcc1395_tumor_rep2_R1_fastqc.html
hcc1395_normal_rep2_R1_fastqc.zip hcc1395_tumor_rep2_R1_fastqc.zip
hcc1395_normal_rep2_R2_fastqc.html hcc1395_tumor_rep2_R2_fastqc.html
hcc1395_normal_rep2_R2_fastqc.zip hcc1395_tumor_rep2_R2_fastqc.zip
hcc1395_normal_rep3_R1_fastqc.html hcc1395_tumor_rep3_R1_fastqc.html
hcc1395_normal_rep3_R1_fastqc.zip hcc1395_tumor_rep3_R1_fastqc.zip
hcc1395_normal_rep3_R2_fastqc.html hcc1395_tumor_rep3_R2_fastqc.html
hcc1395_normal_rep3_R2_fastqc.zip hcc1395_tumor_rep3_R2_fastqc.zip
Copy to hcc1395_normal_rep1_R1_fastqc.html
to local computer to learn how to interpret a FASTQC report. To do this open another terminal (Mac) or command prompt (Windows) and change into the local Downloads
folder.
In the scp
construct below, change user to the participant's assigned Biowulf student ID and can be broken down as follows.
user@helix.nih.gov
: connects the user tohelix
, which is the HPC partition used for interactive data transfer.:
separates thehelix
login part from the path of the file or folder to transfer, which is/data/user/hcc1395_b4b/pre_alignment_qc_raw/hcc1395_tumor_rep1_R1_fastqc.html
..
: specifies forscp
to download to the current directory, which should be the localDownloads
folder.
Enter the user's Biowulf password when prompted and the download should commence.
scp user@helix.nih.gov:/data/user/hcc1395_b4b/pre_alignment_qc_raw/hcc1395_tumor_rep1_R1_fastqc.html .
Interpreting FASTQC Report
Report Navigator and Summary
The first item is the navigation panel that enables scientists to quickly link to different portions of the FASTQC report. The QC modules that passed are indicated by a green circle with a check. Those with warnings are denoted by a yellow cirlces with a "!". Failed modules are indicated by a red circle with "x".
Next there is a summary statistics table, which includes the following information.
- Name of the FASTQ file in which the report was generated for (hcc1395_tumor_rep1_R1.fq in this case)
- The number of sequences in the file.
- Number of sequences flagged with poor quality.
- The sequence length (ie. how many bases are in each sequencing read of the FASTQ file).
Per Base Sequence Quality
Next, there is the "Per base sequence quality" plot.
This chart plots the error likelihood at each base position averaged over all sequences in a FASTQ file.
-
On the vertical axis are the quality scores that are in row 4 of the sequencing reads in a FASTQ file. These quality scores represent error probabilities, where:
- 10 corresponds to 10% error (1/10),
- 20 corresponds to 1% error (1/100),
- 30 corresponds to 0.1% error (1/1000) and
- 40 corresponds to one error every 10,000 measurements (1/10,000) that is an error rate of 0.01%
-
The three colored bands (green, yellow, red) illustrate the typical labels assigned to these measure:
- reliable (28-40, green)
- less reliable (20-28, yellow)
- error prone (1-20, red)
-
The yellow boxes contain 50% of the data, the whiskers indicate the 75% outliers.
-
The red line inside the yellow boxes is the median quality score for that base.
-
The blue line is the average quality score at a particular base.
Per Tile Sequence Quality
Next, there is the "Per tile sequence quality" plot. This graph informs whether something is wrong with a tile in the flow cell. Along the horizontal axis are the base positions. The vertical axis represents the tile number in which the read came from.
This plot compares the average quality score of a tile to the average quality score of all tiles at a particular base position. -- https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/
The interpretation for this plot is as follows
- Colder colors indicate that the average quality score at a tile is at or above the average quality score of all tiles at a particular base position. So a plot that is entirely blue is good.
- Hotter colors indicate that the average quality score at a tile is below the average quality score of all tiles at a particular base position. So a plot with red indicates a part of the flow cell has problems.
Bad per tile quality. Source: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html
Per Sequence Quality Scores
Next, the distribution of the sequence quality scores is shown in the "Per sequence quality scores" plot. This graph can reveal whether the quality of a subset of sequences is poor. Here the sequences have good quality, where most reads have quality that clusters at around 37 or above (~0.02% error likelihood).
Per Base Sequence Content
The next figure shows "Per base sequence content", which is essentially the sequence make up along the bases of reads in the FASTQ file. If a library is random, then the percent composition of each nucleotide base (A,T,C,G) should be the same (~25%).
This module fails for hcc1395_tumor_rep1_R1.fq
because the difference between the percentage of A and the percentage of T is larger than 20 at a particular location OR the difference between the percentage of C and the percentage of G is larger than 20 at a particular location -- Babraham bioinformatics Per base sequence content.
In hcc1395_tumor_rep1_R1.fq
, it looks like the difference between the percent composition of A and T at base position 2 is causing the failure. Unfortunately, this type of uneveness in base distribution at the beginning of a read is observed in RNA sequencing due to priming with random hexamers during the library preparation stage.
Per Sequence GC Content
Next, the GC content across each sequence compared to a normal distribution in what is shown in the "Per sequence GC content" plot. The GC content in hcc1395_tumor_rep1_R1.fq
is off from the normal theoretical distribution.
The peak of this theoretical distribution is an estimate of the GC content of the underlying genome. Deviation from of the GC content from the theoretical distribution could be caused by contamination or sequencing bias. -- Babraham bioinformatics Per base sequence content.
Per Base N Content
Whether there are any unknown bases in the sequencing reads is shown in next plot, which is the "Per base n content".
The sequence length distribution of hcc1395_tumor_rep1_R1.fq
in what is known as the "Sequence Length Distribution" plot. All sequences have 151 bases as trimming has not been performed.
Sequence Duplication Level
The plot below shows the sequence duplication levels. High levels of duplication may indicate an enrichment bias such as over-amplification in the PCR step. However, in RNA sequencing, duplicate sequences could be biologically meaningful as these could be derived from multiple copies of the same transcript.
See here for advice regarding handling duplicates in RNA sequencing.
Overrepresented Sequences
The overrepresented sequences graph is shown next. Presence of overrepresented sequences may indicate enrichment artifact or the sequencing library is not very diverse. On the other hand, overrepresented sequences could have biological significance.
Adapter Contamination
If there is adapter contamination in the sequences, then this will be shown in the next plot.
Conclusions from hcc1395_normal_rep1_R1.fastq
Quality Assessment
In general, FASTQC shows that the file hcc1395_normal_rep1_R1.fastq
have good quality sequences (less than 1% error likelihood). The concern is the presence of Illuminal universal adapters.
Merging Multiple FASTQC Reports into One
Reviewing multiple FASTQC reports is cumbersome but there is an application called MultiQC that allows scientists to combine many FASTQC reports into one.
To use MultiQC on Biowulf (or other HPC systems), first load it using the following.
The --filename
option allows users to enter a base name (ie. file name without an extension) for the output MultiQC report. Here, the base name pre_alignment_qc_raw
will be used.
MultiQC generates a folder that contains the QC data should the researcher choose to interrogate these programmatically. It provides a report in the html format as well, which can be downloaded to local computer and viewed in a web browser.
Take a look at the MultiQC html report by copying it to the Downloads folder on local computer using scp
.
MultiQC Navigation Panel and Summary Statisics
Upon opening the MultiQC report, the user will be greeted with a navigation panel (similar to that found in the individual FASTQC reports) that allows users to quickly move to different sections of the report. A link to get help for using the MultiQC report is available as well. To the right of the report page, there is a tool box that allows the researcher to do things like highlighting different samples in different colors for better visualization, rename samples, and export each of the individual QC plots as an image for inclusion in presentations and/or publications. MultiQC reports are interactive.
The General Statistics table shows some basic statistics about the samples, including percentage of duplicate reads, GC content, and number of sequences (reported as million of sequences in the M Seqs column).
Click on the Configure Columns tab to choose which columns to include in the General Statistics table and the plot button to visualize the data in graphical format.
Altering the way the total number of sequences is reported
The number of sequences could be reported as the actual number rather than in millions. Users can alter this by issuing the `--config` option in `multiqc` and specifying a custom report configuration `yaml` file. The content for this configuration file is shown below and can be explained as follows. * `read_count_multiplier`: set to 1 to report the actual sequence number. * `read_count_prefix`: set to "" to remove the units at the end of the number of sequences. The default is "M" for milion. * Under `table_columns_name` change the `total_sequences` column heading to "# Sequences" or any name that reflects the fact the actual number of sequences is being reported.MultiQC Sequence Count
The Sequence Count plot shows the break down of unique and duplicate reads for each FASTQ file. Again, duplication suggests some sort of enrichment bias or in the case of RNA sequencing, it could be that the duplicates are biologically meaningful due to multiple copies of the same trancript being expressed.
MultiQC Mean Quality Score
The average quality score of the sequencing reads in FASTQ files along each base position is shown in the figure below. Regarding the boxes at the top of the QC plots, green means QC passed while orange and red indicate warning and failed, respectively.
Click on the green rectangle labeled with the number of files to filter out samples to see in the plot.
Upon clicking on the green rectangled labeled with the number of files, a tool box will appear and users can click on the "x" next to each file to deselect and remove it from the visualization.
MultiQC Quality Score Distribution
The quality score distributions of each of the FASTQ files are plotted in the next figure.
MultiQC Per Base Sequence Content
An interactive heatmap of the percent composition of each nucleotide base (A,T,C,G) along the bases (horizontal axis) for each of the FASTQ files (vertical axis) is presented next. Hover over a tile to see the corresponding numbers for a given sample. Click on any row in this heatmap to get the base composition plot for just that sample.
MultiQC GC Distribution
The GC distribution for each of the FASTQ files is shown in figure below.
MultiQC Per Base N Content
The percentage of unknown bases (or N) per FASTQ file is available as the next figure.
MultiQC Sequence Length Distribution
As shown below, all FASTQ files for the hcc13995 study have sequences that contain 151 bases.
MultiQC Duplication Level
The percentage of sequences with the duplication level shown on the horizontal axis.
MultiQC Adapter Contamination
The sequences in the hcc1395 data have adapter contamination. Mousing over the area where the adpater content starts to increase suggests Illumina Universal Adapter.
Note
FASTQC will issue a warning if any sequence is present in more than 5% of all reads and failure if any sequence is present in more than 10% of all reads. -- FASTQC manual
MultiQC Status Heatmap
Finally, a heatmap showing the per file QC modules that have passed, warn, or failed is provided.