Lesson 7: Introduction to Next Generation Sequencing (NGS) Data and Quality Control

Lesson 6 Review

Lesson 6 introduced the basics of RNA sequencing, including experimental considerations and basic ideas behind data analysis. In lessons 7 through 14 participants will get hands-on experience with RNA sequencing analysis.

Learning Objectives

After this class, participants will be able to:

Describe the format in which NGS data is stored.
Know why sequence quality score is important and how this is calculated.
Run quality assessment on NGS data and interpret results.

Basic RNA Sequencing Analysis Workflow

The diagram below shows a basic workflow for RNA sequencing analysis, which starts with FASTQ files as input. Quality check is performed on the FASTQ files ensure that sequencing quality is good and that there are no contaminations such as those arising from adapter read through. If there are no issues arising from quality check, perform alignment to determine where in the genome each sequence comes from. If issue(s) arise from quality check, then perform trimming (either to remove low quality reads or contamination such as adapters) and perform QC to ensure the trimmed sequences are good prior to alignment. Quantification of gene expression can be performed on the aligned data. From the expression quantification, the researcher can construct visualizations (ie. PCA and heatmap) as well as perform differential expression analysis and biological interpretation.

Note

A reference genome in the form of a fasta (or .fa) file is needed for the alignment step while a gtf file describing location of genomic features (ie. genes, transcripts, exons, and introns) is needed for the quantification step.

Example Data

The data used are located in /data/classes/BTEP/hcc1395_b4b on Biowulf. This data was obtained from rnabio.org and essentially a transcriptomic profiling study comparing between tumor and normal HCC1395 breast cancer cells. There a 3 normal and 3 tumor samples for each condition.

Sign onto Biowulf

Before following along in these exercises, please sign onto Biowulf using the ssh construct below where "user" is the class participant's assigned Biowulf student account ID.

ssh user@biowulf.nih.gov

Request an Interactive Session

Request an interactive session to perform compute intensive tasks on Biowulf by issuing the sinteractive command construct below where:

--mem: enables the user to request the desired amount of RAM for analysis (12gb requested in this example).
--gres: enables the user to request generic compute resources such as local temporary storage space or lscratch (5gb requested in this example).

sinteractive --mem=12gb --gres=lscratch:5

Copy the Data to User's `/data` folder

The data is located in the hcc1395_b4b folder of /data/classes/BTEP. Copy this to the participants' own /data directory in Biowulf prior to proceeding using the cp command with the -r option for copying folders. The directory path where the data resides is /data/classes/BTEP/hcc1395_b4b and the destination to which it needs to copied is /data/user/ and name the copied folder hcc1395_b4b. Be sure to change "user" to the participant's assigned Biowulf student ID.

cp -r /data/classes/BTEP/hcc1395_b4b /data/user/hcc1395_b4b

Use the ls command to list the content in the participant's Biowulf /data folder to make sure the copy was successful.

ls

hcc1395_b4b

FASTQ Files for `hcc1395_b4b`

Change into the hcc1395_b4b folder in the participant's Biowulf /data directory.

cd hcc1395_b4b

Next, use ls with the -1 option to list the contents in the reads directory one item per line.

ls -1 reads

The reads folder contains the FASTQ or .fq files for the hcc1395 data. There are 12 of these, two for each sample as this study generated paired end sequences.

hcc1395_normal_rep1_R1.fq
hcc1395_normal_rep1_R2.fq
hcc1395_normal_rep2_R1.fq
hcc1395_normal_rep2_R2.fq
hcc1395_normal_rep3_R1.fq
hcc1395_normal_rep3_R2.fq
hcc1395_tumor_rep1_R1.fq
hcc1395_tumor_rep1_R2.fq
hcc1395_tumor_rep2_R1.fq
hcc1395_tumor_rep2_R2.fq
hcc1395_tumor_rep3_R1.fq
hcc1395_tumor_rep3_R2.fq

Definition

In paired end sequencing, a template or unknown nucleotide fragment is sequenced from both ends, thus generating two FASTQ files per sample. A gap between the two reads is usually present in this sequencing protocol.
Image source: https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/rna-sequencing/performing-a-rna-seq-experiment/design-considerations/.

Overview of a FASTQ File

Each FASTQ file is composed of many sequences. The tool seqkit and its stats function can be used to get statistics the hcc1395 FASTQ files.

Change in the reads folder for this.

cd reads

Load seqkit.

module load seqkit

[+] Loading seqkit  2.7.0

Run seqkit stats for all of the hcc1395 FASTQ files in one go by using * before the .fq extention. * acts as a wildcard.

seqkit stats *.fq

Query of the stats for the FASTQ files generates the results below, which informs of things such as the number of sequences (or reads) in a FASTQ file. For the hcc1395_normal_rep1 biological replicates, both files in the pair (R1 and R2) have 331,958 sequences. The average length of the sequences in these files is 151 bases. Pairs in paired end sequencing should have the same number of sequences because the reads were generated from the same template.

file                       format  type  num_seqs     sum_len  min_len  avg_len  max_len
hcc1395_normal_rep1_R1.fq  FASTQ   DNA    331,958  50,125,658      151      151      151
hcc1395_normal_rep1_R2.fq  FASTQ   DNA    331,958  50,125,658      151      151      151
hcc1395_normal_rep2_R1.fq  FASTQ   DNA    331,958  50,125,658      151      151      151
hcc1395_normal_rep2_R2.fq  FASTQ   DNA    331,958  50,125,658      151      151      151
hcc1395_normal_rep3_R1.fq  FASTQ   DNA    331,956  50,125,356      151      151      151
hcc1395_normal_rep3_R2.fq  FASTQ   DNA    331,956  50,125,356      151      151      151
hcc1395_tumor_rep1_R1.fq   FASTQ   DNA    390,607  58,981,657      151      151      151
hcc1395_tumor_rep1_R2.fq   FASTQ   DNA    390,607  58,981,657      151      151      151
hcc1395_tumor_rep2_R1.fq   FASTQ   DNA    390,607  58,981,657      151      151      151
hcc1395_tumor_rep2_R2.fq   FASTQ   DNA    390,607  58,981,657      151      151      151
hcc1395_tumor_rep3_R1.fq   FASTQ   DNA    390,607  58,981,657      151      151      151
hcc1395_tumor_rep3_R2.fq   FASTQ   DNA    390,607  58,981,657      151      151      151

Recall from the seqkit stat results shown above that each FASTQ file contains many sequencing reads. The record for each sequencing read is composed of four lines and these include the following.

The first line is the sequencing read metadata (ie. instrument used, location on the flow cell where the read was derived, and whether the read is the first or second of a pair in paired end sequencing) - in this example,
"/1" at the end of the metadata line in the reads from hcc1395_normal_rep1_R1.fq denotes the first read of a pair
"/2" at the end of the metadata line in the reads from hcc1395_normal_rep1_R2.fq denotes the second read of a pair
The second line is the biological sequence
The third line is a "+"
The fourth line contains the quality score of each of the bases in the sequencing read. The quality score reflects the likelihood that a base at a particular location along the read is wrong.

Below, head -4 is used to show the first sequencing reads of hcc1395_normal_rep1_R1.fq and hcc1395_normal_rep1_R2.fq. -4 in the head command tells it to only retrieve the first 4 lines in a file.

head -4 hcc1395_normal_rep1_R1.fq

@K00193:38:H3MYFBBXX:4:1101:10003:44458/1
TTCCTTATGAAACAGGAAGAGTCCCTGGGCCCAGGCCTGGCCCACGGTTGTCAAGGCACATCATTGCCAGCAAGCTGAAGCATACCAGCAGCCACAACCTAGATCTCATTCCCAACCCAAAGTTCTGACTTCTGTACAAACTCGTTTCCAG
+
AAFFFKKKKKKKKKKKKKKKKKKKKKKKKFKKFKKKKF<AAKKKKKKKKKKKKKKKKFKKKFKKKKKKKKKKKFKAFKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFKFFKKKKKKKKKKKKFKKKK

head -4 hcc1395_normal_rep1_R2.fq

@K00193:38:H3MYFBBXX:4:1101:10003:44458/2
ATGGAACTCCCTACTGACCTCTGAGAACTGGAAACGAGTTTGTACAGAAGTCAGAACTTTGGGTTGGGAATGAGATCTAGGTTGTGGCTGCTGGTATGCTTCAGCTTGCTGGCAATGATGTGCCTTGACAACCGTGGGCCAGGCCTGGGCC
+
A<FFFKKKKKKKKKKKKKKKKKKKKKKKFKAKKKKKKK7FFKKKKKKKKK7FKKFKKKKKKKKKKKKKKKKFKKFKKKKKKKAFFFKKKKFK<FKKKKKKKKKKKKFKKKKKKKKFF7AF<FK<A<<AAAAA,AA,<<AFFFFF<F<(7<A

Click here and here to learn more about the quality scores. The image below shows a table that translates the characters you see in the quality score lines to a numeric score.

Note

$$quality\ score = -10\times log10(error\ probability)$$ $$error\ probability=10^{\frac{quality\ score}{-10}}$$

Quality Assessment on the `hcc1395` FASTQ Files

Go back to the hcc1395_b4b folder.

cd /data/users/hcc1395_b4b

Make a folder called pre_alignment_qc_raw.

mkdir pre_alignment_qc_raw

Load fastqc, which is a tool used to perform quality check on Illumina sequencing data.

module load fastqc

[+] Loading singularity  4.1.5  on cn0047 
[+] Loading fastqc  0.12.1

Change into the pre_alignment_qc_raw folder.

cd  pre_alignment_qc_raw

Next, run fastqc using the following command construct where:

../reads/*.fq tells fastqc to look one directory back ../ and then in the reads folder and perform QC on all of the .fq files. * is used as a wildcard.
-o is an option in fastqc that allows for specification of output folder. In this case, the output folder should be pre_alignment_qc_raw. Since the current directory is pre_alignment_qc_raw, the '.' can be used to denote here, in the current folder.

fastqc ../reads/*.fq -o .

List the contents for pre_alignment_qc_raw after QC completes. FASTQC generates a html report and provides the results in a zip file for each file. The html report can be viewed on a local web browser.

ls

hcc1395_normal_rep1_R1_fastqc.html  hcc1395_tumor_rep1_R1_fastqc.html
hcc1395_normal_rep1_R1_fastqc.zip   hcc1395_tumor_rep1_R1_fastqc.zip
hcc1395_normal_rep1_R2_fastqc.html  hcc1395_tumor_rep1_R2_fastqc.html
hcc1395_normal_rep1_R2_fastqc.zip   hcc1395_tumor_rep1_R2_fastqc.zip
hcc1395_normal_rep2_R1_fastqc.html  hcc1395_tumor_rep2_R1_fastqc.html
hcc1395_normal_rep2_R1_fastqc.zip   hcc1395_tumor_rep2_R1_fastqc.zip
hcc1395_normal_rep2_R2_fastqc.html  hcc1395_tumor_rep2_R2_fastqc.html
hcc1395_normal_rep2_R2_fastqc.zip   hcc1395_tumor_rep2_R2_fastqc.zip
hcc1395_normal_rep3_R1_fastqc.html  hcc1395_tumor_rep3_R1_fastqc.html
hcc1395_normal_rep3_R1_fastqc.zip   hcc1395_tumor_rep3_R1_fastqc.zip
hcc1395_normal_rep3_R2_fastqc.html  hcc1395_tumor_rep3_R2_fastqc.html
hcc1395_normal_rep3_R2_fastqc.zip   hcc1395_tumor_rep3_R2_fastqc.zip

Copy to hcc1395_normal_rep1_R1_fastqc.html to local computer to learn how to interpret a FASTQC report. To do this open another terminal (Mac) or command prompt (Windows) and change into the local Downloads folder.

cd Downloads

In the scp construct below, change user to the participant's assigned Biowulf student ID and can be broken down as follows.

user@helix.nih.gov: connects the user to helix, which is the HPC partition used for interactive data transfer.
: separates the helix login part from the path of the file or folder to transfer, which is /data/user/hcc1395_b4b/pre_alignment_qc_raw/hcc1395_tumor_rep1_R1_fastqc.html.
.: specifies for scp to download to the current directory, which should be the local Downloads folder.

Enter the user's Biowulf password when prompted and the download should commence.

scp user@helix.nih.gov:/data/user/hcc1395_b4b/pre_alignment_qc_raw/hcc1395_tumor_rep1_R1_fastqc.html .

Interpreting FASTQC Report

Report Navigator and Summary

The first item is the navigation panel that enables scientists to quickly link to different portions of the FASTQC report. The QC modules that passed are indicated by a green circle with a check. Those with warnings are denoted by a yellow cirlces with a "!". Failed modules are indicated by a red circle with "x".

Next there is a summary statistics table, which includes the following information.

Name of the FASTQ file in which the report was generated for (hcc1395_tumor_rep1_R1.fq in this case)
The number of sequences in the file.
Number of sequences flagged with poor quality.
The sequence length (ie. how many bases are in each sequencing read of the FASTQ file).

Per Base Sequence Quality

Next, there is the "Per base sequence quality" plot.

This chart plots the error likelihood at each base position averaged over all sequences in a FASTQ file.

On the vertical axis are the quality scores that are in row 4 of the sequencing reads in a FASTQ file. These quality scores represent error probabilities, where:
- 10 corresponds to 10% error (1/10),
- 20 corresponds to 1% error (1/100),
- 30 corresponds to 0.1% error (1/1000) and
- 40 corresponds to one error every 10,000 measurements (1/10,000) that is an error rate of 0.01%
The three colored bands (green, yellow, red) illustrate the typical labels assigned to these measure:
- reliable (28-40, green)
- less reliable (20-28, yellow)
- error prone (1-20, red)
The yellow boxes contain 50% of the data, the whiskers indicate the 75% outliers.
The red line inside the yellow boxes is the median quality score for that base.
The blue line is the average quality score at a particular base.

Per Tile Sequence Quality

Next, there is the "Per tile sequence quality" plot. This graph informs whether something is wrong with a tile in the flow cell. Along the horizontal axis are the base positions. The vertical axis represents the tile number in which the read came from.

This plot compares the average quality score of a tile to the average quality score of all tiles at a particular base position. -- https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/

The interpretation for this plot is as follows

Colder colors indicate that the average quality score at a tile is at or above the average quality score of all tiles at a particular base position. So a plot that is entirely blue is good.
Hotter colors indicate that the average quality score at a tile is below the average quality score of all tiles at a particular base position. So a plot with red indicates a part of the flow cell has problems.

Bad per tile quality. Source: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html

Per Sequence Quality Scores

Next, the distribution of the sequence quality scores is shown in the "Per sequence quality scores" plot. This graph can reveal whether the quality of a subset of sequences is poor. Here the sequences have good quality, where most reads have quality that clusters at around 37 or above (~0.02% error likelihood).

Per Base Sequence Content

The next figure shows "Per base sequence content", which is essentially the sequence make up along the bases of reads in the FASTQ file. If a library is random, then the percent composition of each nucleotide base (A,T,C,G) should be the same (~25%).

This module fails for hcc1395_tumor_rep1_R1.fq because the difference between the percentage of A and the percentage of T is larger than 20 at a particular location OR the difference between the percentage of C and the percentage of G is larger than 20 at a particular location -- Babraham bioinformatics Per base sequence content.

In hcc1395_tumor_rep1_R1.fq, it looks like the difference between the percent composition of A and T at base position 2 is causing the failure. Unfortunately, this type of uneveness in base distribution at the beginning of a read is observed in RNA sequencing due to priming with random hexamers during the library preparation stage.

Per Sequence GC Content

Next, the GC content across each sequence compared to a normal distribution in what is shown in the "Per sequence GC content" plot. The GC content in hcc1395_tumor_rep1_R1.fq is off from the normal theoretical distribution.

The peak of this theoretical distribution is an estimate of the GC content of the underlying genome. Deviation from of the GC content from the theoretical distribution could be caused by contamination or sequencing bias. -- Babraham bioinformatics Per base sequence content.

Per Base N Content

Whether there are any unknown bases in the sequencing reads is shown in next plot, which is the "Per base n content".

The sequence length distribution of hcc1395_tumor_rep1_R1.fq in what is known as the "Sequence Length Distribution" plot. All sequences have 151 bases as trimming has not been performed.

Sequence Duplication Level

The plot below shows the sequence duplication levels. High levels of duplication may indicate an enrichment bias such as over-amplification in the PCR step. However, in RNA sequencing, duplicate sequences could be biologically meaningful as these could be derived from multiple copies of the same transcript.

See here for advice regarding handling duplicates in RNA sequencing.

Overrepresented Sequences

The overrepresented sequences graph is shown next. Presence of overrepresented sequences may indicate enrichment artifact or the sequencing library is not very diverse. On the other hand, overrepresented sequences could have biological significance.

Adapter Contamination

If there is adapter contamination in the sequences, then this will be shown in the next plot.

Conclusions from `hcc1395_normal_rep1_R1.fastq` Quality Assessment

In general, FASTQC shows that the file hcc1395_normal_rep1_R1.fastq have good quality sequences (less than 1% error likelihood). The concern is the presence of Illuminal universal adapters.

Merging Multiple FASTQC Reports into One

Reviewing multiple FASTQC reports is cumbersome but there is an application called MultiQC that allows scientists to combine many FASTQC reports into one.

To use MultiQC on Biowulf (or other HPC systems), first load it using the following.

module load multiqc

The --filename option allows users to enter a base name (ie. file name without an extension) for the output MultiQC report. Here, the base name pre_alignment_qc_raw will be used.

multiqc --filename pre_alignment_qc_raw .

ls

MultiQC generates a folder that contains the QC data should the researcher choose to interrogate these programmatically. It provides a report in the html format as well, which can be downloaded to local computer and viewed in a web browser.

pre_alignment_qc_raw_data
pre_alignment_qc_raw.html

Take a look at the MultiQC html report by copying it to the Downloads folder on local computer using scp.

 scp user8@helix.nih.gov:/data/user/hcc1395_b4b/pre_alignment_qc_raw/pre_alignment_qc_raw.html .

Upon opening the MultiQC report, the user will be greeted with a navigation panel (similar to that found in the individual FASTQC reports) that allows users to quickly move to different sections of the report. A link to get help for using the MultiQC report is available as well. To the right of the report page, there is a tool box that allows the researcher to do things like highlighting different samples in different colors for better visualization, rename samples, and export each of the individual QC plots as an image for inclusion in presentations and/or publications. MultiQC reports are interactive.

The General Statistics table shows some basic statistics about the samples, including percentage of duplicate reads, GC content, and number of sequences (reported as million of sequences in the M Seqs column).

Click on the Configure Columns tab to choose which columns to include in the General Statistics table and the plot button to visualize the data in graphical format.

Altering the way the total number of sequences is reported

The number of sequences could be reported as the actual number rather than in millions. Users can alter this by issuing the `--config` option in `multiqc` and specifying a custom report configuration `yaml` file. The content for this configuration file is shown below and can be explained as follows. * `read_count_multiplier`: set to 1 to report the actual sequence number. * `read_count_prefix`: set to "" to remove the units at the end of the number of sequences. The default is "M" for milion. * Under `table_columns_name` change the `total_sequences` column heading to "# Sequences" or any name that reflects the fact the actual number of sequences is being reported.

read_count_multiplier: 1
read_count_prefix: ""
table_columns_name:
  FastQC:
    total_sequences: "# Sequences"

MultiQC Sequence Count

The Sequence Count plot shows the break down of unique and duplicate reads for each FASTQ file. Again, duplication suggests some sort of enrichment bias or in the case of RNA sequencing, it could be that the duplicates are biologically meaningful due to multiple copies of the same trancript being expressed.

MultiQC Mean Quality Score

The average quality score of the sequencing reads in FASTQ files along each base position is shown in the figure below. Regarding the boxes at the top of the QC plots, green means QC passed while orange and red indicate warning and failed, respectively.

Click on the green rectangle labeled with the number of files to filter out samples to see in the plot.

Upon clicking on the green rectangled labeled with the number of files, a tool box will appear and users can click on the "x" next to each file to deselect and remove it from the visualization.

MultiQC Quality Score Distribution

The quality score distributions of each of the FASTQ files are plotted in the next figure.

MultiQC Per Base Sequence Content

An interactive heatmap of the percent composition of each nucleotide base (A,T,C,G) along the bases (horizontal axis) for each of the FASTQ files (vertical axis) is presented next. Hover over a tile to see the corresponding numbers for a given sample. Click on any row in this heatmap to get the base composition plot for just that sample.

MultiQC GC Distribution

The GC distribution for each of the FASTQ files is shown in figure below.

MultiQC Per Base N Content

The percentage of unknown bases (or N) per FASTQ file is available as the next figure.

MultiQC Sequence Length Distribution

As shown below, all FASTQ files for the hcc13995 study have sequences that contain 151 bases.

MultiQC Duplication Level

The percentage of sequences with the duplication level shown on the horizontal axis.

MultiQC Adapter Contamination

The sequences in the hcc1395 data have adapter contamination. Mousing over the area where the adpater content starts to increase suggests Illumina Universal Adapter.

Note

FASTQC will issue a warning if any sequence is present in more than 5% of all reads and failure if any sequence is present in more than 10% of all reads. -- FASTQC manual

MultiQC Status Heatmap

Finally, a heatmap showing the per file QC modules that have passed, warn, or failed is provided.

Lesson 7: Introduction to Next Generation Sequencing (NGS) Data and Quality Control

Lesson 6 Review

Learning Objectives

Basic RNA Sequencing Analysis Workflow

Example Data

Sign onto Biowulf

Request an Interactive Session

Copy the Data to User's `/data` folder

FASTQ Files for `hcc1395_b4b`

Overview of a FASTQ File

Quality Assessment on the `hcc1395` FASTQ Files

Interpreting FASTQC Report

Report Navigator and Summary

Per Base Sequence Quality

Per Tile Sequence Quality

Per Sequence Quality Scores

Per Base Sequence Content

Per Sequence GC Content

Per Base N Content

Sequence Duplication Level

Overrepresented Sequences

Adapter Contamination

Conclusions from `hcc1395_normal_rep1_R1.fastq` Quality Assessment

Merging Multiple FASTQC Reports into One

MultiQC Navigation Panel and Summary Statisics

MultiQC Sequence Count

MultiQC Mean Quality Score

MultiQC Quality Score Distribution

MultiQC Per Base Sequence Content

MultiQC GC Distribution

MultiQC Per Base N Content

MultiQC Sequence Length Distribution

MultiQC Duplication Level

MultiQC Adapter Contamination

MultiQC Status Heatmap

View FASTQC and MultiQC HTML Reports

Untrimmed FASTQ Files

Lesson 7: Introduction to Next Generation Sequencing (NGS) Data and Quality Control

Lesson 6 Review

Learning Objectives

Basic RNA Sequencing Analysis Workflow

Example Data

Sign onto Biowulf

Request an Interactive Session

Copy the Data to User's /data folder

FASTQ Files for hcc1395_b4b

Overview of a FASTQ File

Quality Assessment on the hcc1395 FASTQ Files

Interpreting FASTQC Report

Report Navigator and Summary

Per Base Sequence Quality

Per Tile Sequence Quality

Per Sequence Quality Scores

Per Base Sequence Content

Per Sequence GC Content

Per Base N Content

Sequence Duplication Level

Overrepresented Sequences

Adapter Contamination

Conclusions from hcc1395_normal_rep1_R1.fastq Quality Assessment

Merging Multiple FASTQC Reports into One

MultiQC Navigation Panel and Summary Statisics

MultiQC Sequence Count

MultiQC Mean Quality Score

MultiQC Quality Score Distribution

MultiQC Per Base Sequence Content

MultiQC GC Distribution

MultiQC Per Base N Content

MultiQC Sequence Length Distribution

MultiQC Duplication Level

MultiQC Adapter Contamination

MultiQC Status Heatmap

View FASTQC and MultiQC HTML Reports

Untrimmed FASTQ Files

Copy the Data to User's `/data` folder

FASTQ Files for `hcc1395_b4b`

Quality Assessment on the `hcc1395` FASTQ Files

Conclusions from `hcc1395_normal_rep1_R1.fastq` Quality Assessment