Lesson 8: Cleaning and Preparing Next Generation Sequencing (NGS) Data for Downstream Analysis

Lesson 7 Review

Lesson 7 introduced the FASTQ file, the format used to store Next Generation Sequencing (NGS) data. Participants also learned to assess the quality of the sequences in FASTQ files using the tool FASTQC. This step is essential because it shows whether the sequencing is of high quality and whether contaminants such as adapters are present in the sequences. The QC results revealed that while the sequencing data quality is good, there are adapters in the sequences. Adapters interfere with the mapping stage, where algorithms determine where in the genome each sequence came from. Thus, the next step is to trim away the adapters and run QC again on the trimmed data to confirm that the contamination has been removed.

Learning objectives

At the end of this session, participants will be able to use the tool Trimmomatic to remove adapters from the NGS data.

Sign onto Biowulf and Request an Interactive Session

Before getting started, sign onto Biowulf and request an interactive session. In the ssh command below, replace user with the participant's assigned Biowulf student ID.

ssh user@biowulf.nih.gov

Change into /data/user/hcc1395_b4b.

cd /data/user/hcc1395_b4b

Next, request an interactive session with 6 CPUs, 12 GB of RAM, and 10 GB of local temporary storage. In the sinteractive command below, --cpus-per-task requests the CPUs, --mem requests the memory, and --gres=lscratch:10 requests the local temporary storage.

sinteractive --cpus-per-task 6 --mem=12gb --gres=lscratch:10
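Once the session starts, it can be reassuring to confirm what was allocated before launching parallel jobs. The lines below are a minimal sketch, assuming Slurm exports SLURM_CPUS_PER_TASK (when --cpus-per-task is requested) and SLURM_JOB_ID inside the session. The function name show_allocation is hypothetical and not part of Biowulf.

```shell
# Hypothetical helper: report what Slurm allocated to this session.
# Assumes SLURM_CPUS_PER_TASK and SLURM_JOB_ID are exported by Slurm;
# outside a Slurm session they are absent and a fallback is printed.
show_allocation() {
    echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-not set}"
    echo "Job ID: ${SLURM_JOB_ID:-not set}"
}

show_allocation
```

If the CPU count printed here is lower than the number of jobs planned for parallel later in the lesson, the jobs would have to share CPUs.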

Adapter Trimming with Trimmomatic

Load Trimmomatic

Trimmomatic is a tool that can remove both low quality sequence and adapters. The first step to using Trimmomatic on Biowulf is to load it.

module load trimmomatic

Tip

Other tools used for trimming include bbduk and Cutadapt. Both are capable of quality and adapter trimming.

Make a New Folder to Store the Trimmed Reads

mkdir trimmed_reads

Stay in the /data/user/hcc1395_b4b folder for these exercises.

Run Trimmomatic

To remove adapters from all FASTQ files in one go, the parallel command will be introduced. This command enables the analyst to run multiple tasks in parallel, such as trimming high throughput sequencing data. The command construct is broken down below.

- cat is used to print the hcc1395 sample ids stored in the file hcc1395_sample_ids.txt. Rather than printing to the terminal, | is used to send the output from cat to parallel.
- -j: an option of parallel that lets users specify how many jobs to run at once (6 in this case since there are 6 samples).
- The Trimmomatic construct is enclosed in double quotes within the parallel command. The components are as follows.
  - java -jar $TRIMMOJAR: when running Trimmomatic on Biowulf, start with this. It runs the Trimmomatic Java archive (.jar file) referenced by the environment variable $TRIMMOJAR.
  - PE: this next piece tells Trimmomatic to expect paired end sequencing.
  - -phred33: this indicates the quality score encoding (Phred+33), which is used by modern Illumina sequencers.
  - The input FASTQ files are specified next. Because the input FASTQ files are in the reads folder, this must be included in the path. {} in the file path acts as a placeholder that parallel replaces with each sample id sent by cat. All users have to do is add the _R1 and _R2 part.
  - The outputs are specified next. These will be written to the trimmed_reads folder. Trimmed versions will be labeled with "_trimmed". The files labeled "unpaired" will store reads for which only one read of the pair passed trimming.
  - The adapter file (see references/illumina_multiplex.fa) is specified after the ILLUMINACLIP argument, followed by three parameters (here 2:30:5: the seed mismatches, the palindrome clip threshold, and the simple clip threshold) that help Trimmomatic decide whether a portion of a sequence matches an adapter and should be trimmed. See the Trimmomatic manual to learn more about options and parameters.
  - The MINLEN argument allows users to specify a sequence length threshold after trimming. If a sequence is shorter than this number after trimming, it will be discarded. The threshold is set to 25 bases here. Short sequences may interfere with alignment because they can align to multiple places in a genome.

cat hcc1395_sample_ids.txt | parallel -j 6 "java -jar $TRIMMOJAR PE -phred33 reads/{}_R1.fq reads/{}_R2.fq trimmed_reads/{}_trimmed_R1.fq trimmed_reads/{}_unpaired_R1.fq trimmed_reads/{}_trimmed_R2.fq trimmed_reads/{}_unpaired_R2.fq ILLUMINACLIP:references/illumina_multiplex.fa:2:30:5 MINLEN:25"
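To see how {} is filled in for each sample id before committing CPU time, the substitution can be mimicked with a plain loop that only prints the commands. This is a minimal sketch: the sample ids sample_A and sample_B are hypothetical placeholders (the real ones come from hcc1395_sample_ids.txt), and the function name print_trim_cmd is invented for illustration. Nothing is executed, so neither Trimmomatic nor parallel is needed to try it.

```shell
# Print (do not run) the Trimmomatic command for one sample id,
# mirroring how parallel replaces each {} with the id it receives.
print_trim_cmd() {
    local id="$1"
    echo "java -jar \$TRIMMOJAR PE -phred33" \
         "reads/${id}_R1.fq reads/${id}_R2.fq" \
         "trimmed_reads/${id}_trimmed_R1.fq trimmed_reads/${id}_unpaired_R1.fq" \
         "trimmed_reads/${id}_trimmed_R2.fq trimmed_reads/${id}_unpaired_R2.fq" \
         "ILLUMINACLIP:references/illumina_multiplex.fa:2:30:5 MINLEN:25"
}

# Hypothetical sample ids, for illustration only.
for id in sample_A sample_B; do
    print_trim_cmd "$id"
done
```

GNU Parallel also offers a --dry-run option that prints the commands it would run without executing them, which gives the same kind of preview using the real sample id file.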

Note

When running the parallel command, users will see the message below regarding citation and donation to the creators. To silence this notice, run parallel --citation once.

Academic tradition requires you to cite works you base your article on. If you use programs that use GNU Parallel to process data for an article in a scientific publication, please cite:

Tange, O. (2024, December 22). GNU Parallel 20241222 ('Bashar'). Zenodo. https://doi.org/10.5281/zenodo.14550073

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice: https://www.gnu.org/software/parallel/parallel_design.html#citation-notice

To silence this citation notice: run 'parallel --citation' once.

Come on: You have run parallel 166 times. Isn't it about time you run 'parallel --citation' once to silence the citation notice?
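After trimming finishes, the effect of MINLEN:25 can be spot-checked from the command line as a quick complement to the FASTQC report. The sketch below assumes the standard four-line FASTQ record layout (sequence on the second line of each record). The function name fastq_length_summary and the tiny demonstration file are hypothetical and exist only so the example runs on its own.

```shell
# Hypothetical helper: report the read count and the shortest read
# in a FASTQ file. Assumes four lines per record, with the sequence
# on the second line of each record.
fastq_length_summary() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
         }
         END { printf "reads=%d shortest=%d\n", n, min }' "$1"
}

# Tiny made-up FASTQ file (two reads) so the sketch is self-contained.
demo_fq=$(mktemp)
printf '@r1\nACGTACGTACGTACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' > "$demo_fq"
printf '@r2\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n' >> "$demo_fq"
fastq_length_summary "$demo_fq"
rm "$demo_fq"
```

Pointing the function at one of the files in trimmed_reads should report a shortest read of at least 25 bases if MINLEN behaved as described.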

Run QC on Trimmed FASTQ Files

Make a directory pre_alignment_qc_trimmed to store the QC results for the adapter trimmed FASTQ files.

mkdir pre_alignment_qc_trimmed

Then load FASTQC and MultiQC.

module load fastqc

module load multiqc

In the fastqc construct below, specify the path to the trimmed FASTQ files, which are located in the folder trimmed_reads. Then use *trimmed*.fq to get FASTQC to check all of the trimmed FASTQ files (* is used as a wildcard). Write the results into the pre_alignment_qc_trimmed folder using the -o option.

fastqc trimmed_reads/*trimmed*.fq -o pre_alignment_qc_trimmed/
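To illustrate which files the wildcard selects, the sketch below builds a throwaway folder containing empty stand-in files named like the Trimmomatic outputs and lists what *trimmed*.fq matches. The sample id sample_A and the variable names are hypothetical. No real data is touched.

```shell
# Build a throwaway folder with empty stand-in files named like the
# Trimmomatic outputs (sample_A is a hypothetical sample id).
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_A_trimmed_R1.fq" "$demo_dir/sample_A_trimmed_R2.fq" \
      "$demo_dir/sample_A_unpaired_R1.fq" "$demo_dir/sample_A_unpaired_R2.fq"

# List what the wildcard pattern matches. Only the paired, trimmed
# files are selected because the "unpaired" file names lack "trimmed".
matched=$(cd "$demo_dir" && ls *trimmed*.fq)
echo "$matched"

rm -r "$demo_dir"
```

Running ls trimmed_reads/*trimmed*.fq in the project folder gives the same kind of preview on the real files before FASTQC is started.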

Next, combine the FASTQC reports for the trimmed data into one using MultiQC. In the multiqc command below, specify the path to the FASTQC reports (pre_alignment_qc_trimmed) and then use the --filename option to assign a base name to the report (i.e., hcc1395_multiqc, which will write the report to the file hcc1395_multiqc.html).

multiqc pre_alignment_qc_trimmed/ --filename hcc1395_multiqc

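Then use scp to copy hcc1395_multiqc.html to the local Downloads folder. Run the command below from a terminal on the local computer after changing into the Downloads folder. The . at the end tells scp to place the file in the current folder.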

scp user@helix.nih.gov:/data/user/hcc1395_b4b/hcc1395_multiqc.html .

Adapter Trimming Conclusion

Adapter trimming did not influence the quality scores of the FASTQ files, but the sequences were cleaned of adapter contamination.