Lesson 8: Trimming
Lesson 7 Review
Lesson 7 introduced the FASTQ file, the format used to store Next Generation Sequencing (NGS) data. Participants also learned to assess the quality of the sequences in FASTQ files using the tool FASTQC. This step is essential because it shows whether the sequencing is of high quality and whether contaminants such as adapters are present. The QC results revealed that, while the sequencing data quality is good, there are adapters in the sequences. Adapters interfere with the mapping stage, where algorithms determine where in the genome each sequence came from. Thus, the next step is to trim away the adapters and run QC again on the trimmed data to confirm that the contamination has been removed.
Learning objectives
At the end of this session, participants will be able to use the tool Trimmomatic to remove adapters from the NGS data.
Sign onto Biowulf and Request an Interactive Session
Before getting started, sign onto Biowulf and request an interactive session. In the ssh command below, replace user with the participant's assigned Biowulf student ID.
Change into /data/user/hcc1395_b4b.
Next, request an interactive session with 6 CPUs, 12 GB of RAM, and 10 GB of local temporary storage. In the sinteractive command below, the option --cpus-per-task requests the 6 CPUs.
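The request described above might look like the sketch below. The flag spellings --mem and --gres=lscratch are the usual Biowulf options for memory and local scratch space, but treat the exact form as an assumption and check it against the course materials. The variable names are illustrative only, and the guard simply keeps the sketch from failing on a machine that is not Biowulf.

```shell
# Resources named in the text above; variable names are illustrative.
cpus=6        # CPUs, passed to --cpus-per-task
mem=12g       # RAM
lscratch=10   # local temporary storage, in GB

# sinteractive exists only on Biowulf, so check for it first.
if command -v sinteractive >/dev/null 2>&1; then
    sinteractive --cpus-per-task="$cpus" --mem="$mem" --gres=lscratch:"$lscratch"
else
    echo "sinteractive is not available here; run this on a Biowulf login node"
fi
```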
Adapter Trimming with Trimmomatic
Load Trimmomatic
Trimmomatic is a tool that can remove both low quality sequence and adapters. The first step to using Trimmomatic on Biowulf is to load it.
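A minimal sketch of the loading step, assuming the Biowulf module is named trimmomatic. Because the run command later in this lesson depends on the $TRIMMOJAR variable, the sketch also reports whether that variable is set after loading.

```shell
# The module command exists on systems such as Biowulf; skip it elsewhere.
# The module name "trimmomatic" is an assumption.
if command -v module >/dev/null 2>&1; then
    module load trimmomatic || echo "could not load the trimmomatic module"
fi

# The Trimmomatic command later in the lesson relies on $TRIMMOJAR.
if [ -n "${TRIMMOJAR:-}" ]; then
    status="TRIMMOJAR is set to $TRIMMOJAR"
else
    status="TRIMMOJAR is not set; load the module first"
fi
echo "$status"
```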
Tip
Other tools used for trimming include bbduk and Cutadapt. Both are capable of quality and adapter trimming.
Make a New Folder to Store the Trimmed Reads
Stay in the /data/user/hcc1395_b4b folder for these exercises.
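The folder name trimmed_reads is taken from the output paths in the Trimmomatic command later in this lesson. A minimal sketch of this step, assuming the current directory is /data/user/hcc1395_b4b:

```shell
# -p avoids an error if the folder already exists.
mkdir -p trimmed_reads

# Confirm the folder is there before writing output into it.
ls -d trimmed_reads
```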
Run Trimmomatic
To remove adapters from all FASTQ files in one go, the parallel command will be introduced. This command enables the analyst to run multiple tasks in parallel, such as trimming of high throughput sequencing data. The command construct is broken down below.
- cat is used to print the hcc1395 sample IDs stored in the file hcc1395_sample_ids.txt. Rather than printing to the terminal, | is used to send the output from cat to parallel.
- The Trimmomatic construct is enclosed in double quotes within the parallel command. The components are as follows.
  - -j: a parallel option, given before the quoted construct, that specifies how many jobs to run in parallel (6 in this case, since there are 6 samples).
  - java -jar $TRIMMOJAR: when running Trimmomatic on Biowulf, start with this. It runs the Trimmomatic Java archive at the path stored in the environment variable $TRIMMOJAR.
  - PE: tells Trimmomatic to expect paired-end sequencing.
  - -phred33: indicates the quality score encoding, which is the one used by modern Illumina sequencers.
  - The input FASTQ files are specified next. Because the input FASTQ files are in the reads folder, this must be included in the path. {} in the file path acts as a placeholder for the sample IDs sent by cat. All users have to do is add the _R1 and _R2 parts.
  - The outputs are specified next. These will be written to the trimmed_reads folder. Trimmed versions are labeled with "_trimmed". The files labeled "unpaired" store sequences where only one read of the pair met the trimming threshold.
  - The adapter file (see references/illumina_multiplex.fa) is specified after the ILLUMINACLIP argument, followed by parameters that help Trimmomatic decide whether a portion of a sequence matches an adapter and should be trimmed (here 2:30:5: seed mismatches, palindrome clip threshold, and simple clip threshold). See the Trimmomatic manual to learn more about options and parameters.
  - The MINLEN argument allows users to specify a sequence length threshold after trimming. If a sequence is shorter than this number after trimming, it is discarded. The threshold is set to 25 bases here. Short sequences may interfere with alignment because they could align to multiple locations in a genome.
cat hcc1395_sample_ids.txt | parallel -j 6 "java -jar $TRIMMOJAR PE -phred33 reads/{}_R1.fq reads/{}_R2.fq trimmed_reads/{}_trimmed_R1.fq trimmed_reads/{}_unpaired_R1.fq trimmed_reads/{}_trimmed_R2.fq trimmed_reads/{}_unpaired_R2.fq ILLUMINACLIP:references/illumina_multiplex.fa:2:30:5 MINLEN:25"
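To see how {} is filled in before launching any jobs, the per-sample commands can be printed rather than run. The sketch below does this with a plain shell loop, so it needs neither parallel nor Trimmomatic. The two sample IDs and the file example_sample_ids.txt are hypothetical stand-ins for the contents of hcc1395_sample_ids.txt.

```shell
# Hypothetical sample IDs, for illustration only.
printf '%s\n' sampleA sampleB > example_sample_ids.txt

# Build the command text for each ID; nothing is executed.
# The backslash keeps $TRIMMOJAR literal in the printed text.
preview=$(while read -r id; do
    echo "java -jar \$TRIMMOJAR PE -phred33" \
         "reads/${id}_R1.fq reads/${id}_R2.fq" \
         "trimmed_reads/${id}_trimmed_R1.fq trimmed_reads/${id}_unpaired_R1.fq" \
         "trimmed_reads/${id}_trimmed_R2.fq trimmed_reads/${id}_unpaired_R2.fq" \
         "ILLUMINACLIP:references/illumina_multiplex.fa:2:30:5 MINLEN:25"
done < example_sample_ids.txt)

echo "$preview"
```

Each printed line corresponds to one job that parallel would start. GNU Parallel also has a --dry-run option that prints the commands it would run, which gives a similar preview using the real sample ID file.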
Note
When running the parallel command, users will see the message below regarding citation and donation to the creators. To silence the citation notice, run parallel --citation once.
Academic tradition requires you to cite works you base your article on. If you use programs that use GNU Parallel to process data for an article in a scientific publication, please cite:
Tange, O. (2024, December 22). GNU Parallel 20241222 ('Bashar'). Zenodo. https://doi.org/10.5281/zenodo.14550073
This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
More about funding GNU Parallel and the citation notice: https://www.gnu.org/software/parallel/parallel_design.html#citation-notice
To silence this citation notice: run 'parallel --citation' once.
Come on: You have run parallel 166 times. Isn't it about time you run 'parallel --citation' once to silence the citation notice?
Run QC on Trimmed FASTQ Files
Make a directory pre_alignment_qc_trimmed to store the QC results for the adapter-trimmed FASTQ files.
Then load FASTQC and MultiQC.
In the fastqc construct below, specify the path to the trimmed FASTQ files, which are located in the folder trimmed_reads. Then use *trimmed*.fq to get FASTQC to check all of the trimmed FASTQ files (* is used as a wildcard). Write the results into the pre_alignment_qc_trimmed folder using the -o option.
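The three steps above (make the folder, load the tools, run FASTQC) can be sketched as follows. The module names fastqc and multiqc are assumptions, and the commands are assumed to run from /data/user/hcc1395_b4b. The guards only keep the sketch from failing on a machine without the module system or FASTQC.

```shell
# Folder for the QC reports of the trimmed reads.
mkdir -p pre_alignment_qc_trimmed

# Module names are assumptions; the module command exists on systems such as Biowulf.
if command -v module >/dev/null 2>&1; then
    module load fastqc multiqc || echo "could not load fastqc and multiqc"
fi

# *trimmed*.fq matches the paired outputs (names containing _trimmed_)
# but not the files labeled unpaired.
if command -v fastqc >/dev/null 2>&1; then
    fastqc -o pre_alignment_qc_trimmed trimmed_reads/*trimmed*.fq
else
    echo "fastqc not found; load it before running this step"
fi
```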
Next, combine the FASTQC reports for the trimmed data into one using MultiQC. In the multiqc command below, specify the path to the FASTQC reports (pre_alignment_qc_trimmed) and then use the --filename option to assign a base name to the report (i.e., hcc1395_multiqc, which will write the report to the file hcc1395_multiqc.html).
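A minimal sketch of the MultiQC step, assuming MultiQC has been loaded and the FASTQC reports are already in pre_alignment_qc_trimmed. The variable name report is illustrative only.

```shell
# Base name for the combined report; MultiQC adds the .html extension.
report=hcc1395_multiqc

# Run only if MultiQC is available and the FASTQC output folder exists.
if command -v multiqc >/dev/null 2>&1 && [ -d pre_alignment_qc_trimmed ]; then
    multiqc --filename "$report" pre_alignment_qc_trimmed
else
    echo "multiqc or the pre_alignment_qc_trimmed folder is missing"
fi

echo "Expected report: ${report}.html"
```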
Then use scp to copy hcc1395_multiqc.html to the local Downloads folder.
Adapter Trimming Conclusion
Adapter trimming did not influence the quality scores of the FASTQ files, but the sequences were cleaned of adapter contamination.