09. Quality trimming of sequence data copy

This page uses content directly from the Biostar Handbook by Istvan Albert.

Learn: using trimmomatic to remove low-quality bases from a sequence

Always remember to activate the bioinformatics environment.

conda activate bioinfo

We will be downloading some new sequence data. Create a new directory within your sequences directory and name it "trimming".

mkdir trimming
cd trimming

Download fastq sequences from NCBI/SRA.

fastq-dump --split-files -X 10000 SRR1553607
shows this output
2020-05-15T18:05:34 srapath.2.10.1 int: buffer insufficient while reading uri within cloud module - cannot Get Cloud Location
Read 10000 spots for SRR1553607
Written 10000 spots for SRR1553607
Run trimmomatic in single end mode (SE), set SLIDINGWINDOW:4:30 to use a sliding window of size 4 that will clip the read if the average phred score (quality score) is below 30.
trimmomatic SE SRR1553607_2.fastq trimmed_2.fq SLIDINGWINDOW:4:30
where "SRR1553607_2.fastq" is the name of the input file and "trimmed_2.fq" is the name of the output file.

SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5‟ end and clips the read once the average quality within the window falls below a threshold.

SLIDINGWINDOW:: * windowSize: specifies the number of bases to average across * requiredQuality: specifies the average quality required.

Prints the following to the screen.

TrimmomaticSE: Started with arguments:
 SRR1553607_2.fastq trimmed_2.fq SLIDINGWINDOW:4:30
Automatically using 4 threads
Quality encoding detected as phred33
Input Reads: 10000 Surviving: 9277 (92.77%) Dropped: 723 (7.23%)
TrimmomaticSE: Completed successfully
and generate fastqc reports on both datasets (before and after trimming). What differences do you see?
fastqc SRR1553607_2.fastq trimmed_2.fastq
shows this output to the screen
Started analysis of SRR1553607_2.fastq
Approx 10% complete for SRR1553607_2.fastq
Approx 20% complete for SRR1553607_2.fastq
Approx 30% complete for SRR1553607_2.fastq
Approx 40% complete for SRR1553607_2.fastq
Approx 50% complete for SRR1553607_2.fastq
Approx 60% complete for SRR1553607_2.fastq
Approx 70% complete for SRR1553607_2.fastq
Approx 80% complete for SRR1553607_2.fastq
Approx 90% complete for SRR1553607_2.fastq
Approx 100% complete for SRR1553607_2.fastq
Analysis complete for SRR1553607_2.fastq
Started analysis of trimmed_2.fq
Approx 10% complete for trimmed_2.fq
Approx 20% complete for trimmed_2.fq
Approx 30% complete for trimmed_2.fq
Approx 40% complete for trimmed_2.fq
Approx 50% complete for trimmed_2.fq
Approx 65% complete for trimmed_2.fq
Approx 75% complete for trimmed_2.fq
Approx 85% complete for trimmed_2.fq
Approx 95% complete for trimmed_2.fq
Analysis complete for trimmed_2.fq
Using your web browser, choose "File -> Open File" and navigate to the directory where you have your output files. Compare the plots before and after trimming.

SRR1553607_2.fastq (Before trimming)

SRR1553607_2.fastq (After trimming)

Another program to trim sequences is BBDuk (Decontamination using Kmers):

bbduk.sh in=SRR1553607_2.fastq out=bbduk.fq qtrim=r overwrite=true trimq=30

qtrim=r (right end only)

overwrite=true (ow) Grant permission to overwrite files.

Trim read ends to remove bases with quality below trimq.

Values: * rl (trim both ends), * f (neither end), * r (right end only), * l (left end only), * w (sliding window).

Giving this output to the screen

/Users/stonelakeak/miniconda3/envs/bioinfo/opt/bbmap-38.79-0//calcmem.sh: line 75: [: -v: unary operator expected
Max memory cannot be determined.  Attempting to use 1400 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, 
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
java -ea -Xmx1400m -Xms1400m -cp /Users/stonelakeak/miniconda3/envs/bioinfo/opt/bbmap-38.79-0/current/ jgi.BBDuk in=SRR1553607_2.fastq out=bbduk.fq qtrim=r overwrite=true qtrim=30
Executing jgi.BBDuk [in=SRR1553607_2.fastq, out=bbduk.fq, qtrim=r, overwrite=true, qtrim=30]
Version 38.79

0.033 seconds.
Initial:
Memory: max=1468m, total=1468m, free=1439m, used=29m

Input is being processed as unpaired
Started output streams: 0.043 seconds.
Processing time:        0.092 seconds.

Input:                      10000 reads         1010000 bases.
QTrimmed:                   4188 reads (41.88%)     181620 bases (17.98%)
Total Removed:              154 reads (1.54%)   181620 bases (17.98%)
Result:                     9846 reads (98.46%)     828380 bases (82.02%)

Time:                           0.136 seconds.
Reads Processed:       10000    73.37k reads/sec
Bases Processed:       1010k    7.41m bases/sec

Now we can run fastqc on the before and after BBDuk trimmed sequences.

fastqc SRR1553607_2.fastq bbduk.fq

SRR1553607_2.fastq -> bbduk.fq (trimmed with bbduk)


To trim paired-end sequences with trimmomatic:

trimmomatic PE SRR1553607_1.fastq SRR1553607_2.fastq  trimmed_1.fq unpaired_1.fq trimmed_2.fq unpaired_2.fq SLIDINGWINDOW:4:30
credit: Trimmomatic Manual: V0.32


Here's how to do the same with bbduk (see also: BBTools User Guide)

bbduk.sh in1=SRR1553607_1.fastq in2=SRR1553607_2.fastq outm1=bbduk_1.fq out1=unpaired_bb_1.fq outm2=bbduk_2.fq out2=unpaired_bb_2.fq qtrim=r overwrite=true qtrim=30