10. Removing adapters from sequence data

This page uses content directly from the Biostar Handbook by Istvan Albert.

Learn

  • What are sequence adapters?
  • Do we need to trim them before alignment?
  • How can I trim with a new adapter sequence?

Be sure to activate the bioinfo environment.

conda activate bioinfo

  • Sequencing adapters are short (30 bp or longer) DNA oligos attached to the 5' and 3' ends of each cDNA sequence.
  • Adapter sequence can appear in the reads, typically when the sequenced fragment is shorter than the read length, and is detected by FASTQC.

To view the adapters used by the FASTQC tool:

cd ~/miniconda3/envs/bioinfo/opt/fastqc-0.11.9/Configuration
ls -lh
less adapter_list.txt

You will see:

# lots of other text

# For the time being it's going to be easier to interpret this plot if all
# of the sequences provided are the same length, so we've gone with 12bp
# fragments for now.

Illumina Universal Adapter       AGATCGGAAGAG
Illumina Small RNA 3' Adapter    TGGAATTCTCGG
Illumina Small RNA 5' Adapter    GATCGTCGGACT
Nextera Transposase Sequence     CTGTCTCTTATA
SOLID Small RNA Adapter          CGCCTTGGCCGT
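
As a rough cross-check of what FASTQC reports, the 12 bp fragments listed above can be searched for directly in a FASTQ file. The sketch below is only an illustration, not how FASTQC itself works: it writes a tiny made-up file (demo.fastq, with hypothetical reads) and counts the reads that contain an exact copy of a fragment. The function name count_adapter is likewise made up for this example.

```shell
# Hypothetical demo data: three made-up reads, two of which contain
# the Nextera fragment CTGTCTCTTATA. This is not real sequencing data.
printf '%s\n' \
  '@read1' 'ACGTACGTCTGTCTCTTATACACA' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  '@read2' 'ACGTACGTACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  '@read3' 'TTCTGTCTCTTATAGGGGGGGGGG' '+' 'IIIIIIIIIIIIIIIIIIIIIIII' \
  > demo.fastq

# Count reads whose sequence line (line 2 of each 4-line record)
# contains the given fragment as an exact substring.
# Usage: count_adapter FILE FRAGMENT
count_adapter() {
  awk -v adapter="$2" \
    'NR % 4 == 2 && index($0, adapter) > 0 { n++ } END { print n + 0 }' "$1"
}

nextera_hits=$(count_adapter demo.fastq CTGTCTCTTATA)
universal_hits=$(count_adapter demo.fastq AGATCGGAAGAG)

echo "Nextera Transposase Sequence: $nextera_hits"
echo "Illumina Universal Adapter:   $universal_hits"
```

To try it on real data, replace demo.fastq with your own FASTQ file. Exact matching misses adapters that contain sequencing errors, so treat the counts as a lower bound.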

  • Adapter trimming can be done before or after quality trimming.

  • With well-characterized reference genomes (human, mouse), high-throughput aligners can automatically clip adapter sequences from the reads during the alignment phase.

  • When working with new genomes, adapter sequences can interfere with the alignment and should be removed.


Trimming with a new adapter

  • You can customize the adapter list by adding your own adapter sequences.
  • First make a new directory and get some sequences.
    cd biostar_class
    mkdir test
    cd test
    fastq-dump -X 1000 SRR1553606 --split-files
    
    Then run FASTQC on the downloaded reads:
    fastqc SRR1553606_1.fastq
    
    Open the SRR1553606_1_fastqc.html file in your web browser ("File -> Open File") and look at the "Adapter Content" section.

[Figure: FASTQC "Adapter Content" plot for SRR1553606_1.fastq]

Let's put a new adapter sequence into its own FASTA file. Open a new file in a text editor:

nano nextera_adapter.fa

Type in the following two lines, then save and exit:

>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
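
If you prefer not to use an interactive editor, the same file can be written straight from the shell. This is a minimal sketch using a here-document; the file name and sequence are the ones used on this page, and the sanity check at the end is an optional extra.

```shell
# Write the two-line FASTA file without opening an editor.
cat > nextera_adapter.fa <<'EOF'
>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
EOF

# Sanity check: count the header lines and the total sequence length.
headers=$(grep -c '^>' nextera_adapter.fa)
seq_len=$(awk '!/^>/ { n += length($0) } END { print n + 0 }' nextera_adapter.fa)

echo "headers: $headers, sequence length: $seq_len bp"
```

Writing the file this way makes the step easy to repeat in a script.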

Now trim the reads with Trimmomatic in single-end (SE) mode, passing the new file to ILLUMINACLIP:

trimmomatic SE SRR1553606_1.fastq output.fq ILLUMINACLIP:nextera_adapter.fa:2:30:5

This prints the following output to the screen:

TrimmomaticSE: Started with arguments:
 SRR1553606_1.fastq output.fq ILLUMINACLIP:nextera_adapter.fa:2:30:5
Automatically using 4 threads
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 1000 Surviving: 833 (83.30%) Dropped: 167 (16.70%)
TrimmomaticSE: Completed successfully
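
The summary line can be checked by hand: 833 of the 1000 input reads survived, so 1000 - 833 = 167 were dropped. The sketch below pulls the counts out of that line with awk and recomputes the percentages. The line is copied from the output above, and the field positions are an assumption about this exact log format.

```shell
summary='Input Reads: 1000 Surviving: 833 (83.30%) Dropped: 167 (16.70%)'

# Split on whitespace: field 3 = input reads, field 5 = surviving, field 8 = dropped.
input=$(printf '%s\n' "$summary" | awk '{ print $3 }')
surviving=$(printf '%s\n' "$summary" | awk '{ print $5 }')
dropped=$(printf '%s\n' "$summary" | awk '{ print $8 }')

# Recompute the percentages from the raw counts.
pct_surviving=$(awk -v n="$surviving" -v t="$input" 'BEGIN { printf "%.2f", 100 * n / t }')
pct_dropped=$(awk -v n="$dropped" -v t="$input" 'BEGIN { printf "%.2f", 100 * n / t }')

echo "surviving: $surviving of $input ($pct_surviving%)"
echo "dropped:   $dropped of $input ($pct_dropped%)"
```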

From the Trimmomatic manual:

  • ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.

ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

  • fastaWithAdaptersEtc: specifies the path to a FASTA file containing all the adapters, PCR sequences, etc.
  • seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed.
  • palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
  • simpleClipThreshold: specifies how accurate the match between any adapter sequence and a read must be.
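
To see how these positions map onto the command used earlier, the ILLUMINACLIP step can be split on its colons. This is only an illustrative sketch: the variable names are made up here, and the splitting assumes the FASTA path itself contains no colon.

```shell
step='ILLUMINACLIP:nextera_adapter.fa:2:30:5'

# Split the step on ':' into its name and four positional parameters.
name=$(printf '%s\n' "$step" | cut -d: -f1)
fasta=$(printf '%s\n' "$step" | cut -d: -f2)
seed_mismatches=$(printf '%s\n' "$step" | cut -d: -f3)
palindrome_clip_threshold=$(printf '%s\n' "$step" | cut -d: -f4)
simple_clip_threshold=$(printf '%s\n' "$step" | cut -d: -f5)

echo "step:                    $name"
echo "fastaWithAdaptersEtc:    $fasta"
echo "seedMismatches:          $seed_mismatches"
echo "palindromeClipThreshold: $palindrome_clip_threshold"
echo "simpleClipThreshold:     $simple_clip_threshold"
```

In single-end mode there is no second read to pair with, so the palindrome threshold plays no part in the run shown on this page. This agrees with the "Using 0 prefix pairs" line in the Trimmomatic output.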

Now run FASTQC on the trimmed reads and check the "Adapter Content" section again:

fastqc output.fq

[Figure: FASTQC "Adapter Content" plot for output.fq]