10. Removing adapters from sequence data
This page uses content directly from the Biostar Handbook by Istvan Albert.
Learn
- What are sequence adapters?
- Do we need to trim them before alignment?
- How can I trim with a new adapter sequence?
Be sure to activate the bioinfo environment.
conda activate bioinfo
- Sequencing adapters are short (30bp+) DNA oligos attached to the 5' and 3' ends of each cDNA sequence.
- Adapters can appear in the sequence data, where they can be detected by FASTQC.
To view the adapters used by the FASTQC tool:
cd ~/miniconda3/envs/bioinfo/opt/fastqc-0.11.9/Configuration
ls -lh
less adapter_list.txt
# lots of other text
# For the time being it's going to be easier to interpret this plot if all
# of the sequences provided are the same length, so we've gone with 12bp
# fragments for now.
Illumina Universal Adapter AGATCGGAAGAG
Illumina Small RNA 3' Adapter TGGAATTCTCGG
Illumina Small RNA 5' Adapter GATCGTCGGACT
Nextera Transposase Sequence CTGTCTCTTATA
SOLID Small RNA Adapter CGCCTTGGCCGT
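To get a rough feel for what these 12bp fragments are used for, the sketch below counts the reads in a FASTQ file that contain one of them. It is only an illustration: demo.fastq and its three reads are made up for this example, and a plain exact-match count is a simplification, not a reproduction of the FASTQC report.

```shell
# Hypothetical toy FASTQ file with three made-up reads (4 lines per record).
cat > demo.fastq <<'EOF'
@read1
ACGTACGTAGATCGGAAGAGCACA
+
IIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIII
@read3
TTTTAGATCGGAAGAGTTTTACGT
+
IIIIIIIIIIIIIIIIIIIIIIII
EOF

# 12bp fragment of the Illumina Universal Adapter from adapter_list.txt.
adapter=AGATCGGAAGAG

# Line 2 of every 4-line record is the sequence; count exact matches only.
hits=$(awk -v a="$adapter" 'NR % 4 == 2 && index($0, a) > 0 { n++ } END { print n + 0 }' demo.fastq)
total=$(awk 'END { print NR / 4 }' demo.fastq)

echo "$hits of $total reads contain $adapter"
# prints: 2 of 3 reads contain AGATCGGAAGAG
```

The same two awk commands can be pointed at a real FASTQ file, as long as every record occupies exactly four lines.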
- Adapter trimming can be done before or after quality trimming.
- With genomes that are known (human, mouse) high-throughput aligner programs automatically trim off adapter sequences during the alignment phase.
- When working with new genomes, adapter sequences can interfere with the alignment and should be removed.
Trimming with a new adapter
- You can customize the adapter list by adding your own adapter sequences.
- First make a new directory, get some sequences, and run FASTQC.
cd biostar_class
mkdir test
cd test
fastq-dump -X 1000 SRR1553606 --split-files
fastqc SRR1553606_1.fastq
Open the SRR1553606_1_fastqc.html file in your web browser using "File -> Open File".
In the report, look at the Adapter Content section.
Let's add a new adapter sequence to the list. Create a FASTA file with nano and type in the header and sequence shown below the command.
nano nextera_adapter.fa
>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
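If you would rather not type the file in an editor, the same two lines can be written from the command line. This is only a convenience sketch: the file name and sequence are the ones used above, and the grep check at the end is an added sanity check, not part of the original steps.

```shell
# Write the FASTA header and the adapter sequence, one per line.
printf '>nextera\nCTGTCTCTTATACACATCTCCGAGCCCACGAGAC\n' > nextera_adapter.fa

# Sanity check: count the FASTA records (header lines start with ">").
records=$(grep -c '^>' nextera_adapter.fa)
echo "nextera_adapter.fa contains $records record(s)"
```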
Then run trimmomatic in single-end (SE) mode with this adapter file:
trimmomatic SE SRR1553606_1.fastq output.fq ILLUMINACLIP:nextera_adapter.fa:2:30:5
This prints the following output to the screen:
TrimmomaticSE: Started with arguments:
SRR1553606_1.fastq output.fq ILLUMINACLIP:nextera_adapter.fa:2:30:5
Automatically using 4 threads
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
ILLUMINACLIP: Using 0 prefix pairs, 1 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 1000 Surviving: 833 (83.30%) Dropped: 167 (16.70%)
TrimmomaticSE: Completed successfully
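The percentages in the summary line follow directly from the read counts. As a small worked check, the sketch below recomputes them from the two numbers reported above; the variable names are chosen for this illustration and trimmomatic itself is not run.

```shell
# Read counts copied from the Trimmomatic summary line above.
input=1000
surviving=833
dropped=$((input - surviving))

# awk does the floating-point division; the shell's $(( )) is integer-only.
surviving_pct=$(awk -v s="$surviving" -v n="$input" 'BEGIN { printf "%.2f", 100 * s / n }')
dropped_pct=$(awk -v d="$dropped" -v n="$input" 'BEGIN { printf "%.2f", 100 * d / n }')

echo "Surviving: $surviving ($surviving_pct%) Dropped: $dropped ($dropped_pct%)"
# prints: Surviving: 833 (83.30%) Dropped: 167 (16.70%)
```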
- ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
- fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc.
- seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed.
- palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
- simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.
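To see how the values in the trimmomatic command used earlier line up with these fields, the sketch below splits the ILLUMINACLIP step on its colons. The shell variable names are labels chosen for this illustration; nothing here calls trimmomatic.

```shell
# The ILLUMINACLIP step exactly as passed to trimmomatic earlier on this page.
step="ILLUMINACLIP:nextera_adapter.fa:2:30:5"

# Split on ":" and give each field an illustrative label.
fasta=$(echo "$step" | cut -d: -f2)
seed_mismatches=$(echo "$step" | cut -d: -f3)
palindrome_threshold=$(echo "$step" | cut -d: -f4)
simple_threshold=$(echo "$step" | cut -d: -f5)

echo "fastaWithAdaptersEtc:     $fasta"
echo "seed mismatches:          $seed_mismatches"
echo "palindrome clip threshold: $palindrome_threshold"
echo "simple clip threshold:    $simple_threshold"
```

So in that run the adapter file is nextera_adapter.fa, up to 2 seed mismatches are allowed, and the palindrome and simple clip thresholds are 30 and 5.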
Now run FASTQC on the trimmed reads:
fastqc output.fq
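Besides the FASTQC report, a quick way to check a trimmed file is to count its reads. The helper below is a minimal sketch that assumes every FASTQ record occupies exactly four lines; the function name count_reads and the toy file demo_trimmed.fq are made up for this example.

```shell
# Count FASTQ records, assuming exactly 4 lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical toy file with two made-up reads of different lengths.
cat > demo_trimmed.fq <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTAC
+
IIIIII
EOF

count_reads demo_trimmed.fq
# prints: 2
```

On the files from this page, count_reads SRR1553606_1.fastq and count_reads output.fq should agree with the Input Reads and Surviving numbers printed by Trimmomatic.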
