12. Sequence Patterns
This page contains content directly from the Biostar Handbook by Istvan Albert.
Always remember to activate your bioinformatics environment.
conda activate bioinfo
What is a sequence pattern? A sequence pattern is a sequence of bases described by certain rules. These rules can be as simple as a direct match (or partial match) of a sequence, like:
- Find locations where ATGC matches with no more than one error
Patterns may have more complicated descriptions, such as:
- All locations where ATA is followed by one or more GCs and no more than five Ts, ending with GTA
Patterns can be summarized probabilistically into so-called motifs, such as:
- Locations where a GC is followed by an A 80% of the time, or by a T 20% of the time, and is then followed by another GC
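The three descriptions above can be approximated with extended regular expressions. The sketch below is only an illustration: the toy sequences are made up, "one error" is assumed to mean a single substitution (no insertions or deletions), and a regular expression can list the bases a motif allows but cannot express the 80%/20% weighting.

```shell
# Toy sequences made up for this sketch; they do not come from real data.
seq1="TTATGCAAATCCGG"
seq2="CCATAGCGCTTGTACC"
seq3="TTGCAGCTTGCTGCAA"

# ATGC with at most one substitution: each alternative frees one position.
one_error='.TGC|A.GC|AT.C|ATG.'
hits1=$(echo "$seq1" | grep -o -E "$one_error")

# ATA, then one or more GC, then zero to five T, then GTA.
complex='ATA(GC)+T{0,5}GTA'
hits2=$(echo "$seq2" | grep -o -E "$complex")

# GC, then A or T, then GC. The 80%/20% weighting is not expressible here.
motif='GC[AT]GC'
hits3=$(echo "$seq3" | grep -o -E "$motif")

echo "$hits1"
echo "$hits2"
echo "$hits3"
```

The -o flag makes grep print only the matched part of each line, one match per line, which makes it easier to see what each pattern actually captured.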
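Adapters can be detected using sequence patterns. Let's get some data. But first, create a directory called "patterns" inside your biostar_class directory and move into it. The commands below then download the first 10,000 reads of run SRR1553606 and highlight the first two lines that contain the sequence CTGTCTCTTATACA.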
mkdir patterns
cd patterns
fastq-dump -X 10000 SRR1553606 --split-files
cat SRR1553606_1.fastq | grep --color=always CTGTCTCTTATACA | head -2
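To get a feel for how common the adapter is, one could count matching reads instead of displaying them. The sketch below uses a tiny made-up FASTQ file as a stand-in for SRR1553606_1.fastq so that it runs without a download; it assumes four lines per read and counts matching lines, not individual occurrences.

```shell
# Tiny made-up FASTQ standing in for SRR1553606_1.fastq (two reads).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@read1
ACGTCTGTCTCTTATACACATCT
+
IIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTACG
+
IIIIIIIIIIIIIIIIIIIIIII
EOF

# grep -c reports the number of lines that contain the pattern.
with_adapter=$(grep -c CTGTCTCTTATACA "$tmp")

# A FASTQ record spans four lines, so reads = lines / 4.
total=$(( $(wc -l < "$tmp") / 4 ))

echo "$with_adapter of $total reads contain the adapter"
rm -f "$tmp"
```

On the real data, the same two commands could be pointed at SRR1553606_1.fastq in place of the temporary file.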
There are several tools available to search genomic sequences for patterns.
- grep - searches for lines in an input file that match a pattern; handles simple patterns and basic regular expressions
- egrep - extended grep (equivalent to grep -E); handles extended regular expressions
- dreg (EMBOSS tool) - searches one or more sequences with the supplied regular expression and writes a report file with the matches
- fuzznuc (EMBOSS tool) - searches for a specified PROSITE-style pattern in nucleotide sequences
  -pattern (the nucleotides being searched for, including ambiguity codes)
  -filter (read the first file from standard input, write the first file to standard output)
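The practical difference between basic and extended regular expressions shows up with repetition counts. As a small sketch on a made-up sequence, the same pattern gives different results with and without -E:

```shell
# Toy sequence made up for this sketch.
seq="CCAAAGGAAAAACC"

# Basic grep treats the braces literally, so nothing matches.
# "|| true" keeps the script going when grep finds no match.
basic=$(echo "$seq" | grep -c 'A{3}' || true)

# Extended syntax reads {3} as "exactly three repeats".
extended=$(echo "$seq" | grep -c -E 'A{3}' || true)

# -o prints each match; {3,} means three or more A in a row.
runs=$(echo "$seq" | grep -o -E 'A{3,}')

echo "basic=$basic extended=$extended"
echo "$runs"
```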
Let's fetch some data from the NCBI nucleotide database in FASTA format, then search it with grep, dreg, and fuzznuc.
efetch -id KU182908 -db nucleotide -format fasta > KU182908.fa
cat KU182908.fa | grep --color=always AAAAAA
cat KU182908.fa | dreg -filter -pattern AAAAAA
cat KU182908.fa | fuzznuc -filter -pattern 'AANAA'
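One caveat when comparing the results: grep works line by line, so a match that spans a line wrap in the FASTA file goes unreported, whereas sequence-aware tools such as dreg and fuzznuc search the sequence itself. The sketch below shows the effect on a made-up FASTA file, and uses AA[ACGT]AA as a rough regular-expression stand-in for the fuzznuc pattern AANAA (assuming N means any of the four bases).

```shell
# Toy FASTA made up for this sketch: the run of six A is split by a line wrap.
tmp=$(mktemp)
printf '>toy\nGGCAAA\nAAATCC\n' > "$tmp"

# Line by line, neither sequence line holds AAAAAA on its own.
per_line=$(grep -v '^>' "$tmp" | grep -c AAAAAA || true)

# Dropping the header and the newlines restores the contiguous sequence.
joined=$(grep -v '^>' "$tmp" | tr -d '\n' | grep -c AAAAAA || true)

# Rough regex stand-in for the fuzznuc pattern AANAA, where N is any base.
n_hits=$(grep -v '^>' "$tmp" | tr -d '\n' | grep -o -E 'AA[ACGT]AA')

echo "per_line=$per_line joined=$joined"
echo "$n_hits"
rm -f "$tmp"
```

Note that grep -o reports non-overlapping matches only, so its hit counts may differ from what a dedicated pattern-search tool reports.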