12. Sequence Patterns

This page contains content directly from the Biostar Handbook by Istvan Albert.

Always remember to activate your bioinformatics environment.

conda activate bioinfo

What is a sequence pattern?

A sequence pattern is a sequence of bases described by certain rules. These rules can be as simple as a direct match (or partial match) of a sequence, like:

  • Find locations where ATGC matches with no more than one error

Patterns may have more complicated descriptions, such as:

  • All locations where ATA is followed by one or more GCs and no more than five Ts, ending with GTA

Patterns can be summarized probabilistically into so-called motifs, such as:

  • Locations where a GC is followed by an A 80% of the time, or by a T 20% of the time, and is then followed by another GC
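The "ATA followed by one or more GCs" rule can be written as an extended regular expression. This is a minimal sketch, assuming that "GCs" means repeats of the dinucleotide GC and that "no more than five Ts" means zero to five Ts. The toy sequence is made up for illustration.

```shell
# Hypothetical toy sequence, invented for illustration only.
# It contains one stretch that satisfies the rule and one with six Ts that does not.
seq="CCATAGCGCTTGTAAAATAGCTTTTTTGTACC"

# ATA, then one or more GC repeats, then zero to five Ts, then GTA.
pattern='ATA(GC)+T{0,5}GTA'

# -E enables extended regular expressions, -o prints only the matched text.
matches=$(printf '%s\n' "$seq" | grep -oE "$pattern")
printf '%s\n' "$matches"
```

Only the first stretch is reported. The second has six Ts before GTA, one more than the rule allows.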

Adapters can be detected using sequence patterns. Before getting some data, create a directory called "patterns" inside your biostar_class directory and change into it.

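mkdir patterns
cd patterns

Now let's get the data. The -X 10000 option limits the download to the first 10,000 spots, and --split-files writes each read of a pair to its own file.

fastq-dump -X 10000 SRR1553606 --split-files

Search the first read file for the sequence CTGTCTCTTATACA and show the first two matching lines.

cat SRR1553606_1.fastq | grep --color=always CTGTCTCTTATACA | head -2

The "--color=always" option is a great way to visualize patterns within sequence data, but don't use it when creating files for downstream analysis: it embeds terminal color codes (escape sequences) in the output, which can break later processing.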
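The effect of "--color=always" on saved output can be checked directly. The sketch below assumes GNU grep and uses a made-up read (not one from SRR1553606). It compares the length of the matching line with and without coloring.

```shell
# Hypothetical toy read, invented for illustration; not taken from SRR1553606.
read_seq="GATTACACTGTCTCTTATACAGGG"

# Same search twice: once plain, once with forced coloring.
plain=$(printf '%s\n' "$read_seq" | grep CTGTCTCTTATACA)
colored=$(printf '%s\n' "$read_seq" | grep --color=always CTGTCTCTTATACA)

# The colored line is longer because grep wraps the match in escape sequences.
echo "plain:   ${#plain} characters"
echo "colored: ${#colored} characters"
```

The extra characters are invisible in a terminal but are written to any file or pipe, so the colored line no longer equals the original sequence.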

There are several tools available to search genomic sequences for patterns.

  • grep - searches an input file for lines that contain a pattern; handles simple patterns and basic regular expressions
  • egrep - extended grep, equivalent to grep -E; handles extended regular expressions
  • dreg (EMBOSS tool) - searches one or more sequences with the supplied regular expression and writes a report file with the matches
  • fuzznuc (EMBOSS tool) - searches for a specified PROSITE-style pattern in nucleotide sequences

The EMBOSS commands used on this page take two options:

  • -pattern - the pattern to search for (for fuzznuc, the nucleotides may include ambiguity codes such as N)
  • -filter - read the first file from standard input and write the first file to standard output

Let's fetch some data from the NCBI nucleotide database in FASTA format.

efetch -id KU182908 -db nucleotide -format fasta > KU182908.fa

"grep" searches line by line, so it will miss any match that wraps across a line break in the FASTA file.

cat KU182908.fa | grep --color=always AAAAAA
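The line-wrapping problem is easy to reproduce on a small example. The sketch below uses a made-up FASTA record (not KU182908) in which a run of six As is split across two lines. One possible workaround, shown here as an assumption rather than the handbook's method, is to drop the header and join the sequence lines before searching.

```shell
# Hypothetical toy FASTA, invented for illustration:
# the run of six As is split across the two sequence lines.
fasta=$(printf '>toy\nCCGTAAA\nAAAGTCC\n')

# Line-by-line search: neither sequence line contains AAAAAA on its own.
# grep exits non-zero when nothing matches, hence the "|| true".
wrapped_hits=$(printf '%s\n' "$fasta" | grep -c AAAAAA || true)

# Drop the header line, join the sequence lines, then search again.
joined_hits=$(printf '%s\n' "$fasta" | grep -v '^>' | tr -d '\n' | grep -c AAAAAA || true)

echo "matching lines, wrapped: $wrapped_hits"
echo "matching lines, joined:  $joined_hits"
```

Joining works for a single record. A file with several records would need each sequence joined separately, so that no match is reported across a record boundary.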
"dreg" reads the whole sequence, so it finds these matches and reports their locations.

cat KU182908.fa | dreg -filter -pattern AAAAAA

To search for a pattern with ambiguous N bases, try "fuzznuc".

cat KU182908.fa | fuzznuc -filter -pattern 'AANAA'
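For comparison, the same ambiguous pattern can be approximated with a plain regular expression by expanding N into a character class. This is a rough sketch on a made-up sequence, not a replacement for fuzznuc: grep -o reports only non-overlapping matches and does not print their positions.

```shell
# Hypothetical toy sequence, invented for illustration only.
seq="CCAAGAAGGAATAACCAAAC"

# N stands for any base, so AANAA becomes AA[ACGT]AA as a regular expression.
hits=$(printf '%s\n' "$seq" | grep -oE 'AA[ACGT]AA')
printf '%s\n' "$hits"
```

Two matches are printed, AAGAA and AATAA. The trailing AAAC is too short to match.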