13. Regular expressions copy

This page contains content taken directly from the Biostar Handbook by Istvan Albert.

Activate the bioinformatics environment.

conda activate bioinfo

First let's make a place to store today's work. In your biostar_class directory, create a new directory called "june".

cd biostar_class
mkdir june
cd june

Now we'll download the data from the SRA:

fastq-dump --split-files SRR519926

What are regular expressions?

* also called regex or regexp
* a sequence of characters that defines a search pattern
* can be used to find or replace text in strings of characters
* in the commands below, the pattern is wrapped in quotes so that the shell passes it to the search tool unchanged
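As a minimal sketch of "find" versus "replace", the block below runs one pattern through grep (find) and through sed (replace). The sequence is a made-up toy string, and sed is an assumption here since this page itself only uses grep and egrep.

```bash
# Toy sequence, made up for illustration (not from SRR519926).
seq="ATGCCATGA"

# Find: grep -c counts the lines that contain a match for the pattern.
found=$(printf '%s\n' "$seq" | grep -c -E 'ATG')

# Replace: sed -E substitutes the pattern; the trailing g replaces every match.
masked=$(printf '%s\n' "$seq" | sed -E 's/ATG/---/g')

echo "$found"    # 1
echo "$masked"   # ---CC---A
```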

Here is a regular expressions cheat sheet.

grep vs. egrep

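grep = Global Regular Expression Print; by default it interprets the pattern as a basic regular expression
egrep = extended grep, equivalent to "grep -E"; it interprets the pattern as an extended regular expression, so +, |, () and {} work as metacharacters without backslashes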
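The difference between the two can be seen on a few toy strings. The sketch below uses made-up sequences and `grep -E` (the same as egrep) to show that "+" is an ordinary character for plain grep but a quantifier in extended mode.

```bash
# Toy lines, made up for illustration.
seqs=$(printf '%s\n' 'TAATA' 'TATA' 'TA+TA')

# Basic regular expression: + is a literal plus sign.
basic=$(printf '%s\n' "$seqs" | grep 'TA+TA')

# Extended regular expression: + means "one or more of the preceding A".
extended=$(printf '%s\n' "$seqs" | grep -E 'TA+TA')

echo "$basic"      # TA+TA
echo "$extended"   # TAATA and TATA, one per line
```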

Find an ATG anchored at the start of the line. Use the up caret "^" symbol at the start of the pattern you are looking for.

cat SRR519926_1.fastq | egrep "^ATG" --color=always | head

Find an ATG anchored at the end of the line. Use the dollar sign "$" at the end of the pattern you are looking for.

cat SRR519926_1.fastq | egrep "ATG$" --color=always | head
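Anchors are easiest to check on a tiny input. The sketch below builds a two-read toy FASTQ (made up, not from SRR519926) and, as an extra assumption, keeps only the sequence lines with awk before matching, relying on the usual four-lines-per-record FASTQ layout. Without that step the pattern is also tested against header and quality lines.

```bash
# Toy FASTQ with two records, four lines each (made up for illustration).
fastq=$(printf '%s\n' '@read1' 'ATGCCA' '+' 'IIIIII' '@read2' 'CCGATG' '+' 'IIIIII')

# Line 2 of every 4-line record is the sequence.
seqs=$(printf '%s\n' "$fastq" | awk 'NR % 4 == 2')

# ^ anchors the pattern to the start of the line, $ to the end.
starts=$(printf '%s\n' "$seqs" | grep -E '^ATG')
ends=$(printf '%s\n' "$seqs" | grep -E 'ATG$')

echo "$starts"   # ATGCCA
echo "$ends"     # CCGATG
```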

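Find TAATA or TATTA patterns. Square brackets "[]" define a set of characters, any one of which may match at that position. Do not separate the characters with commas: a comma inside the brackets is itself treated as a character to match.

cat SRR519926_1.fastq | egrep "TA[AT]TA" --color=always | head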
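To see exactly which characters a bracket expression accepts, it can help to try it on a few made-up strings. The sketch below uses toy sequences and `grep -E` (equivalent to egrep) to compare the set [AT] with [A,T]; the second form also accepts a literal comma.

```bash
# Toy lines, made up for illustration; one of them contains a comma.
seqs=$(printf '%s\n' 'TAATA' 'TATTA' 'TA,TA' 'TACTA')

# [AT] matches exactly one character: A or T.
class_hits=$(printf '%s\n' "$seqs" | grep -c -E 'TA[AT]TA')

# [A,T] matches A, T, or a comma, so TA,TA is counted too.
comma_hits=$(printf '%s\n' "$seqs" | grep -c -E 'TA[A,T]TA')

echo "$class_hits"   # 2
echo "$comma_hits"   # 3
```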

Find TAAATA or TACCTA. Here the alternatives are whole words rather than single characters. The pipe symbol "|" says find "either this or that". The parentheses "()" mark a group and limit how far the alternation extends.

cat SRR519926_1.fastq | egrep "TA(AA|CC)TA" --color=always | head
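The parentheses matter because "|" otherwise splits the entire pattern in two. A small sketch on made-up sequences, using `grep -E` in place of egrep:

```bash
# Toy lines, made up for illustration.
seqs=$(printf '%s\n' 'TAAATA' 'TACCTA' 'TAACTA' 'GGCCTA')

# Grouped: TA, then AA or CC, then TA.
grouped=$(printf '%s\n' "$seqs" | grep -c -E 'TA(AA|CC)TA')

# Ungrouped: either TAAA or CCTA anywhere in the line, so GGCCTA matches as well.
ungrouped=$(printf '%s\n' "$seqs" | grep -c -E 'TAAA|CCTA')

echo "$grouped"     # 2
echo "$ungrouped"   # 3
```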

How to quantify matches with metacharacters.

Find TA followed by zero or more As followed by TA.

"*" -> 0 or more

cat SRR519926_1.fastq | egrep "TA(A*)TA" --color=always | head

Find TA followed by one or more As followed by TA.

"+" -> 1 or more

cat SRR519926_1.fastq | egrep "TA(A+)TA" --color=always | head

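Find TA followed by two to five As followed by TA. The curly brackets "{}" specify how many times the preceding character or group may repeat, here between 2 and 5 times. In the pattern below they apply only to the single A directly before them.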

cat SRR519926_1.fastq | egrep "TAA{2,5}TA" --color=always | head
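The three quantifiers can be compared side by side on toy input. This is a sketch with made-up sequences, using `grep -E` in place of egrep and counting matching lines with -c.

```bash
# Toy lines with 0, 1, 2, and well over 5 As between the two TA words.
seqs=$(printf '%s\n' 'TATA' 'TAATA' 'TAAATA' 'TAAAAAAAAAATA')

star=$(printf '%s\n' "$seqs" | grep -c -E 'TA(A*)TA')     # zero or more As
plus=$(printf '%s\n' "$seqs" | grep -c -E 'TA(A+)TA')     # one or more As
range=$(printf '%s\n' "$seqs" | grep -c -E 'TAA{2,5}TA')  # two to five As

# Same pattern with the repeated A written as an explicit group.
range_grouped=$(printf '%s\n' "$seqs" | grep -c -E 'TA(A{2,5})TA')

echo "$star $plus $range $range_grouped"   # 4 3 1 1
```

Only TAAATA falls inside the {2,5} range; the last line has too many As and no shorter TA...TA stretch to fall back on.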

Match Illumina adaptors at the end of the reads.

Match AGATCGG anywhere followed by any number of bases.

cat SRR519926_1.fastq | egrep "AGATCGG.*" --color=always | head
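To see what ".*" actually covers, the -o option can print just the matched part. The sketch below uses made-up reads and `grep -E` in place of egrep. Note that "." matches any character, not only bases.

```bash
# Toy reads, made up for illustration; two of them contain the adaptor start.
reads=$(printf '%s\n' 'ACGTACGTAGATCGGAAGAGC' 'ACGTACGTACGT' 'AGATCGGAAG')

# Count reads that contain the adaptor anywhere.
n_with_adapter=$(printf '%s\n' "$reads" | grep -c -E 'AGATCGG')

# Print only the match: the adaptor plus everything after it up to the end of the line.
tails=$(printf '%s\n' "$reads" | grep -o -E 'AGATCGG.*')

echo "$n_with_adapter"   # 2
echo "$tails"            # AGATCGGAAGAGC and AGATCGGAAG, one per line
```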

Get chromosome 22 of the Human Genome

mkdir chr22
cd chr22
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip > chr22.fa

Let’s find the telomeric repeats TTAGGG in the human genome. In principle this is easy: build and match the pattern. In practice, when we search at a genomic scale, things don’t always go as planned.

Let’s first check that this pattern is present. Use the -i flag to make the match case insensitive, since some regions of the genome may be written in lowercase (for example, repeat regions are marked this way in the human genome):

cat chr22.fa | egrep -i '(TTAGGG)' --color=always

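The above won’t work perfectly, since this is line-oriented matching and the genome sequence wraps over many lines, so a repeat that crosses a line break is missed. But it is a good start. We can refine it further by requiring several consecutive copies: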

cat chr22.fa | egrep -i '(TTAGGG){3,10}' --color=always

Then we can “linearize” the genome by removing the new line wrapping characters from it. That way we can look for the actual pattern:

cat chr22.fa | tr -d '\n' | egrep -o -i '(TTAGGG){20,30}' --color=always

What does tr -d '\n' do in the above command line? Can you guess?

tr is used for translating or deleting characters. The "-d" option deletes characters, in this case '\n', which stands for "new line". The remaining options belong to egrep. The "-i" flag ignores case, so 'agatc' will be found as well as 'AGATC' and 'aGAtc'. The "-o" option prints only the matching portion of each line. The {20,30} restricts matches to places where the pattern repeats between 20 and 30 times in a row.
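The effect of line wrapping can be reproduced on a tiny example. The sketch below uses a made-up three-line FASTA record in which three TTAGGG copies are split across the line breaks, and `grep -E` in place of egrep. Dropping the header line with grep -v '^>' is an added assumption, so that the header text is not glued onto the sequence.

```bash
# Toy FASTA record, made up for illustration; joined, the sequence is
# ACG + TTAGGG + TTAGGG + TTAGGG + ACG.
fasta=$(printf '%s\n' '>toy' 'ACGTTAGG' 'GTTAGGGT' 'TAGGGACG')

# Line by line, no single line holds two consecutive copies.
# grep exits non-zero when nothing matches, hence the "|| true".
wrapped_hits=$(printf '%s\n' "$fasta" | grep -v '^>' | grep -c -i -E '(TTAGGG){2,}' || true)

# After deleting the newlines the repeats become contiguous.
linear_hit=$(printf '%s\n' "$fasta" | grep -v '^>' | tr -d '\n' | grep -o -i -E '(TTAGGG){2,}')

echo "$wrapped_hits"   # 0
echo "$linear_hit"     # TTAGGGTTAGGGTTAGGG
```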