13. Regular expressions
This page contains content taken directly from the Biostar Handbook by Istvan Albert.
Activate the bioinformatics environment.
conda activate bioinfo
First let's make a place to store today's work. In your biostar_class directory, create a new directory called "june".
cd biostar_class
mkdir june
cd june
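Download the reads for run SRR519926. The --split-files option writes the paired reads to separate files (SRR519926_1.fastq and SRR519926_2.fastq).
fastq-dump --split-files SRR519926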
What are regular expressions?
* also called regex or regexp
* a sequence of characters that defines a search pattern
* can be used to find or replace text in strings of characters
* quote the pattern (the examples here use double quotes) so that the shell passes it to the search tool unchanged
Here is a regular expressions cheat sheet.
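grep vs. egrep
- grep = Global Regular Expression Print; by default it uses basic regular expressions
- egrep = extended grep, equivalent to grep -E; it uses extended regular expressions, where +, ?, |, {} and () work without backslashes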
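The difference between the two modes can be sketched on a made-up sequence (not taken from the class data). This is only an illustration; grep -E is used here as the equivalent of egrep, and the variable names are hypothetical.

```shell
# Toy input (hypothetical, not from SRR519926).
seq="TAAATA"

# Basic grep: an unescaped "+" is a literal plus sign, so nothing matches.
# "|| true" keeps the script going, because grep exits non-zero on no match.
basic_count=$(printf '%s\n' "$seq" | grep -c 'TA+TA' || true)

# Extended grep: "+" means "one or more of the preceding character".
extended_count=$(printf '%s\n' "$seq" | grep -E -c 'TA+TA' || true)

echo "basic: $basic_count, extended: $extended_count"
```

The -c option prints the number of matching lines instead of the lines themselves, which makes the two modes easy to compare.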
Find an ATG anchored at the start of the line. Put the caret symbol "^" at the start of the pattern you are looking for.
cat SRR519926_1.fastq | egrep "^ATG" --color=always | head
Find an ATG anchored at the end of the line. Use the dollar sign "$" at the end of the pattern you are looking for.
cat SRR519926_1.fastq | egrep "ATG$" --color=always | head
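To see what the two anchors do without scrolling through real reads, here is a small sketch on a made-up FASTQ-style file. The file name toy.fastq and the reads in it are assumptions for illustration only, and grep -E stands in for egrep.

```shell
# Build a tiny FASTQ-style file (made-up reads, for illustration only).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ATGCCGTA' '+' 'IIIIIIII' \
  '@read2' 'CCGTAATG' '+' 'IIIIIIII' \
  '@read3' 'CCATGCCA' '+' 'IIIIIIII' > "$tmpdir/toy.fastq"

# -c counts matching lines instead of printing them.
starts_with_atg=$(grep -E -c '^ATG' "$tmpdir/toy.fastq")   # ATG at line start
ends_with_atg=$(grep -E -c 'ATG$' "$tmpdir/toy.fastq")     # ATG at line end
contains_atg=$(grep -E -c 'ATG' "$tmpdir/toy.fastq")       # ATG anywhere

echo "start: $starts_with_atg  end: $ends_with_atg  anywhere: $contains_atg"
rm -r "$tmpdir"
```

All three reads contain ATG somewhere, but only one has it at the start of the line and only one at the end, so the anchored patterns are much more selective.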
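Find TAATA or TATTA patterns. Square brackets "[ ]" specify a set of characters, any one of which may match at that position. List the characters with no separator, since a comma inside the brackets would itself be matched as a character.
cat SRR519926_1.fastq | egrep "TA[AT]TA" --color=always | head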
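A quick sketch of how a bracket expression behaves, using made-up sequences (one of them contains a literal comma on purpose). The variable names are hypothetical, and grep -E stands in for egrep.

```shell
# Made-up sequences, one per line.
seqs=$(printf '%s\n' 'GGTAATAGG' 'GGTATTAGG' 'GGTACTAGG' 'GGTA,TAGG')

# [AT] matches exactly one character that is either A or T.
class_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TA[AT]TA')

# A comma inside the brackets is just another character in the set,
# so [A,T] also matches the line that contains a literal comma.
comma_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TA[A,T]TA')

echo "[AT]: $class_hits  [A,T]: $comma_hits"
```

On real sequence data the extra comma rarely changes the result, since reads do not contain commas, but it is not needed and can mislead on other kinds of text.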
Find TAAATA or TACCTA. The pipe symbol "|" means "either this or that", and the parentheses "()" group the alternatives so that the pipe applies only inside them.
cat SRR519926_1.fastq | egrep "TA(AA|CC)TA" --color=always | head
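The effect of the parentheses can be sketched by comparing a grouped and an ungrouped pattern on a few made-up sequences. The sequences and variable names are assumptions for illustration; grep -E stands in for egrep.

```shell
# Made-up sequences, one per line.
seqs=$(printf '%s\n' 'TAAATA' 'TACCTA' 'TAACTA' 'GGCCTA')

# Grouped: TA, then either AA or CC, then TA.
grouped_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TA(AA|CC)TA')

# Ungrouped: the pipe splits the whole pattern, so this means
# "TAAA" or "CCTA", which also matches GGCCTA.
ungrouped_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TAAA|CCTA')

echo "grouped: $grouped_hits  ungrouped: $ungrouped_hits"
```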
How to quantify matches with metacharacters.
Find TA followed by zero or more A followed by TA.
"*" -> 0 or more
cat SRR519926_1.fastq | egrep "TA(A*)TA" --color=always | head
"+" -> 1 or more
cat SRR519926_1.fastq | egrep "TA(A+)TA" --color=always | head
"{2,5}" -> between 2 and 5
cat SRR519926_1.fastq | egrep "TAA{2,5}TA" --color=always | head
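The three quantifiers can be compared side by side on made-up sequences that differ only in the number of A characters between the two TA ends. This is a sketch with hypothetical variable names; grep -E stands in for egrep.

```shell
# Made-up sequences: TA, then 0, 1, 2 or 7 extra A's, then TA.
seqs=$(printf '%s\n' 'TATA' 'TAATA' 'TAAATA' 'TAAAAAAAATA')

# "*": zero or more A's between the TA ends, so every line matches.
star_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TA(A*)TA')

# "+": at least one A between the TA ends, so TATA drops out.
plus_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TA(A+)TA')

# "{2,5}": TA, then 2 to 5 A's, then TA; only TAAATA qualifies here.
range_hits=$(printf '%s\n' "$seqs" | grep -E -c 'TAA{2,5}TA')

echo "star: $star_hits  plus: $plus_hits  range: $range_hits"
```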
Match Illumina adaptors at the end of the reads.
Match AGATCGG anywhere followed by any number of bases.
cat SRR519926_1.fastq | egrep "AGATCGG.*" --color=always | head
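To check how far the ".*" part reaches, the -o option can print just the matched text. The read below is made up for illustration and the variable names are hypothetical; grep -E stands in for egrep.

```shell
# Made-up read: some sequence followed by the start of an adaptor.
read_seq='ACGTACGTAGATCGGAAGAGC'

# -o prints only the matched part: AGATCGG plus everything after it,
# because "." matches any character and "*" repeats it to the line end.
adaptor_match=$(printf '%s\n' "$read_seq" | grep -E -o 'AGATCGG.*')

echo "$adaptor_match"
```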
Next, download human chromosome 22 (hg38) into its own directory.
mkdir chr22
cd chr22
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip > chr22.fa
Let’s first check that the pattern TTAGGG is present in chr22.fa. Use the -i flag to make the match case insensitive, since the genome may contain lowercase regions (for example, the repeating regions are marked this way in the human genome):
cat chr22.fa | egrep -i '(TTAGGG)' --color=always
The above won’t work perfectly, since egrep matches line by line and the genome sequence wraps over many lines. But it is a good start. We can refine it by requiring 3 to 10 consecutive copies of the pattern:
cat chr22.fa | egrep -i '(TTAGGG){3,10}' --color=always
Then we can “linearize” the genome by removing the new line wrapping characters from it. That way we can look for the actual pattern:
cat chr22.fa | tr -d '\n' | egrep -o -i '(TTAGGG){20,30}' --color=always
What does tr -d '\n' do in the above command line? Can you guess?
tr is used for translating or deleting characters. The "-d" option deletes characters, in this case '\n', which stands for "new line". For egrep, the "-i" flag ignores case, so 'agatc' will be found as well as 'AGATC' and 'aGAtc'. The "-o" option prints only the matching portion of each line. The {20,30} restricts matches to places where the pattern repeats between 20 and 30 times in a row.
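Why the line wrapping matters can be sketched on a tiny made-up FASTA file in which a repeat is split across lines. The file toy.fa and its contents are assumptions for illustration, and grep -E stands in for egrep. Dropping the header line with grep -v is an extra step that is not part of the command above; it keeps the ">" line from being glued onto the sequence.

```shell
tmpdir=$(mktemp -d)
# Toy FASTA (made up): three TTAGGG copies, wrapped so that the repeat
# is split across lines, with the last copy in lowercase.
printf '%s\n' '>toy' 'ACTTAGGGTT' 'AGGGttaggg' 'AC' > "$tmpdir/toy.fa"

# Line by line: no single line holds three consecutive copies.
wrapped_hits=$(grep -E -i -c '(TTAGGG){3}' "$tmpdir/toy.fa" || true)

# Drop the header, remove the newlines, then search the one long line.
linear_hits=$(grep -v '^>' "$tmpdir/toy.fa" | tr -d '\n' | grep -E -i -o '(TTAGGG){3}')

echo "wrapped: $wrapped_hits"
echo "linearized: $linear_hits"
rm -r "$tmpdir"
```

The wrapped search finds nothing, while the linearized search recovers the full repeat, including the lowercase copy thanks to -i.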