20. Multiple Sequence Aligners copy

This page uses content directly from the Biostars Handbook by Istvan Albert.

Always remember to activate the bioinformatics environment.

conda activate bioinfo

How to align more than two sequences?

Let's download the Ebola genomes.

mkdir -p ebola
esearch -db nuccore -query PRJNA257197 | efetch -format fasta > genomes/ebola.fa

and check the number of sequences with seqkit.

seqkit stat genomes/ebola.fa

and here are the results. We've got all 249 sequences.

file              format  type  num_seqs    sum_len  min_len   avg_len  max_len
genomes/ebola.fa  FASTA   DNA        249  4,705,429   18,613  18,897.3   18,959

Now, we'll pull out the first 10 sequence ids by using "cut" at spaces.

seqkit seq -n genomes/ebola.fa | cut -f 1 -d ' ' | head -10 > ids.txt

and see what we've got.

less ids.txt

Here's what's in the file.

KR105345.1
KR105328.1
KR105323.1
KR105302.1
KR105295.1
KR105294.1
KR105282.1
KR105266.1
KR105263.1
KR105253.1

We'll use these sequence ids to pull out the corresponding sequences.

seqkit grep --pattern-file ids.txt genomes/ebola.fa > small.fa

Let's see what we got.

less small.fa | wc -l

and compare to the original file

less genomes/ebola.fa | wc -l

Need to install mafft.

Get the install packages here:

https://mafft.cbrc.jp/alignment/software/source.html

and follow the instructions for install.

Run the mafft program on the small file of Ebola genomes we created.

mafft --clustalout small.fa > alignment.maf

Let's take a look at the first few and last few lines in this alignment file.

head -20 alignment.maf
tail -20 alignment.maf

Look at Clustal Omega online.