18. BLAST copy

This page uses content directly from the Biostar Handbook by Istvan Albert.

Remember to activate the bioinformatics environment and create a directory for today's work.

conda activate bioinfo
mkdir blast
cd blast

What is BLAST? Basic Local Alignment Search Tool

Primary purpose of BLAST is to search a collection of target sequences that match a query sequence.
See the BLAST interface at NCBI
BLAST tools (blastn, blastp, blastx, tblastn, tblastx and more)
Each tool can be further customized with flags (options) such as "-task".
If a task is not specified at the command line, BLAST will run with defaults settings.

Fundamental concepts of BLAST

A search may take place in nucleotide space, protein space or translated spaces where nucleotides are translated into proteins.
Searches may implement search “strategies”: optimizations to a specific task.
Different search strategies will produce different alignments.
Searches use alignments that rely on scoring matrices

BLAST for pairwise alignments (aligning two sequences to one another).

How similar are these sequences?

# The Ebola nucleoprotein in the 1997 strain
efetch -db protein -id AAD14590 --format fasta > AAD14590.fa

# The Ebola nucleoprotein in the 2018 strain
efetch -db protein -id ARG44037 --format fasta > ARG44037.fa

# Run a pairwise BLAST alignment
blastp -query AAD14590.fa -subject ARG44037.fa`

Here are the results printed to the screen.

BLASTP 2.9.0+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402

Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.


Database: User specified sequence set (Input: ARG44037).
           1 sequences; 739 total letters


Query= AAD14590.1 nucleoprotein [Ebola virus - Mayinga, Zaire, 1976]

Length=739
                                                                      Score        E
Sequences producing significant alignments:                          (Bits)     Value

ARG44037.1 nucleoprotein [Ebola virus]                                1512       0.0  


> ARG44037.1 nucleoprotein [Ebola virus]
Length=739

 Score = 1512 bits (3915),  Expect = 0.0, Method: Compositional matrix adjust.
Identities = 728/739 (99%), Positives = 730/739 (99%), Gaps = 0/739 (0%)

Query  1    MDSRPQKIWMAPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAF  60
            MDSRPQK+WM PSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAF
Sbjct  1    MDSRPQKVWMTPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAF  60

Query  61   EAGVDFQESADSFLLMLCLHHAYQGDYKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEEL  120
            EAGVDFQESADSFLLMLCLHHAYQGD+KLFLESGAVKYLEGHGFRFEVKKRDGVKRLEEL
Sbjct  61   EAGVDFQESADSFLLMLCLHHAYQGDHKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEEL  120

Query  121  LPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHA  180
            LPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHA
Sbjct  121  LPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHA  180

Query  181  EQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQAR  240
            EQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQAR
Sbjct  181  EQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQAR  240

Query  241  FSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLL  300
            FSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLL
Sbjct  241  FSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLL  300

Query  301  NLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAES  360
            NLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAES
Sbjct  301  NLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAES  360

Query  361  RELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGH  420
            RELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSG 
Sbjct  361  RELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGP  420

Query  421  YDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAP  480
            YDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAP
Sbjct  421  YDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAP  480

Query  481  DDLVLFDLDEDDEDTKPVPNRSTKGGQQKNSQKGQHIEGRQTQSRPIQNVPGPHRTIHHA  540
            DDLVLFDLDEDDEDTKPVPNR TKGGQQKNSQKGQH EGRQTQSRP QNVPGP RTIHHA
Sbjct  481  DDLVLFDLDEDDEDTKPVPNRLTKGGQQKNSQKGQHTEGRQTQSRPTQNVPGPRRTIHHA  540

Query  541  SAPLTDNDRRNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDGTSNR  600
            SAPLTDNDR NEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRD TSNR
Sbjct  541  SAPLTDNDRGNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDETSNR  600

Query  601  TPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQSEHSFEEMYRHILRSQG  660
            TPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQ EHSFEEMYRHILRSQG
Sbjct  601  TPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQPEHSFEEMYRHILRSQG  660

Query  661  PFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQF  720
            PFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQF
Sbjct  661  PFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQF  720

Query  721  YWPVMNHKNKFMAILQHHQ  739
            YWPVMNHKNKFMAILQHHQ
Sbjct  721  YWPVMNHKNKFMAILQHHQ  739



Lambda      K        H        a         alpha
   0.314    0.131    0.378    0.792     4.96 

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6 

Effective search space used: 488601


  Database: User specified sequence set (Input: ARG44037).
    Posted date:  Unknown
  Number of letters in database: 739
  Number of sequences in database:  1



Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40

Creating a custom BLAST database

Another way to run BLAST is to create your own custom database, and search against it with query sequences.

Usually you will not have to create your own blast database, you can download prebuilt databases from the NCBI webpage. Here at NIH they are also stored on HPC Biowulf, where you can also run BLAST at the command line.

[NCBI also maintains BLAST and databases on Amazon Web Services, where they are available for a very small cost.]

Let's run BLAST on a custom database of Ebola viral sequence data.

Use "makeblastdb" command to create a BLAST database.
Choose your tool - blastn for nucleotides or blastp for protein sequences.
Run the tool and format output.

# Get the Ebola genome as genbank.
efetch -db nuccore -id AF086833 -format gb  > AF086833.gb

# Look through the file for gene name VP35 and context around it.
cat AF086833.gb | grep -A 1  VP35

The "-A 1" flag tells grep how many lines of output to show.

Here is what we're looking for:

/product="VP35"
/protein_id="AAD14582.1"

Based on the protein_id, we can retrieve the protein sequence of the VP35 gene.

efetch -db protein -id AAD14582 -format fasta  > AAD14582.fa

We can use this link to get the data from Redo: Genomic surveillance elucidates Ebola virus origin.

Let's create a new directory for this protein sequence database.

mkdir -p db

Remember the "-p" flag is to "create intermediate directories as required". We want the data we are downloading to be structured the same way as the source.

And then we'll get the data - this is the Ebola virus proteome.

esearch -db protein -query PRJNA257197 | efetch -format fasta > db/proteins.fa

To get a look at the data, we can use the "seqkit" program to get some statistics

seqkit stats db/proteins.fa

To see this output:

file            format  type     num_seqs    sum_len  min_len  avg_len  max_len
db/proteins.fa  FASTA   Protein     2,240  1,367,025      217    610.3    2,212

So, the BLAST query seq is AAD14582.fa and the db we are searching against is in db/proteins.fa

Let's make the BLAST database.

makeblastdb -dbtype prot -in db/proteins.fa -out db/proteins

and we get this printed to the screen.

Building a new DB, current time: 07/06/2020 14:14:50
New DB name:   /Users/stonelakeak/teaching/biostar_class/blast/ebola/db/proteins
New DB title:  db/proteins.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2240 sequences in 0.0972621 seconds.

Now we can run the blastp tool.

blastp -db db/proteins -query AAD14582.fa > results.txt

If we look at the line count of the results, we see there are greater than 8000 lines in the results.

cat results.txt | wc -l

The results can be formatted in different ways to make them more readable.

blastp -db db/proteins -query AAD14582.fa -outfmt 7 | head

and we get the results like this.

# BLASTP 2.9.0+
# Query: AAD14582.1 VP35 [Ebola virus - Mayinga, Zaire, 1976]
# Database: db/proteins
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 249 hits found
AAD14582.1  AIG95885.1  98.824  340 4   0   1   340 340 0.0 702
AAD14582.1  AIG95894.1  98.824  340 4   0   1   340 340 0.0 702
AAD14582.1  AIG95903.1  98.824  340 4   0   1   340 340 0.0 702
AAD14582.1  AIG95912.1  98.824  340 4   0   1   340 340 0.0 702
AAD14582.1  AIG95921.1  98.824  340 4   0   1   340 340 0.0 702

You can get more info about the different ways to format BLAST output here, or by running "blastn -help" at the command line.

To see the best alignments and worst alignments, we can check the "head" and "tail" of the data.

blastp -db db/proteins -query AAD14582.fa -outfmt "6 qseqid sseqid pident" | sort -k3 -rn | head -5
blastp -db db/proteins -query AAD14582.fa -outfmt "6 qseqid sseqid pident" | sort -k3 -rn | tail -5

The "-k3" specifies we are searching on the value in the third column (percent identity), "-rn" specifies a reverse sort in numeric order (highest to lowest).

Blast Terminology

Query: the sequence that we are looking to match.
Target: the database (collection of sequences) that we are searching for matches.
Subject: the sequence entry in the target that produced an alignment.
Score: the alignment score.
E-Value: the Expect value (E) describes the number of hits better than the current hit that one can “expect” to see just by chance when searching a database of a particular size.

More on e-values

Defined as the number of "hits" expected by chance when searching a particular size database
The smaller the e-value, the better
Depend on size and content of database

What are BLAST tasks?

For blastn, the tasks available are:

blastn - finds more divergent sequences
megablast - finds less divergent sequences (this is the default)
blastn-short - short queries

We will look at how the different BLAST tasks function, by building another custom database and doing some queries.

Building another custom BLAST database

We will build a database out of all features of the 2014 Ebola genome under accession number KM233118. This data will go into a new directory named "db_2014".

mkdir -p db_2014
# Get the 2014 Ebola genome as GenBank file.
efetch -db nucleotide -id KM233118 --format gb > db_2014/KM233118.gb

# Get the 2014 Ebola genome as a FASTA file.
efetch -db nucleotide -id KM233118 --format fasta > db_2014/KM233118.fa

# Extract all features of the GenBank file as separate sequences.
cat db_2014/KM233118.gb | extractfeat -filter -describe gene > db_2014/KM233118-features.fa

extractfeat is an EMBOSS tool, we can find out more about how it works with

extractfeat --help

We can run some statistics on the db we created

seqkit stats db_2014/KM233118-features.fa

to see this

file                          format  type  num_seqs  sum_len  min_len  avg_len  max_len
db_2014/KM233118-features.fa  FASTA   DNA         40   71,382        9  1,784.6   18,613

Now to construct the database:

makeblastdb -dbtype nucl -in db_2014/KM233118-features.fa -out db_2014/KM233118-features

gets this message.

Building a new DB, current time: 07/06/2020 16:24:01
New DB name:   /Users/stonelakeak/teaching/biostar_class/blast/ebola/db_2014/KM-features
New DB title:  db_2014/KM233118-features.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 40 sequences in 0.0174501 seconds.

To run queries against this db that we made:

# Take the first sequence, keep the first 45 bases, rename the sequence to test, make it upper case, put it in query.fa
cat db_2014/KM233118-features.fa | seqret -filter -firstonly -sbegin 1 -send 45 -sid test -supper > query.fa

# Look at the query file.
cat query.fa

You can use:

seqret --help --verbose

to understand all the options for the "seqret" command.

Now we'll run blastn with this query against the db we made.

blastn -db db_2014/KM233118-features -query query.fa

Here is the first hit.

>KM233118_1_2905 [mRNA] (gene="NP") Zaire ebolavirus isolate Ebola 
virus/H.sapiens-wt/SLE/2014/Makona-NM042.3, complete genome.
Length=2905

 Score = 84.2 bits (45),  Expect = 1e-19
 Identities = 45/45 (100%), Gaps = 0/45 (0%)
 Strand=Plus/Plus

Query  1   AATCATACCTGGTTTGTTTCAGAGCCATATCACCAAGATAGAGAA  45
           |||||||||||||||||||||||||||||||||||||||||||||
Sbjct  1   AATCATACCTGGTTTGTTTCAGAGCCATATCACCAAGATAGAGAA  45

Output can be reformatted into a tabular display using the "-outfmt" option.

blastn -db db_2014/KM233118-features -query query.fa  -outfmt 7

to get this.

# BLASTN 2.9.0+
# Query: test [source] Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-NM042.3, complete genome.
# Database: db_2014/KM233118-features
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 3 hits found
test    KM233118_1_2905 100.000 45  0   0   1   45  1   45  9.75e-20    84.2
test    KM233118_1_2905 100.000 45  0   0   1   45  1   45  9.75e-20    84.2
test    KM233118_1_18613    100.000 45  0   0   1   45  45  9.75e-20    84.2
# BLAST processed 1 queries

You can choose which fields you would like to see.

 qseqid means Query Seq-id
      qgi means Query GI
     qacc means Query accesion
  qaccver means Query accesion.version
     qlen means Query sequence length
   sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
       sgi means Subject GI
    sallgi means All subject GIs

and modify your query.

blastn -db db_2014//KM233118-features -query query.fa -outfmt "6 qseqid sseqid pident"

to see this.

test    KM233118_1_2905 100.000
test    KM233118_1_2905 100.000
test    KM233118_1_18613    100.000

Now we're going to make a short query sequence to look at the different functioning between "blastn" and "blastn-short".

cat db_2014/KM233118-features.fa | seqret -filter -firstonly -sbegin 1 -send 12 -sid test -supper > short.fa

we can take a look at the result with "less".

less short.fa

and we can run a blastn search with this short query sequence.

blastn -db db_2014/KM233118-features -query short.fa

which gets us

Query= test [source] Zaire ebolavirus isolate Ebola
virus/H.sapiens-wt/SLE/2014/Makona-NM042.3, complete genome.

Length=12


***** No hits found *****

We can try using options "blastn" and "blastn-short" to see if we get a result. (Remember the default option for blastn is megablast.)

Megablast is intended for comparing a query to closely related sequences and works best if the target percent identity is 95% or more but is very fast.

Discontiguous megablast uses an initial seed that ignores some bases (allowing mismatches) and is intended for cross-species comparisons.

blastn is slow, but allows a word-size down to seven bases.

blastn -db db_2014/KM233118-features -query short.fa -task blastn
blastn -db db_2014/KM233118-features -query short.fa -task blastn-short

Can we get this to work with an even shorter query sequence? First let's create the query.

echo ">mini" > mini.fa
echo "AATCATA" >> mini.fa

The "echo" command repeats what you tell it to repeat. The first "echo" statement creates the description line of the file and the second "echo" statement creates the second line of the file.

Which of these command lines will produce a hit?

blastn -db db_2014/KM233118-features -query mini.fa -task blastn
blastn -db db_2014/KM233118-features -query mini.fa -task blastn-short

By default, BLAST defaults are set to the most common use cases. For example, it is set to mask repetitive regions. Let's look at an example.

Let's download the yeast genome in fasta format.

mkdir -p yeast
efetch -id NC_001133 -db nucleotide -format fasta > yeast/NC_001133.fa

Now we'll build the yeast genome database

makeblastdb -dbtype nucl -in yeast/NC_001133.fa  -out yeast/NC_001133

and create a short sequence from the beginning of the yeast genome.

head -2 yeast/NC_001133.fa > start.fa

Let's check the file contents.

less start.fa

and query the yeast genome for this sequence with blastn.

Database: yeast/NC_001133.fa
           1 sequences; 230,218 total letters

Query= NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete
sequence

Length=70

***** No hits found *****

How is it possible that the sequence was not found when we took it from the beginning of the yeast genome?

Even if we try doing a pairwise alignment, we get no match.

blastn -subject yeast/NC_001133.fa -query start.fa

Let's try to align with a different tool.

local-align.sh start.fa yeast/NC_001133.fa

This is what we get.

########################################
# Program: water
# Rundate: Tue  7 Jul 2020 11:21:11
# Commandline: water
#    [-asequence] start.fa
#    [-bsequence] yeast/NC_001133.fa
#    -filter
# Align_format: srspair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: NC_001133.9
# 2: NC_001133.9
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 70
# Identity:      70/70 (100.0%)
# Similarity:    70/70 (100.0%)
# Gaps:           0/70 ( 0.0%)
# Score: 350.0
# 
#
#=======================================

NC_001133.9        1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC     50
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
NC_001133.9        1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC     50

NC_001133.9       51 CACACACACACATCCTAACA     70
                     ||||||||||||||||||||
NC_001133.9       51 CACACACACACATCCTAACA     70


#---------------------------------------
#---------------------------------------

So why didn't the BLAST search work? By default, the BLAST algorithms block repetitive sequences. To change this behavior, we need to use a different setting on the "--dust" option.

blastn -query start.fa -subject yeast/NC_001133.fa -dust no

and now we get the alignment we are expecting. ```Sequences producing significant alignments: (Bits) Value

NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete... 130 7e-33

NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence Length=230218

Score = 130 bits (70), Expect = 7e-33 Identities = 70/70 (100%), Gaps = 0/70 (0%) Strand=Plus/Plus

Query 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA 60

Query 61 CATCCTAACA 70 |||||||||| Sbjct 61 CATCCTAACA 70

You can always check the default settings of BLAST by using "-help".

blastn -help ``` See also BLAST glossary, where you will find that DUST is a program for filtering regions of low complexity.