18. BLAST copy
This page uses content directly from the Biostar Handbook by Istvan Albert.
Remember to activate the bioinformatics environment and create a directory for today's work.
conda activate bioinfo
mkdir blast
cd blast
What is BLAST? Basic Local Alignment Search Tool
-
Primary purpose of BLAST is to search a collection of target sequences that match a query sequence.
-
See the BLAST interface at NCBI
-
BLAST tools (blastn, blastp, blastx, tblastn, tblastx and more)
-
Each tool can be further customized with flags (options) such as "-task".
-
If a task is not specified at the command line, BLAST will run with defaults settings.
Fundamental concepts of BLAST
- A search may take place in nucleotide space, protein space or translated spaces where nucleotides are translated into proteins.
- Searches may implement search “strategies”: optimizations to a specific task.
- Different search strategies will produce different alignments.
- Searches use alignments that rely on scoring matrices
BLAST for pairwise alignments (aligning two sequences to one another).
How similar are these sequences?
# The Ebola nucleoprotein in the 1997 strain
efetch -db protein -id AAD14590 --format fasta > AAD14590.fa
# The Ebola nucleoprotein in the 2018 strain
efetch -db protein -id ARG44037 --format fasta > ARG44037.fa
# Run a pairwise BLAST alignment
blastp -query AAD14590.fa -subject ARG44037.fa`
BLASTP 2.9.0+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402
Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.
Database: User specified sequence set (Input: ARG44037).
1 sequences; 739 total letters
Query= AAD14590.1 nucleoprotein [Ebola virus - Mayinga, Zaire, 1976]
Length=739
Score E
Sequences producing significant alignments: (Bits) Value
ARG44037.1 nucleoprotein [Ebola virus] 1512 0.0
> ARG44037.1 nucleoprotein [Ebola virus]
Length=739
Score = 1512 bits (3915), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 728/739 (99%), Positives = 730/739 (99%), Gaps = 0/739 (0%)
Query 1 MDSRPQKIWMAPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAF 60
MDSRPQK+WM PSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAF
Sbjct 1 MDSRPQKVWMTPSLTESDMDYHKILTAGLSVQQGIVRQRVIPVYQVNNLEEICQLIIQAF 60
Query 61 EAGVDFQESADSFLLMLCLHHAYQGDYKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEEL 120
EAGVDFQESADSFLLMLCLHHAYQGD+KLFLESGAVKYLEGHGFRFEVKKRDGVKRLEEL
Sbjct 61 EAGVDFQESADSFLLMLCLHHAYQGDHKLFLESGAVKYLEGHGFRFEVKKRDGVKRLEEL 120
Query 121 LPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHA 180
LPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHA
Sbjct 121 LPAVSSGKNIKRTLAAMPEEETTEANAGQFLSFASLFLPKLVVGEKACLEKVQRQIQVHA 180
Query 181 EQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQAR 240
EQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQAR
Sbjct 181 EQGLIQYPTAWQSVGHMMVIFRLMRTNFLIKFLLIHQGMHMVAGHDANDAVISNSVAQAR 240
Query 241 FSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLL 300
FSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLL
Sbjct 241 FSGLLIVKTVLDHILQKTERGVRLHPLARTAKVKNEVNSFKAALSSLAKHGEYAPFARLL 300
Query 301 NLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAES 360
NLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAES
Sbjct 301 NLSGVNNLEHGLFPQLSAIALGVATAHGSTLAGVNVGEQYQQLREAATEAEKQLQQYAES 360
Query 361 RELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGH 420
RELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSG
Sbjct 361 RELDHLGLDDQEKKILMNFHQKKNEISFQQTNAMVTLRKERLAKLTEAITAASLPKTSGP 420
Query 421 YDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAP 480
YDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAP
Sbjct 421 YDDDDDIPFPGPINDDDNPGHQDDDPTDSQDTTIPDVVVDPDDGSYGEYQSYSENGMNAP 480
Query 481 DDLVLFDLDEDDEDTKPVPNRSTKGGQQKNSQKGQHIEGRQTQSRPIQNVPGPHRTIHHA 540
DDLVLFDLDEDDEDTKPVPNR TKGGQQKNSQKGQH EGRQTQSRP QNVPGP RTIHHA
Sbjct 481 DDLVLFDLDEDDEDTKPVPNRLTKGGQQKNSQKGQHTEGRQTQSRPTQNVPGPRRTIHHA 540
Query 541 SAPLTDNDRRNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDGTSNR 600
SAPLTDNDR NEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRD TSNR
Sbjct 541 SAPLTDNDRGNEPSGSTSPRMLTPINEEADPLDDADDETSSLPPLESDDEEQDRDETSNR 600
Query 601 TPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQSEHSFEEMYRHILRSQG 660
TPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQ EHSFEEMYRHILRSQG
Sbjct 601 TPTVAPPAPVYRDHSEKKELPQDEQQDQDHTQEARNQDSDNTQPEHSFEEMYRHILRSQG 660
Query 661 PFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQF 720
PFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQF
Sbjct 661 PFDAVLYYHMMKDEPVVFSTSDGKEYTYPDSLEEEYPPWLTEKEAMNEENRFVTLDGQQF 720
Query 721 YWPVMNHKNKFMAILQHHQ 739
YWPVMNHKNKFMAILQHHQ
Sbjct 721 YWPVMNHKNKFMAILQHHQ 739
Lambda K H a alpha
0.314 0.131 0.378 0.792 4.96
Gapped
Lambda K H a alpha sigma
0.267 0.0410 0.140 1.90 42.6 43.6
Effective search space used: 488601
Database: User specified sequence set (Input: ARG44037).
Posted date: Unknown
Number of letters in database: 739
Number of sequences in database: 1
Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40
Creating a custom BLAST database
Another way to run BLAST is to create your own custom database, and search against it with query sequences.
Usually you will not have to create your own blast database, you can download prebuilt databases from the NCBI webpage. Here at NIH they are also stored on HPC Biowulf, where you can also run BLAST at the command line.
[NCBI also maintains BLAST and databases on Amazon Web Services, where they are available for a very small cost.]
Let's run BLAST on a custom database of Ebola viral sequence data.
- Use "makeblastdb" command to create a BLAST database.
- Choose your tool - blastn for nucleotides or blastp for protein sequences.
- Run the tool and format output.
# Get the Ebola genome as genbank.
efetch -db nuccore -id AF086833 -format gb > AF086833.gb
# Look through the file for gene name VP35 and context around it.
cat AF086833.gb | grep -A 1 VP35
The "-A 1" flag tells grep how many lines of output to show.
Here is what we're looking for:
/product="VP35"
/protein_id="AAD14582.1"
Based on the protein_id, we can retrieve the protein sequence of the VP35 gene.
efetch -db protein -id AAD14582 -format fasta > AAD14582.fa
Let's create a new directory for this protein sequence database.
mkdir -p db
And then we'll get the data - this is the Ebola virus proteome.
esearch -db protein -query PRJNA257197 | efetch -format fasta > db/proteins.fa
To get a look at the data, we can use the "seqkit" program to get some statistics
seqkit stats db/proteins.fa
file format type num_seqs sum_len min_len avg_len max_len
db/proteins.fa FASTA Protein 2,240 1,367,025 217 610.3 2,212
Let's make the BLAST database.
makeblastdb -dbtype prot -in db/proteins.fa -out db/proteins
Building a new DB, current time: 07/06/2020 14:14:50
New DB name: /Users/stonelakeak/teaching/biostar_class/blast/ebola/db/proteins
New DB title: db/proteins.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 2240 sequences in 0.0972621 seconds.
blastp -db db/proteins -query AAD14582.fa > results.txt
If we look at the line count of the results, we see there are greater than 8000 lines in the results.
cat results.txt | wc -l
blastp -db db/proteins -query AAD14582.fa -outfmt 7 | head
# BLASTP 2.9.0+
# Query: AAD14582.1 VP35 [Ebola virus - Mayinga, Zaire, 1976]
# Database: db/proteins
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 249 hits found
AAD14582.1 AIG95885.1 98.824 340 4 0 1 340 340 0.0 702
AAD14582.1 AIG95894.1 98.824 340 4 0 1 340 340 0.0 702
AAD14582.1 AIG95903.1 98.824 340 4 0 1 340 340 0.0 702
AAD14582.1 AIG95912.1 98.824 340 4 0 1 340 340 0.0 702
AAD14582.1 AIG95921.1 98.824 340 4 0 1 340 340 0.0 702
To see the best alignments and worst alignments, we can check the "head" and "tail" of the data.
blastp -db db/proteins -query AAD14582.fa -outfmt "6 qseqid sseqid pident" | sort -k3 -rn | head -5
blastp -db db/proteins -query AAD14582.fa -outfmt "6 qseqid sseqid pident" | sort -k3 -rn | tail -5
The "-k3" specifies we are searching on the value in the third column (percent identity), "-rn" specifies a reverse sort in numeric order (highest to lowest).
Blast Terminology
- Query: the sequence that we are looking to match.
- Target: the database (collection of sequences) that we are searching for matches.
- Subject: the sequence entry in the target that produced an alignment.
- Score: the alignment score.
- E-Value: the Expect value (E) describes the number of hits better than the current hit that one can “expect” to see just by chance when searching a database of a particular size.
More on e-values
- Defined as the number of "hits" expected by chance when searching a particular size database
- The smaller the e-value, the better
- Depend on size and content of database
What are BLAST tasks?
For blastn, the tasks available are:
- blastn - finds more divergent sequences
- megablast - finds less divergent sequences (this is the default)
- blastn-short - short queries
We will look at how the different BLAST tasks function, by building another custom database and doing some queries.
Building another custom BLAST database
We will build a database out of all features of the 2014 Ebola genome under accession number KM233118. This data will go into a new directory named "db_2014".
mkdir -p db_2014
# Get the 2014 Ebola genome as GenBank file.
efetch -db nucleotide -id KM233118 --format gb > db_2014/KM233118.gb
# Get the 2014 Ebola genome as a FASTA file.
efetch -db nucleotide -id KM233118 --format fasta > db_2014/KM233118.fa
# Extract all features of the GenBank file as separate sequences.
cat db_2014/KM233118.gb | extractfeat -filter -describe gene > db_2014/KM233118-features.fa
extractfeat is an EMBOSS tool, we can find out more about how it works with
extractfeat --help
seqkit stats db_2014/KM233118-features.fa
file format type num_seqs sum_len min_len avg_len max_len
db_2014/KM233118-features.fa FASTA DNA 40 71,382 9 1,784.6 18,613
makeblastdb -dbtype nucl -in db_2014/KM233118-features.fa -out db_2014/KM233118-features
Building a new DB, current time: 07/06/2020 16:24:01
New DB name: /Users/stonelakeak/teaching/biostar_class/blast/ebola/db_2014/KM-features
New DB title: db_2014/KM233118-features.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 40 sequences in 0.0174501 seconds.
To run queries against this db that we made:
# Take the first sequence, keep the first 45 bases, rename the sequence to test, make it upper case, put it in query.fa
cat db_2014/KM233118-features.fa | seqret -filter -firstonly -sbegin 1 -send 45 -sid test -supper > query.fa
# Look at the query file.
cat query.fa
seqret --help --verbose
Now we'll run blastn with this query against the db we made.
blastn -db db_2014/KM233118-features -query query.fa
>KM233118_1_2905 [mRNA] (gene="NP") Zaire ebolavirus isolate Ebola
virus/H.sapiens-wt/SLE/2014/Makona-NM042.3, complete genome.
Length=2905
Score = 84.2 bits (45), Expect = 1e-19
Identities = 45/45 (100%), Gaps = 0/45 (0%)
Strand=Plus/Plus
Query 1 AATCATACCTGGTTTGTTTCAGAGCCATATCACCAAGATAGAGAA 45
|||||||||||||||||||||||||||||||||||||||||||||
Sbjct 1 AATCATACCTGGTTTGTTTCAGAGCCATATCACCAAGATAGAGAA 45
blastn -db db_2014/KM233118-features -query query.fa -outfmt 7
# BLASTN 2.9.0+
# Query: test [source] Zaire ebolavirus isolate Ebola virus/H.sapiens-wt/SLE/2014/Makona-NM042.3, complete genome.
# Database: db_2014/KM233118-features
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 3 hits found
test KM233118_1_2905 100.000 45 0 0 1 45 1 45 9.75e-20 84.2
test KM233118_1_2905 100.000 45 0 0 1 45 1 45 9.75e-20 84.2
test KM233118_1_18613 100.000 45 0 0 1 45 45 9.75e-20 84.2
# BLAST processed 1 queries
qseqid means Query Seq-id
qgi means Query GI
qacc means Query accesion
qaccver means Query accesion.version
qlen means Query sequence length
sseqid means Subject Seq-id
sallseqid means All subject Seq-id(s), separated by a ';'
sgi means Subject GI
sallgi means All subject GIs
blastn -db db_2014//KM233118-features -query query.fa -outfmt "6 qseqid sseqid pident"
test KM233118_1_2905 100.000
test KM233118_1_2905 100.000
test KM233118_1_18613 100.000
cat db_2014/KM233118-features.fa | seqret -filter -firstonly -sbegin 1 -send 12 -sid test -supper > short.fa
less short.fa
blastn -db db_2014/KM233118-features -query short.fa
Query= test [source] Zaire ebolavirus isolate Ebola
virus/H.sapiens-wt/SLE/2014/Makona-NM042.3, complete genome.
Length=12
***** No hits found *****
We can try using options "blastn" and "blastn-short" to see if we get a result. (Remember the default option for blastn is megablast.)
Megablast is intended for comparing a query to closely related sequences and works best if the target percent identity is 95% or more but is very fast.
Discontiguous megablast uses an initial seed that ignores some bases (allowing mismatches) and is intended for cross-species comparisons.
blastn is slow, but allows a word-size down to seven bases.
blastn -db db_2014/KM233118-features -query short.fa -task blastn
blastn -db db_2014/KM233118-features -query short.fa -task blastn-short
echo ">mini" > mini.fa
echo "AATCATA" >> mini.fa
Which of these command lines will produce a hit?
blastn -db db_2014/KM233118-features -query mini.fa -task blastn
blastn -db db_2014/KM233118-features -query mini.fa -task blastn-short
By default, BLAST defaults are set to the most common use cases. For example, it is set to mask repetitive regions. Let's look at an example.
Let's download the yeast genome in fasta format.
mkdir -p yeast
efetch -id NC_001133 -db nucleotide -format fasta > yeast/NC_001133.fa
makeblastdb -dbtype nucl -in yeast/NC_001133.fa -out yeast/NC_001133
head -2 yeast/NC_001133.fa > start.fa
less start.fa
Database: yeast/NC_001133.fa
1 sequences; 230,218 total letters
Query= NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete
sequence
Length=70
***** No hits found *****
Even if we try doing a pairwise alignment, we get no match.
blastn -subject yeast/NC_001133.fa -query start.fa
Let's try to align with a different tool.
local-align.sh start.fa yeast/NC_001133.fa
########################################
# Program: water
# Rundate: Tue 7 Jul 2020 11:21:11
# Commandline: water
# [-asequence] start.fa
# [-bsequence] yeast/NC_001133.fa
# -filter
# Align_format: srspair
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: NC_001133.9
# 2: NC_001133.9
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 70
# Identity: 70/70 (100.0%)
# Similarity: 70/70 (100.0%)
# Gaps: 0/70 ( 0.0%)
# Score: 350.0
#
#
#=======================================
NC_001133.9 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC 50
||||||||||||||||||||||||||||||||||||||||||||||||||
NC_001133.9 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC 50
NC_001133.9 51 CACACACACACATCCTAACA 70
||||||||||||||||||||
NC_001133.9 51 CACACACACACATCCTAACA 70
#---------------------------------------
#---------------------------------------
blastn -query start.fa -subject yeast/NC_001133.fa -dust no
NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete... 130 7e-33
NC_001133.9 Saccharomyces cerevisiae S288C chromosome I, complete sequence Length=230218
Score = 130 bits (70), Expect = 7e-33 Identities = 70/70 (100%), Gaps = 0/70 (0%) Strand=Plus/Plus
Query 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA 60 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| Sbjct 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA 60
Query 61 CATCCTAACA 70 |||||||||| Sbjct 61 CATCCTAACA 70
You can always check the default settings of BLAST by using "-help".