16. Sequence Alignments copy
This page uses content directly from the Biostar Handbook by Istvan Albert.
Learning objectives: 1. Understand what a sequence alignment is and how different algorithms can effect alignments. 2. Learn how scoring matrices and gap penalties (gap open, gap extend) are used. 3. Tell the difference between local and global. 4. Interpret the results of alignments.
Remember to activate the bioinformatics environment and move to your class directory. Make a folder called "seq_align" for today's work.
conda activate bioinfo
cd biostar_class
mkdir seq_align
cd seq_align
What is a sequence alignment and what are they used for?
-
Pairwise sequence alignment can identify similar regions between (protein or nucleotide) sequences than can indicate functional, structural or evolutionary relationships.
-
Sequence alignment is arranging two sequences in such a way that similar regions line up. There are multiple ways to align two sequences.
-
Alignments are used to find similar regions between two sequences and to see which sequence is most similar to a query (input) sequence.
-
Alignments are determined by the scoring algorithms that are used (local, global, semi-global). Different algorithms can produce different alignments for the same sequences. Different scoring (gap open, gap extend, matrices) will give different alignments for the same sequences.
-
For very similar sequences, the results will be "robust" and could be reproduced using a variety of different algorithms and settings. For dissimilar sequences, choices made in algorithm and scoring will have a larger impact on the resulting alignments.
Alignments are usually displayed something like thisGATTACA GATTACA GATTACA ||| | ||| || || | || GATCA-- GAT--CA GA-T-CA
ATGCAAATGACAAAT-CGA |||| |||.||.| ||| ATGC---TGATAACTGCGA
- Where the "-" indicates a gap,
- the "|" indicates a match,
- and the "." indicates a mismatch or
- a less dissimilar ":" mismatch.
Alignments are scored when the algorithm assigns a value to matches, mismatches, gaps and gap extensions.
For example, we could choose the following scores:
- 5 points for a match.
- -4 points for a mismatch.
- -10 points for opening a gap.
- -0.5 points for extending an open gap.
Here is an example of scoring.
GATTACA GATTACA GATTACA
|||.| ||| || || | ||
GATCA-- GAT--CA GA-T-CA
5.5 14.5 5
To download and view the NCBI EDNAFULL nucleotide scoring matrix.
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/NUC.4.4
-> Write output to a local file named like the remote file we get.
What is a scoring matrix and what are they used for?
Scoring matrices are used to determine the scores made by matching two characters (nucleotides, amino acids) in a sequence alignment. (genomevolution.org) The purpose is to match the most similar elements of two sequences.
Other ways to quantify alignments include percent identity, percent similarity, e-values and mapping quality.
The CIGAR string (Compact Idiosyncratic Gapped Alignment Report) is an alignment format used within the Sequence Alignment Map (SAM) files that form the backbone of most bioinformatics high-throughput analyses. For the same alignment from above:
ATGCAAATGACAAATAC
|||| |||.||.|
ATGC---TGATAACT--
4M3D3M1X2M1X1M2D We read out this “idiosyncratic” construct like so 4M + 3D + 3M + 1X + 2M + 1X + 1M + 2D:
4 matches followed by 3 deletions, 3 matches, 1 mismatch, 2 matches, 1 mismatch, 1 match, 2 deletions. The format above is in a so-called “Extended CIGAR,” meaning that it employs the X symbol for mismatches.
We will begin by downloading two scripts (from biostarhandbook, using scripts in EMBOSS package) and putting them in the "~/bin" directory we create. The "-p" option with "mkdir" tells the program to create intermediate directories as required.
mkdir -p ~/bin
curl http://data.biostarhandbook.com/align/global-align.sh > ~/bin/global-align.sh
curl http://data.biostarhandbook.com/align/local-align.sh > ~/bin/local-align.sh
To see the permissions for a file
ls -l
What are permissions? (content from datacarpentries.org Introduction to the Command line for Genomics)
The first column you see when viewing permissions is file type.
- "-" for an ordinary file
- "d" for directory
So here is how we change permissions on the "align" scripts in the "bin" directory.
chmod +x ~/bin/*-align.sh
The Smith-Waterman algorithm (1981) performs local sequence alignment, it looks for similar regions between two nucleotide or protein sequences by comparing all possible length segments and optimizing.
Global alignments are used to find similarities over the entire length of both sequences.
The Needleman-Wunsch algorithm (1970) is used to align protein or nucleotide sequences by dividing the sequences into smaller sequences to find an optimal solution. It gives a score to all possible alignments, finding the one with the highest score.
Question: How do we decide if we should use local or global alignment?
Answer (Des): It depends on your input sequences. Local alignments are useful for finding regions of similarity between dissimilar sequences. Global alignments do more of an end-to-end alignment between two sequences. If you have two very dissimilar sequences you'll probably get a lot of gaps between your two sequences when you do a global alignment. A local alignment might be good for finding similar domains in two unrelated proteins; or maybe trying to find a transcription factor binding site in two promoter sequences.
Now try running the script. Try some variations to see how the alignment changes.
local-align.sh THISLINE ISALIGNED
local-align.sh ISTHISLINE ISALIGNED
local-align.sh ISTHISLINE ISTHISLINEALIGNED
local-align.sh string1.fa string2.fa
string1.fa
>NP_001298013.1 superoxide dismutase [Cu-Zn] 2 [Solanum lycopersicum]
MVKAVAVLNSSEGVSGTILFTQDGDAPTTVNGNISGLKPGLHGFHVHALGDTTNGCMSTGPHYNPAGKEH
GAPEDEVRHAGDLGNITVGEDGTASFTITDKQIPLTGPQSIIGRAVVVHADPDDLGKGGHELSKSTGNAG
GRIACGIIGLQG
string2.fa
>NP_000445.1 superoxide dismutase [Cu-Zn] [Homo sapiens]
MATKAVCVLKGDGPVQGIINFEQKESNGPVKVWGSIKGLTEGLHGFHVHEFGDNTAGCTSAGPHFNPLSR
KHGGPKDEERHVGDLGNVTADKDGVADVSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGN
AGSRLACGVIGIAQ
Next, we will download some BLOSUM matrices from NCBI and use them in our alignments.
BLOSUM - Blocks Substitution Matrices (Henikoff & Henikoff), are used for protein alignments, and are based on local alignments. BLOSUM matrices are derived by comparing blocks of amino acids. Higher number matrices (BLOSUM80) are used for more closely (evolutionarily) related proteins, while lower number matrices (BLOSUM45) are used for more distantly related proteins.
See more info here.
https://en.wikipedia.org/wiki/BLOSUM
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/BLOSUM30
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/BLOSUM62
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/BLOSUM80
Parameters for local-align.sh and global-align.sh
-gapopen (a penalty for opening a gap in one of the sequences for a better overall alignment)
-gapextend (is a penalty for extending the length of a gap)
-data (is used to pass a scoring matrix to the aligner)
Local alignments Try a few different sequence alignments with different matrices.
local-align.sh string1.fa string2.fa --data BLOSUM30 -gapopen 5
local-align.sh string1.fa string2.fa --data BLOSUM80 -gapopen 5
BLOSUM30
#
# Aligned_sequences: 2
# 1: NP_001298013.1
# 2: NP_000445.1
# Matrix: BLOSUM30
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 157
# Identity: 87/157 (55.4%)
# Similarity: 112/157 (71.3%)
# Gaps: 9/157 ( 5.7%)
# Score: 647.5
NP_001298013. 1 M-VKAVAVLNSSEG-VSGTILFTQ---DGDAPTTVNGNISGLKPGLHGFH 45
| :|||.|| ..:| |.|.|.|.| :| |:.|.|.|.||.:||||||
NP_000445.1 1 MATKAVCVL-KGDGPVQGIINFEQKESNG--PVKVWGSIKGLTEGLHGFH 47
NP_001298013. 46 VHALGDTTNGCMSTGPHYNPAGKEHGAPEDEVRHAGDLGNITVGEDGTAS 95
||.:||:|.||.|:|||:||..::||.|:||.||:|||||:|:.:||:|.
NP_000445.1 48 VHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVAD 97
NP_001298013. 96 FTITDKQIPLTGPQSIIGRAVVVHADPDDLGKGGHELSKSTGNAGGRIAC 145
::|.|..|.|:|...||||::|||...|||||||.|.|..|||||.|:||
NP_000445.1 98 VSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLAC 147
NP_001298013. 146 GIIGL-Q 151
|:||: |
NP_000445.1 148 GVIGIAQ 154
BLOSUM80
#
# Aligned_sequences: 2
# 1: NP_001298013.1
# 2: NP_000445.1
# Matrix: BLOSUM80
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 159
# Identity: 88/159 (55.3%)
# Similarity: 105/159 (66.0%)
# Gaps: 13/159 ( 8.2%)
# Score: 696.5
#
#
#=======================================
NP_001298013. 1 M-VKAVAVLNSSEG-VSGTILFTQ---DGDAPTTVNGNISGLKPGLHGFH 45
| .|||.||. .:| |.|.|.|.| :| |..|.|:|.||..||||||
NP_000445.1 1 MATKAVCVLK-GDGPVQGIINFEQKESNG--PVKVWGSIKGLTEGLHGFH 47
NP_001298013. 46 VHALGDTTNGCMSTGPHYNPAGKEHGAPEDEVRHAGDLGNITVGEDGTAS 95
||..||.|.||.|.|||:||..::||.|:||.||.|||||:|..:||.|.
NP_000445.1 48 VHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVAD 97
NP_001298013. 96 FTITDKQIPLTGPQSIIGRAVVVH--ADPDDLGKGGHELSKSTGNAGGRI 143
.:|.|..|.|:|.:.||||.:||| | |||||||:|.|..|||||.|:
NP_000445.1 98 VSIEDSVISLSGDHCIIGRTLVVHEKA--DDLGKGGNEESTKTGNAGSRL 145
NP_001298013. 144 ACGIIGL-Q 151
|||:||: |
NP_000445.1 146 ACGVIGIAQ 154
Remember * the - character may indicate a “gap” (space), * the | character is used to display a match, * the . character may be used to display a mismatch, * the : character may display a more conservative mismatch.
Global alignments
Now let's try some global alignments
Will global alignment give us different results on our two SOD proteins than the local alignment?
global-align.sh string1.fa string2.fa --data BLOSUM30 -gapopen 5
global-align.sh string1.fa string2.fa --data BLOSUM80 -gapopen 5
Aligned_sequences: 2
# 1: NP_001298013.1
# 2: NP_000445.1
# Matrix: BLOSUM30
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 158
# Identity: 86/158 (54.4%)
# Similarity: 112/158 (70.9%)
# Gaps: 10/158 ( 6.3%)
# Score: 647.5A
NP_001298013. 1 -MVKAVAVLNSSEG-VSGTILFTQ---DGDAPTTVNGNISGLKPGLHGFH 45
::|||.|| ..:| |.|.|.|.| :| |:.|.|.|.||.:||||||
NP_000445.1 1 MATKAVCVL-KGDGPVQGIINFEQKESNG--PVKVWGSIKGLTEGLHGFH 47
NP_001298013. 46 VHALGDTTNGCMSTGPHYNPAGKEHGAPEDEVRHAGDLGNITVGEDGTAS 95
||.:||:|.||.|:|||:||..::||.|:||.||:|||||:|:.:||:|.
NP_000445.1 48 VHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVAD 97
NP_001298013. 96 FTITDKQIPLTGPQSIIGRAVVVHADPDDLGKGGHELSKSTGNAGGRIAC 145
::|.|..|.|:|...||||::|||...|||||||.|.|..|||||.|:||
NP_000445.1 98 VSIEDSVISLSGDHCIIGRTLVVHEKADDLGKGGNEESTKTGNAGSRLAC 147
NP_001298013. 146 GIIGL-QG 152
|:||: |
NP_000445.1 148 GVIGIAQ- 154
Aligned_sequences: 2
# 1: NP_001298013.1
# 2: NP_000445.1
# Matrix: BLOSUM80
# Gap_penalty: 5.0
# Extend_penalty: 0.5
#
# Length: 160
# Identity: 88/160 (55.0%)
# Similarity: 105/160 (65.6%)
# Gaps: 14/160 ( 8.8%)
# Score: 696.5
NP_001298013. 1 M-VKAVAVLNSSEG-VSGTILFTQ---DGDAPTTVNGNISGLKPGLHGFH 45
| .|||.||. .:| |.|.|.|.| :| |..|.|:|.||..||||||
NP_000445.1 1 MATKAVCVLK-GDGPVQGIINFEQKESNG--PVKVWGSIKGLTEGLHGFH 47
NP_001298013. 46 VHALGDTTNGCMSTGPHYNPAGKEHGAPEDEVRHAGDLGNITVGEDGTAS 95
||..||.|.||.|.|||:||..::||.|:||.||.|||||:|..:||.|.
NP_000445.1 48 VHEFGDNTAGCTSAGPHFNPLSRKHGGPKDEERHVGDLGNVTADKDGVAD 97
NP_001298013. 96 FTITDKQIPLTGPQSIIGRAVVVH--ADPDDLGKGGHELSKSTGNAGGRI 143
.:|.|..|.|:|.:.||||.:||| | |||||||:|.|..|||||.|:
NP_000445.1 98 VSIEDSVISLSGDHCIIGRTLVVHEKA--DDLGKGGNEESTKTGNAGSRL 145
NP_001298013. 144 ACGIIGL-QG 152
|||:||: |
NP_000445.1 146 ACGVIGIAQ- 154
Why are scoring values integer numbers?
You could wonder, rightfully so, how come the scores are all integers? And also what does 3 vs. 5 mean in a scoring matrix?
In a nutshell, the score reflects a probability. Besides, the scores are represented as log 2 odds. A substitution score of 3 means 2^3=8 whereas a substitution score of 5 means 2^5=32, thus a four-fold increase of probability relative to one another.
Thus a substitution with a score of 3 is four times more likely to occur than a change with a score of 5. For simplicity and to keep our sanity the probabilities log2 odds were all rounded to nearest integers.
Note how the score is not absolute; they are relative to one another. Any other number pairs with the same ratios would have the same effect.
And remember:
Alignment reliability also depends on the information content of the aligned sequence itself.
Alignments that operate on low complexity regions (CCCCCCCCCCC) generally produce less reliable variation calls.