Phylogenetics - Back to Basics - Multiple Sequence Alignment

Contributors

Michael Charleston

last_modification Published: May 10, 2024

last_modification Last Updated: May 10, 2024

Motivation

We use sequence alignment to:

identify complex relationships among multiple species - more than just pairwise comparisons;
find homologous parts (sites / loci) in sequences that may be under different selection dynamics;
and to build phylogenetic trees!

Multiple sequence alignment (MSA) is a required step in molecular phylogenetics —

Sequence Alignment

.left[One of the best understood and best solved bioinformatics problem is how to align two sequences.]

.left[In order to do this we need to know:]

what an “alignment” really means;
how to judge how good an alignment is;
an algorithm to do the alignment.

What is an alignment?

Given two sequences like GGGCTGAA and GGGACTG “an alignment” is a mapping of their positions (a.k.a. “sites”) to a common ordering, by inserting gaps in one sequence or another:

Alignment of two DNA sequences GGGACTG and GGCTGAA. In the alignment gaps, indicated by -, are added to the sequences to form GGGACTG-- and -GG-CTGAA so that homologous sites are aligned at positions 2, 3, 5, 6, 7. </p> —

Alignments represent homology

Our goal with sequence alignment is to identify which regions, down to individual positions in a molecular sequence, are homologous: that is, their shared evolutionary history is the same as that of the taxa of interest.

It's like tracing the origin of a set of transcribed documents through all their copies back to the original, where each was only copied (with maybe some mistakes) from one predecessor.

Once we have identified homologous sites then we can analyse their differences and similarities under an evolutionary model - a crucial centrepiece of phylogenetic analysis, and in fact of all comparative analysis of molecular sequences.

Sequences evolve on a tree

Schematic example of a phylogenetic tree where species are represented by short DNA sequences to demonstrate how sequences evolve on a tree. The tree branches multiple times from a common ancestor to five extant taxa. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. An A to G substitution is highlighted in the first branching event. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. A deletion of a T is highlighted in the first branching event. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. An insertion of TT is highlighted in one of the branches. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. An A to G substitution is highlighted in one of the branches. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. An insertion of a T is highlighted in one of the branches. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. Two T to G substitutions have occurred on one branch. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. A C to A substitution has occurred on one branch. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. A deletion of a T has occurred on one branch. —

Sequences evolve on a tree

A schematic of a phylogenetic tree showing the evolution of different DNA sequences from a common ancestral sequence. An A to C substitution has occurred on one branch. —

Let’s align these

Gaps don’t remain in the history so we only have

.image-50[ Five unaligned DNA sequences are arranged vertically. Nucleotides are colour coded G = orange, C = blue, A = green, T = pink ] —

A good alignment

This alignment reflects truth

.image-50[ A possible alignment of five DNA sequences arranged vertically. Gaps have been introduced to align homologous sites. Nucleotides are colour coded G = orange, C = blue, A = green, T = pink ] —

Although…

But this alignment also looks “good” (?)

.image-50[ A possible alignment of five DNA sequences arranged vertically. Gaps have been introduced to align homologous sites. Nucleotides are colour coded G = orange, C = blue, A = green, T = pink ]

Although…

And so does this

.image-50[ A possible alignment of five DNA sequences arranged vertically. Gaps have been introduced to align homologous sites. Nucleotides are colour coded G = orange, C = blue, A = green, T = pink ]

This is nonsensical

Only matches and indels! Win?

.image-50[ A possible alignment of five DNA sequences arranged vertically. Gaps have been introduced to align homologous sites. Nucleotides are colour coded G = orange, C = blue, A = green, T = pink ]

Here, the indels do not make sense as having come from the same phylogenetic history.

Fixing errors

A possible alignment of six DNA sequences arranged vertically drawing attention to a position, 1, where gaps have been introduced in nearly all sequences and, 2, gaps are not aligned but could be.

At 1 a gap has been inserted in EVERY sequence.

Fixing errors

A possible alignment of six DNA sequences arranged vertically drawing attention to a position, 1, where gaps have been introduced in nearly all sequences and, 2, gaps are not aligned but could be.

At 2 the gaps don’t line up, but they could.

Fixing errors

A possible alignment of six DNA sequences arranged vertically drawing attention to a position 1 where gaps have been removed to improve alignment at position 2.

This fixes both problems.

Rating alignments

This can be a bit of an art.
.left[You need to:]

balance the number of gaps with number of mismatches
notice when there are insertions / deletions that don’t make sense, where there are alternative arrangements that do

Pair-wise Alignment

.center[Dynamic Programming Approach]

Sequence Similarity

To do any kind of comparison we need a distance or similarity measure.
Without it we can’t say whether, e.g., these two species are more similar than those two.

Sequence Dissimilarity
- Two main methods first: Hamming and p-distances, counting the number or proportion of differences between sequences
Edit cost
- a measure of the amount of evolutionary “work” that has to be done to change one character state into another one
Likelihood
- under a specific model of evolution, what is the probability that we would observe these sequences?

Hamming Distance

The Hamming distance between two sequences is just the number of differences between them.
It makes no distinction between substitutions or insertions/deletions, and not between transitions and transversions.

.image-40[ An alignment of two DNA sequences. Positions that are not homologous are indicated by an asterisk and assigned the value 1. ]

Sequence length: 20

Number of differences: 5

Hamming Distance = 5

P-distance = 5/20 = 0.25 or 25%

Edit Cost

We use a matrix of costs to describe how much evolutionary “work” must be done to convert one character to another.

.pull-left[

This is the Edit cost matrix.

|-|A|C|G|T|-| |——|——————–| |A|0|1|2|1|5| |C|1|0|1|2|5| |G|2|1|0|1|5| |T|1|2|1|0|5| |-|5|5|5|5|n/a| ] .pull-right[

.image-75[ ![An alignment of two DNA sequences. Positions that are not homologous are indicated by an asterisk and assigned the values 1, 2 or 5.](images/03-msa-06-edit-distance.pdf-1.png) ]

Complete edit-cost = 1 + 1 + 2 + 1 + 5 = 10

]

Using edit costs we can describe better the relationships between sequences: for example it’s less common for an A to change to a G in the above than for an A to change to a T, in turn less common than for an A to remain as an A. –

Also note we have introduced a bigger cost for aligning any nucleotide with a gap, reflecting our believe that insertion/deletion events are less common than substitutions.

What price a gap?

There are two basic methods for assigning a cost c to a gap of length g in a sequence.

Linear cost: \(c = -dg\) where \(d\) is the gap open penalty;

Affine cost: \(c = -d - (g-1)e\) where \(e\) is the gap extension penalty.

Typical values are \(d = 10, e = 0.1\).

The affine gap cost method is the most complex method we can use in order to solve the alignment of two sequences quickly.

More complex models prohibit the use of dynamic programming to solve the alignment (and must use heuristics).

Numbers of alignments

For two sequences of length \(x\) and \(y\), there are \(\frac{(x+y)!}{x!\ y!}\) possible alignments.
For three sequences, of length \(x\), \(y\) and \(z\) say, there are \(\frac{(x+y+z)!}{x!\ y!\ z!}\) alignments.
For \(n\) sequences of length 10, this increases rapidly:

n	Hash alignments
2	184756
3	5.55 × 10^12
4	4.71 × 10^21
5	4.83 × 10^31
6	3.64 × 10^42
7	1.45 × 10^54
8	2.38 × 10^66
9	4.94 × 10^85
10	2.35 × 10^92

In general it is not possible - even with really fast computers - to guarantee optimal multiple alignments, even with simple costing schemes. —

Dynamic Programming

.left[Dynamic Programming (DP) is a common method to solve many types of problems, including pairwise sequence alignment. ]

Solves local problems optimally
Amalgamates these into globally optimal complete solutions
“Fast”

DP overview

Dynamic Programming solves problems by breaking them down recursively into (slightly) smaller problems.

Schematic representation of dynamic programming showing how the process solves a problem by breaking it down and finding optimal partial solutions that can be used to infer the full solution.

In terms of sequence alignment, this comes down to basing alignment of two sequences up to positions i and j in terms of the best alignments yet found for the two sequences, up to positions \(i-1,j-10\), \(i,j-1\), and \(i-1,j\).

#DP overview (cont.)

Schematic representation of how alignment of two DNA sequences progresses. Three scenarios are shown: 1. both sequences are identical 2. the sequences differ and a gap is introduced in the top sequence 3. the sequences differ and a gap is introduced in the bottom sequence.

Dynamic Programming alignment

Array two sequences along the top and left sides of a cost matrix.
Fill in the cells of the matrix from top-left to bottom right. At each stage find the minimum cost sub-alignment and add to it:
2.1 Find the maximum score from the previous cells including gap costs;

2.2 Put the result into this new cell;

2.3 Note which cell we chose with a pointer or reference.

.image-25[ Schematic of the dynamic programming alignment described on the slide ]

We add the cost of the best solution to the previous cells (above, to the left and above-left) to the best possible score for this cell.

#Alignment example

We will align two amino acid sequences next:

Input sequences:

.center[ | x | HEAGAWGHEE | |—|—| | y | PAWHEAE | ]

Filling in the cost matrix

Edit costs for these two sequences:

	H	E	A	G	A	W	G	H	E	E
P	-2	-1	-1	-2	-1	-4	-2	-2	-1	-1
A	-2	-1	5	0	5	-3	0	-2	-1	-1
W	-3	-3	-3	-3	-3	15	-3	-3	-3	-3
H	10	0	-2	-2	-2	-3	-2	10	0	0
E	0	6	-1	-3	-1	-3	-3	0	6	6
A	-2	-1	5	0	5	-3	0	-2	-1	-1
E	0	-6	-1	-3	-1	-3	-3	0	6	6

Dynamic Programming: fill in

Cost matrix of possible alignments of the amino acid sequences HEAGAWGHEE on the x-axis and PAWHEAE on the y axis. Arrows indicate the direction taken to reach each score. —

Dynamic Programming: backtrack

Cost matrix of possible alignments of the amino acid sequences HEAGAWGHEE on the x-axis and PAWHEAE on the y axis. The optimal alignment is indicated in red. Arrows indicate the direction taken to reach each score.

Alignments are paths in the table

Cost matrix of possible alignments of the amino acid sequences HEAGAWGHEE on the x-axis and PAWHEAE on the y axis. Arrows indicate the direction taken to reach this alignment. —

Conclusion of simple alignment

Aligned sequences:


\(s\_{1}\)	H	E	A	G	A	W	G	H	E	-	E
\(s\_{2}\)	-	-	P	-	A	W	-	H	E	A	E
	-5	-5	-1	-5	+5	+15	-5	+10	+6	-5	+6

Total cost = 16

Generalising this approach

Without too much effort it is possible to generalise this approach:

Needleman-Wunsch is used for local alignment;

Smith-Waterman is used for global alignment;

longest match can be found by setting high match costs and large negative mis-match and gap costs;

BLAST uses this system once good patching pairs have been found;

Affine gap scores as mentioned, accounting for gap opening and gap extension penalties to differ.

#Properties of DP Pairwise Alignment

Pairwise Sequence Alignment takes an amount of time that is proportional to the number of cells in the table, which is roughly the product of the lengths of the two sequences.

That means it is \(O(nm)\) if the lengths of the sequences are \(n\) and \(m\); usually as these are about the same we can write \(O(n^{2})\): it’s quadratic in sequence length.
The amount of space required is also quadratic in \(n\). There is a linear-space version of the DP method (which is rarely used as it takes longer).
This solution is globally optimal. It will always produce an optimal alignment, though there may be more than one.
With more sequences, the size of the table increase: For \(k\) sequences the algorithm is \(O(n^{k})\): not practical.

Multiple Sequence Alignment: Aligning groups of sequences using heuristics

.center[Aligning multiple sequences] — #Aligning multiple sequences

We cannot check all possible alignments (there are simply too many), so we must use make a compromise.

We will use pairwise alignment (which is easy) and build up a multiple sequence alignment from pairs of sequences.

These heuristic methods are used in Clustal, GCG and others. —

Progressive Alignment

.left[The process is quite simple:]

Align all pairs of sequences using DP (dynamic programming).
Create a distance matrix based on the alignments.
Form a guide tree from the distance matrix. This is not the same as a phylogenetic tree, nor should it be interpreted as one!
Progressively align the pairs of sequences with DP, creating summary (consensus) sequences as we go.

MAFFT, Muscle, T-COFFEE, k-align
- Clustal is not that great to be honest
- Other very good automatic alignment methods exist, such as Muscle, T-Coffee, k-align, di-align.
- My current favourites are Muscle and MAFFT.

MAFFT

“MAFFT offers various multiple alignment strategies. They are classified into three types, (a) the progressive method, (b) the iterative refinement method with the WSP score, and (c) the iterative refinement method using both the WSP and consistency scores. In general, there is a tradeoff between speed and accuracy. The order of speed is a > b > c, whereas the order of accuracy is a < b < c. The results of benchmarks can be seen here. The following are the detailed procedures for the major options of MAFFT.”

Source: MAFFT is available at https://mafft.cbrc.jp/; first paper 10.1093/nar/gkf436

MAFFT algorithms overview

Flow chart giving an overview of algorithms used by the program MAFFT to convert distance matrices into alignments. The flow chart is described in the video recording at 37:52. —

Anolis example

Screenshot from the program SeeView showing a multiple sequence alignment of Anolis species. DNA sequences are aligned vertically and nucleotides are colour coded. Aligned sites can be identified by solid lines of colour that run from top to bottom of the image. Full description included in the video recording at 40:06.

Around site 982-990 there is a 2 bp gap in all sequences.

Anolis example

Here I have lined up these gaps: more substitutions? but makes sense.

Anolis example

Screenshot from the program SeaView showing a multiple sequence alignment of Anolis species. DNA sequences are aligned vertically and nucleotides are colour coded. Aligned sites can be identified by solid lines of colour that run from top to bottom of the image. Full description included in the video recording at 40:06.

We can remove the gap-only sites now. There’s more to do!

What’s Next?

Once your sequences are properly aligned, they can be used for phylogenetic analysis. —

Thank you!

.center[Next: building trees from distances!]

Thank you!

This material is the result of a collaborative work. Thanks to the Galaxy Training Network and all the contributors! page logo

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.

	H	E	A	G	A	W	G	H	E	E
P	-2	-1	-1	-2	-1	-4	-2	-2	-1	-1
A	-2	-1	5	0	5	-3	0	-2	-1	-1
W	-3	-3	-3	-3	-3	15	-3	-3	-3	-3
H	10	0	-2	-2	-2	-3	-2	10	0	0
E	0	6	-1	-3	-1	-3	-3	0	6	6
A	-2	-1	5	0	5	-3	0	-2	-1	-1
E	0	-6	-1	-3	-1	-3	-3	0	6	6

	H	E	A	G	A	W	G	H	E	E
P	-2	-1	-1	-2	-1	-4	-2	-2	-1	-1
A	-2	-1	5	0	5	-3	0	-2	-1	-1
W	-3	-3	-3	-3	-3	15	-3	-3	-3	-3
H	10	0	-2	-2	-2	-3	-2	10	0	0
E	0	6	-1	-3	-1	-3	-3	0	6	6
A	-2	-1	5	0	5	-3	0	-2	-1	-1
E	0	-6	-1	-3	-1	-3	-3	0	6	6

Phylogenetics - Back to Basics - Multiple Sequence Alignment

Contributors

Motivation

Sequence Alignment

an algorithm to do the alignment.

What is an alignment?

Alignments represent homology

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Sequences evolve on a tree

Let’s align these

A good alignment

Although…

Although…

This is nonsensical

Fixing errors

Fixing errors

Fixing errors

Rating alignments

Pair-wise Alignment

Sequence Similarity

Hamming Distance

Edit Cost

What price a gap?

Numbers of alignments

Dynamic Programming

“Fast”

DP overview

In terms of sequence alignment, this comes down to basing alignment of two sequences up to positions i and j in terms of the best alignments yet found for the two sequences, up to positions \(i-1,j-10\), \(i,j-1\), and \(i-1,j\).

Dynamic Programming alignment

Filling in the cost matrix

Dynamic Programming: fill in

Dynamic Programming: backtrack

Alignments are paths in the table

Conclusion of simple alignment

Generalising this approach

Multiple Sequence Alignment: Aligning groups of sequences using heuristics

Progressive Alignment

Progressively align the pairs of sequences with DP, creating summary (consensus) sequences as we go.

MAFFT, Muscle, T-COFFEE, k-align

MAFFT

MAFFT algorithms overview

Anolis example

Anolis example

Here I have lined up these gaps: more substitutions? but makes sense.

Anolis example

What’s Next?

Thank you!

Thank you!

	H	E	A	G	A	W	G	H	E	E
P	-2	-1	-1	-2	-1	-4	-2	-2	-1	-1
A	-2	-1	5	0	5	-3	0	-2	-1	-1
W	-3	-3	-3	-3	-3	15	-3	-3	-3	-3
H	10	0	-2	-2	-2	-3	-2	10	0	0
E	0	6	-1	-3	-1	-3	-3	0	6	6
A	-2	-1	5	0	5	-3	0	-2	-1	-1
E	0	-6	-1	-3	-1	-3	-3	0	6	6