version 3.6
ProML -- Protein Maximum Likelihood program
(C) Copyright 1986-2002 by the University of Washington. Written by
Joseph Felsenstein. Permission is granted to copy this document
provided that no fee is charged for it and that this copyright notice
is not removed.
This program implements the maximum likelihood method for protein
amino acid sequences. It uses either the Jones-Taylor-Thornton or
the Dayhoff probability model of change between amino acids. The
assumptions of these present models are:
1. Each position in the sequence evolves independently.
2. Different lineages evolve independently.
3. Each position undergoes substitution at an expected rate which is
chosen from a series of rates (each with a probability of
occurrence) which we specify.
4. All relevant positions are included in the sequence, not just
those that have changed or those that are "phylogenetically
informative".
5. The probabilities of change between amino acids are given by the
model of Jones, Taylor, and Thornton (1992) or by the PAM model of
Dayhoff (Dayhoff and Eck, 1968; Dayhoff et al., 1979).
Note the assumption that we are looking at all positions, including
those that have not changed at all. It is important not to restrict
attention to some positions based on whether or not they have changed;
doing that would bias branch lengths by making them too long, and that
in turn would cause the method to misinterpret the meaning of those
positions that had changed.
This program uses a Hidden Markov Model (HMM) method of inferring
different rates of evolution at different amino acid positions. This
was described in a paper by me and Gary Churchill (1996). It allows us
to specify to the program that there will be a number of different
possible evolutionary rates, what the prior probability of occurrence
of each is, and what the average length of a patch of positions all
having the same rate is. The rates can also be chosen by
the program to approximate a Gamma distribution of rates, or a Gamma
distribution plus a class of invariant positions. The program computes
the likelihood by summing it over all possible assignments of
rates to positions, weighting each by its prior probability of
occurrence.
For example, if we have used the R and A options (described below) to
specify that there are three possible rates of evolution, 1.0, 2.4,
and 0.0, that the prior probabilities of a position having these rates
are 0.4, 0.3, and 0.3, and that the average patch length (number of
consecutive positions with the same rate) is 2.0, the program will sum
the likelihood over all possibilities, but giving less weight to those
that (say) assign all positions to rate 2.4, or that fail to have
consecutive positions that have the same rate.
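To make the weighting concrete, here is a small sketch in Python (not
code from PROML; the function and variable names are illustrative) that
computes the prior probability of one particular assignment of rates to
consecutive positions under this patch model, using the rates, prior
probabilities, and patch length of the example above:

  rates  = [1.0, 2.4, 0.0]     # the three rates from the example above
  priors = [0.4, 0.3, 0.3]     # their prior probabilities
  patch  = 2.0                 # average patch length

  def assignment_prior(cats):
      """Prior probability of one assignment of rate categories (0, 1 or 2)
      to consecutive positions, under the patch model sketched above."""
      p = priors[cats[0]]
      for prev, cur in zip(cats, cats[1:]):
          # with probability 1/patch a fresh rate is drawn from the priors;
          # otherwise the previous rate is kept
          p *= (1.0 - 1.0 / patch if cur == prev else 0.0) + priors[cur] / patch
      return p

  print(assignment_prior([0, 0, 0, 0]))   # all positions at rate 1.0
  print(assignment_prior([1, 1, 1, 1]))   # all positions at rate 2.4
  print(assignment_prior([0, 1, 2, 1]))   # a new rate at every position

The program multiplies the prior probability of each such assignment by
the likelihood of the data given those rates and sums over all
assignments (in practice this sum is done efficiently by the Hidden
Markov Model machinery rather than by enumerating every assignment).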
The Hidden Markov Model framework for rate variation among positions
was independently developed by Yang (1993, 1994, 1995). We have
implemented a general scheme for a Hidden Markov Model of rates; we
allow the rates and their prior probabilities to be specified
arbitrarily by the user, or by a discrete approximation to a Gamma
distribution of rates (Yang, 1995), or by a mixture of a Gamma
distribution and a class of invariant positions.
This feature effectively removes the artificial assumption that all
positions have the same rate, and also means that we need not know in
advance the identities of the positions that have a particular rate of
evolution.
Another layer of rate variation also is available. The user can assign
categories of rates to each position (for example, we might want
amino acid positions in the active site of a protein to change more
slowly than other positions). This is done with the categories input
file and the C option. We then specify (using the menu) the relative
rates of evolution of amino acid positions in the different
categories. For example, we might specify that positions in the active
site evolve at relative rates of 0.2 compared to 1.0 at other
positions. If we are assuming that a particular position maintains a
cysteine bridge to another, we may want to put it in a category of
positions (including perhaps the initial position of the protein
sequence which maintains methionine) which changes at a rate of 0.0.
If both user-assigned rate categories and Hidden Markov Model rates
are allowed, the program assumes that the actual rate at a position is
the product of the user-assigned category rate and the Hidden Markov
Model regional rate. (This may not always make perfect biological
sense: it would be more natural to assume some upper bound to the
rate, as we have discussed in the Felsenstein and Churchill paper).
Nevertheless you may want to use both types of rate variation.
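For example, under this scheme a position in a user-assigned category
with relative rate 0.2 that falls in an HMM regional category with rate
2.4 changes at rate 0.2 x 2.4 = 0.48.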
INPUT FORMAT AND OPTIONS
-
Subject to these assumptions, the program is a correct maximum
likelihood method. The input is fairly standard, with one addition. As
usual the first line of the file gives the number of species and the
number of amino acid positions.
-
Next come the species data. Each sequence starts on a new line, has a
ten-character species name that must be blank-filled to be of that
length, followed immediately by the species data in the one-letter
amino acid code. The sequences must either be in the "interleaved" or
"sequential" formats described in the Molecular Sequence Programs
document. The I option selects between them. The sequences can have
internal blanks in the sequence but there must be no extra blanks at
the end of the terminated line. Note that a blank is not a valid
symbol for a deletion.
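-
As an illustration, a short Python sketch (not part of PHYLIP; the file
name is the usual default and the sequences are simply the first two
species of the test data set at the end of this document) that writes a
small sequential-format input file would be:

  # Writes a sequential-format input file: first line gives the numbers of
  # species and positions, then one blank-padded ten-character name per
  # species followed immediately by its sequence.
  seqs = [("Alpha", "AACGTGGCCAAAT"),
          ("Beta",  "AAGGTCGCCAAAC")]
  with open("infile", "w") as f:
      f.write(f"{len(seqs)} {len(seqs[0][1])}\n")
      for name, seq in seqs:
          f.write(f"{name:<10}{seq}\n")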
-
The options are selected using an interactive menu. The menu looks
like this:
Amino acid sequence Maximum Likelihood method, version 3.6a3
Settings for this run:
U Search for best tree? Yes
P JTT or PAM amino acid change model? Jones-Taylor-Thornton model
C One category of sites? Yes
R Rate variation among sites? constant rate of change
W Sites weighted? No
S Speedier but rougher analysis? Yes
G Global rearrangements? No
J Randomize input order of sequences? No. Use input order
O Outgroup root? No, use as outgroup species 1
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type (IBM PC, ANSI, none)? (none)
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
5 Reconstruct hypothetical sequences? No
Y to accept these or type the letter for one to change
-
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit
corresponding to an option that is to be changed.
-
The options U, W, J, O, M, and 0 are the usual ones. They are
described in the main documentation file of this package. Option I is
the same as in other molecular sequence programs and is described in
the documentation file for the sequence programs.
-
The P option toggles between two models of amino acid change. One is
the Jones-Taylor-Thornton model, the other the Dayhoff PAM matrix
model. These are both based on Margaret Dayhoff's (Dayhoff and Eck,
1968; Dayhoff et al., 1979) method of empirically tabulating changes
in amino acid sequences and converting these counts to a probability
model of amino acid change. That model is used to make a transition
probability matrix which predicts the probability of changing from any
one amino acid to any other, and also predicts the equilibrium amino
acid composition.
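-
In outline, such an empirical model yields an instantaneous rate matrix
Q whose rows sum to zero, and the probability of each change over a
branch of length t is the matrix exponential P(t) = exp(Qt). The sketch
below (an illustration in Python, not the program's code) shows the
idea on a made-up two-state "alphabet"; the real JTT and PAM matrices
are 20 x 20 and are built into the program.

  import numpy as np
  from scipy.linalg import expm

  # Toy 2-state rate matrix -- NOT the JTT or PAM matrix -- rows sum to zero.
  Q = np.array([[-1.0,  1.0],
                [ 0.5, -0.5]])
  t = 0.1                   # branch length
  P = expm(Q * t)           # P[i, j]: probability of ending in state j,
  print(P)                  # given a start in state i, after time t

The equilibrium composition predicted by such a model is the stationary
distribution of P(t), which is the same for every t.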
-
The default method is that of Jones, Taylor, and Thornton (1992). This
is similar to the Dayhoff PAM model, except that it is based on a
recounting of the number of observed changes in amino acids, using a
much larger sample of protein sequences than did Dayhoff. Because its
sample is so much larger this model is to be preferred over the
original Dayhoff PAM model. The Dayhoff model uses Dayhoff's PAM 001
matrix from Dayhoff et al. (1979), page 348.
-
The R (Hidden Markov Model rates) option allows the user to
approximate a Gamma distribution of rates among positions, or a Gamma
distribution plus a class of invariant positions, or to specify how
many categories of substitution rates there will be in a Hidden Markov
Model of rate variation, and what the rates and probabilities are for
each. By repeatedly selecting the R option one toggles among no rate
variation, the Gamma, Gamma+I, and general HMM possibilities.
-
If you choose Gamma or Gamma+I the program will ask how many rate
categories you want. If you have chosen Gamma+I, keep in mind that one
rate category will be set aside for the invariant class and only the
remaining ones used to approximate the Gamma distribution. For the
approximation we do not use the quantile method of Yang (1995) but
instead use a quadrature method using generalized Laguerre
polynomials. This should give a good approximation to the Gamma
distribution with as few as 5 or 6 categories.
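-
For those curious about the mechanics, the sketch below (in Python with
SciPy; an illustration rather than the program's own code) shows one
way of building such a quadrature: for a mean-one Gamma distribution
with shape parameter alpha, the category rates are scaled roots of a
generalized Laguerre polynomial and the category probabilities are the
corresponding quadrature weights. With alpha = 1 and 5 categories it
reproduces the rates and probabilities shown in the sample output at
the end of this document.

  import numpy as np
  from scipy.special import roots_genlaguerre, gamma as gamma_function

  def gamma_rate_categories(alpha, ncat):
      """Rates and probabilities approximating a mean-1 Gamma(alpha)
      distribution of rates by generalized Gauss-Laguerre quadrature."""
      nodes, weights = roots_genlaguerre(ncat, alpha - 1.0)
      return nodes / alpha, weights / gamma_function(alpha)

  rates, probs = gamma_rate_categories(alpha=1.0, ncat=5)
  print(rates)         # 0.264, 1.413, 3.596, 7.086, 12.641
  print(probs)         # 0.522, 0.399, 0.076, 0.0036, 0.000023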
-
In the Gamma and Gamma+I cases, the user will be asked to supply the
coefficient of variation of the rate of substitution among positions.
This is different from the parameters used by Nei and Jin (1990) but
related to them: their parameter a is also known as "alpha", the shape
parameter of the Gamma distribution. It is related to the coefficient
of variation by
CV = 1 / a^(1/2)
-
or
a = 1 / (CV)^2
-
(their parameter b is absorbed here by the requirement that time is
scaled so that the mean rate of evolution is 1 per unit time, which
means that a = b). As we consider cases in which the rates are less
variable we should set a larger and larger, as CV gets smaller and
smaller.
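-
For example, a coefficient of variation of 0.5 corresponds to
a = 1/(0.5)^2 = 4, while a coefficient of variation of 1.0 corresponds
to a = 1, which is the value used in the sample run at the end of this
document.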
-
If the user instead chooses the general Hidden Markov Model option,
they are first asked how many HMM rate categories there will be (for
the moment there is an upper limit of 9, which should not be
restrictive). Then the program asks for the rates for each category.
These rates are only meaningful relative to each other, so that rates
1.0, 2.0, and 2.4 have the exact same effect as rates 2.0, 4.0, and
4.8. Note that an HMM rate category can have rate of change 0, so that
this allows us to take into account that there may be a category of
amino acid positions that are invariant. Note that the run time of the
program will be proportional to the number of HMM rate categories:
twice as many categories means twice as long a run. Finally the
program will ask for the probabilities of a random amino acid position
falling into each of these regional rate categories. These
probabilities must be nonnegative and sum to 1. Default for the
program is one category, with rate 1.0 and probability 1.0 (actually
the rate does not matter in that case).
-
If more than one HMM rate category is specified, then another option,
A, becomes visible in the menu. This allows us to specify that we want
to assume that positions that have the same HMM rate category are
expected to be clustered so that there is autocorrelation of rates.
The program asks for the value of the average patch length. This is an
expected length of patches that have the same rate. If it is 1, the
rates of successive positions will be independent. If it is, say,
10.25, then the chance of change to a new rate will be 1/10.25 after
every position. However the "new rate" is randomly drawn from the mix
of rates, and hence could even be the same. So the actual observed
length of patches with the same rate will be a bit larger than 10.25.
Note below that if you choose multiple patches, there will be an
estimate in the output file as to which combination of rate categories
contributed most to the likelihood.
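-
The effect on observed patch lengths can be seen in a small simulation
(a sketch, not the program's code; the priors and patch length are
arbitrary illustrative values):

  import random

  priors = [0.4, 0.3, 0.3]      # probabilities of the rate categories
  patch  = 10.25                # average patch length parameter

  def simulate(npos, seed=1):
      random.seed(seed)
      cats = [random.choices(range(3), priors)[0]]
      for _ in range(npos - 1):
          if random.random() < 1.0 / patch:               # a change event...
              cats.append(random.choices(range(3), priors)[0])
          else:                                           # ...or keep the rate
              cats.append(cats[-1])
      return cats

  cats = simulate(200000)
  patches = 1 + sum(a != b for a, b in zip(cats, cats[1:]))
  print(len(cats) / patches)    # mean observed patch length; larger than 10.25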
-
Note that the autocorrelation scheme we use is somewhat different from
Yang's (1995) autocorrelated Gamma distribution. I am unsure whether
this difference is of any importance -- our scheme is chosen for the
ease with which it can be implemented.
-
The C option allows user-defined rate categories. The user is prompted
for the number of user-defined rates, and for the rates themselves,
which must be nonnegative (some can be zero). These numbers are defined
relative to each other, so that if rates for three categories are set
to 1 : 3 : 2.5 this would have the same meaning as setting them to
2 : 6 : 5. The assignment of
rates to amino acid positions is then made by reading a file whose
default name is "categories". It should contain a string of digits 1
through 9. A new line or a blank can occur after any character in this
string. Thus the categories file might look like this:
122231111122411155
1155333333444
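-
If the file is produced by a script rather than by hand, a sketch along
these lines (hypothetical, not part of PHYLIP) will do; a newline or a
blank may be inserted anywhere in the string:

  # Write a categories file assigning category 2 to positions 7-13 of a
  # 19-position protein and category 1 to all the others.
  cats = "".join("2" if 7 <= pos <= 13 else "1" for pos in range(1, 20))
  with open("categories", "w") as f:
      f.write(cats + "\n")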
-
With the current options R, A, and C the program has a good ability to
infer different rates at different positions and estimate phylogenies
under a more realistic model. Note that Likelihood Ratio Tests can be
used to test whether one combination of rates is significantly better
than another, provided one rate scheme represents a restriction of
another with fewer parameters. The number of parameters needed for
rate variation is the number of regional rate categories, plus the
number of user-defined rate categories less 2, plus one if the
regional rate categories have a nonzero autocorrelation.
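-
For example, a model with three HMM rate categories and two
user-defined categories, with autocorrelated regional rates, uses
3 + 2 - 2 + 1 = 4 parameters for rate variation.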
-
The G (global search) option causes, after the last species is added
to the tree, each possible group to be removed and re-added. This
improves the result, since the position of every species is
reconsidered. It approximately triples the run-time of the program.
-
The User tree (option U) is read from a file whose default name is
"intree". The trees can be multifurcating. They must be preceded in the
file by a line giving the number of trees in the file.
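-
For the test data set at the end of this document, such a file might
contain, for example:

  1
  ((Alpha,Beta),Gamma,(Delta,Epsilon));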
-
If the U (user tree) option is chosen another option appears in the
menu, the L option. If it is selected, it signals the program that it
should take any branch lengths that are in the user tree and simply
evaluate the likelihood of that tree, without further altering those
branch lengths. This means that if some branches have lengths and
others do not, the program will estimate the lengths of those that do
not have lengths given in the user tree. Note that the program RETREE
can be used to add and remove lengths from a tree.
-
The U option can read a multifurcating tree. This allows us to test
the hypothesis that a certain branch has zero length (we can also do
this by using RETREE to set the length of that branch to 0.0 when it
is present in the tree). By doing a series of runs with different
specified lengths for a branch we can plot a likelihood curve for its
branch length while allowing all other branches to adjust their
lengths to it. If all branches have lengths specified, none of them
will be iterated. This is useful to allow a tree produced by another
method to have its likelihood evaluated. The L option has no effect
and does not appear in the menu if the U option is not used.
-
The W (Weights) option is invoked in the usual way, with only weights
0 and 1 allowed. It selects a set of positions to be analyzed,
ignoring the others. The positions selected are those with weight 1.
If the W option is not invoked, all positions are analyzed. The
Weights (W) option takes the weights from a file whose default name is
"weights". The weights follow the format described in the main
documentation file.
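-
For example, a weights file that gives weight 0 to the first of 13
positions and weight 1 to all the others could consist of the single
line:

  0111111111111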
-
The M (multiple data sets) option will ask you whether you want to use
multiple sets of weights (from the weights file) or multiple data sets
from the input file. The ability to use a single data set with
multiple weights means that much less disk space will be used for this
input data. The bootstrapping and jackknifing tool Seqboot has the
ability to create a weights file with multiple weights. Note also that
when we use multiple weights for bootstrapping we can also then
maintain different rate categories for different positions in a
meaningful way. If you use the multiple data sets option without using
multiple weights, you should not at the same time use the user-defined
rate categories option (option C).
-
The algorithm used for searching among trees uses a technique invented
by David Swofford and J. S. Rogers. This involves not iterating most
branch lengths on most trees when searching among tree topologies.
This is of necessity a "quick-and-dirty" search but it saves much
time. There is a menu option (option S) which can turn off this search
and revert to the earlier search method which iterated branch lengths
in all topologies. This will be substantially slower but will also be
a bit more likely to find the tree topology of highest likelihood. If
the Swofford/Rogers search finds the best tree topology, the branch
lengths inferred will be almost precisely the same as they would be
with the more thorough search, as the maximization of likelihood with
respect to branch lengths for the final tree is not different in the
two kinds of search.
OUTPUT FORMAT
-
The output starts by giving the number of species and the number of
amino acid positions.
-
If the R (HMM rates) option is used a table of the relative rates of
expected substitution at each category of positions is printed, as
well as the probabilities of each of those rates.
-
There then follow the data sequences, if the user has selected the
menu option to print them, with the sequences printed in groups of ten
amino acids. The trees found are printed as an unrooted tree topology
(possibly rooted by outgroup if so requested). The internal nodes are
numbered arbitrarily for the sake of identification. The number of
trees evaluated so far and the log likelihood of the tree are also
given. Note that the trees printed out have a trifurcation at the
base. The branch lengths in the diagram are roughly proportional to
the estimated branch lengths, except that very short branches are
printed out at least three characters in length so that the
connections can be seen. The unit of branch length is the expected
fraction of amino acids changed (so that 1.0 is 100 PAMs).
-
A table is printed showing the length of each tree segment (in units
of expected amino acid substitutions per position), as well as (very)
rough confidence limits on their lengths. If a confidence limit is
negative, this indicates that rearrangement of the tree in that region
is not excluded, while if both limits are positive, rearrangement is
still not necessarily excluded because the variance calculation on
which the confidence limits are based results in an underestimate,
which makes the confidence limits too narrow.
-
In addition to the confidence limits, the program performs a crude
Likelihood Ratio Test (LRT) for each branch of the tree. The program
computes the ratio of likelihoods with and without this branch length
forced to zero length. This is done by comparing the likelihoods changing
only that branch length. A truly correct LRT would force that branch
length to zero and also allow the other branch lengths to adjust to
that. The result would be a likelihood ratio closer to 1. Therefore
the present LRT will err on the side of being too significant. YOU ARE
WARNED AGAINST TAKING IT TOO SERIOUSLY. If you want to get a better
likelihood curve for a branch length you can do multiple runs with
different prespecified lengths for that branch, as discussed above in
the discussion of the L option.
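-
For orientation, the classical way of turning such a comparison into a
significance level is to refer twice the difference in log-likelihood
to a chi-square distribution with one degree of freedom. The sketch
below shows only that arithmetic (the first number is the
log-likelihood from the sample output below, the second is made up for
illustration), subject to the caveat above that the program's version
holds the other branch lengths fixed and so tends to overstate
significance:

  from scipy.stats import chi2

  lnL_branch_free = -121.49   # log-likelihood with the branch at its ML length
  lnL_branch_zero = -124.10   # hypothetical value with the branch forced to zero
  statistic = 2.0 * (lnL_branch_free - lnL_branch_zero)
  print(statistic, chi2.sf(statistic, df=1))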
-
One should also realize that if you are looking not at a
previously-chosen branch but at all branches, that you are seeing the
results of multiple tests. With 20 tests, one is expected to reach
significance at the P = .05 level purely by chance. You should
therefore use a much more conservative significance level, such as .05
divided by the number of tests. The significance of these tests is
shown by printing asterisks next to the confidence interval on each
branch length. It is important to keep in mind that both the
confidence limits and the tests are very rough and approximate, and
probably indicate more significance than they should. Nevertheless,
maximum likelihood is one of the few methods that can give you any
indication of its own error; most other methods simply fail to warn
the user that there is any error! (In fact, whole philosophical
schools of taxonomists exist whose main point seems to be that there
isn't any error, that the "most parsimonious" tree is the best tree by
definition and that's that).
-
The log likelihood printed out with the final tree can be used to
perform various likelihood ratio tests. One can, for example, compare
runs with different values of the relative rate of change in the
active site and in the rest of the protein to determine which value is
the maximum likelihood estimate, and what is the allowable range of
values (using a likelihood ratio test, which you will find described
in mathematical statistics books). One could also estimate the amino
acid frequencies in the same way. Both of these, particularly the latter,
require multiple runs of the program to evaluate different possible
values, and this might get expensive.
-
If the U (User Tree) option is used and more than one tree is
supplied, and the program is not told to assume autocorrelation
between the rates at different amino acid positions, the program also
performs a statistical test of each of these trees against the one
with highest likelihood. If there are two user trees, the test done is
one which is due to Kishino and Hasegawa (1989), a version of a test
originally introduced by Templeton (1983). In this implementation it
uses the mean and variance of log-likelihood differences between
trees, taken across amino acid positions. If the two trees' means are
more than 1.96 standard deviations different then the trees are
declared significantly different. This use of the empirical variance
of log-likelihood differences is more robust and nonparametric than
the classical likelihood ratio test, and may to some extent compensate
for any lack of realism in the model underlying this program.
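-
The arithmetic of the two-tree comparison can be sketched as follows
(an illustration in Python, not the program's own code), starting from
the per-position log-likelihoods of the two trees:

  import numpy as np

  def kishino_hasegawa(site_lnl_1, site_lnl_2):
      """Compare two trees from their per-position log-likelihoods:
      the sum of the differences, its standard error estimated from the
      position-to-position variation, and a 1.96-standard-deviation test."""
      d = np.asarray(site_lnl_1) - np.asarray(site_lnl_2)
      total = d.sum()
      se = np.sqrt(len(d) * d.var(ddof=1))
      return total, se, abs(total) > 1.96 * se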
-
If there are more than two trees, the test done is an extension of the
KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that
a correction for the number of trees was necessary, and they
introduced a resampling method to make this correction. In the version
used here the variances and covariances of the sum of log likelihoods
across amino acid positions are computed for all pairs of trees. To
test whether the difference between each tree and the best one is
larger than could have been expected if they all had the same expected
log-likelihood, log-likelihoods for all trees are sampled with these
covariances and equal means (Shimodaira and Hasegawa's "least
favorable hypothesis"), and a P value is computed from the fraction of
times the difference between the tree's value and the highest
log-likelihood exceeds that actually observed. Note that this sampling
needs random numbers, and so the program will prompt the user for a
random number seed if one has not already been supplied. With the
two-tree KHT test no random numbers are used.
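-
A rough sketch of that procedure (an illustration only; the resampling
details in the program may differ) is:

  import numpy as np

  def shimodaira_hasegawa(site_lnl, nsamples=10000, seed=1):
      """site_lnl: array of shape (ntrees, npositions) of per-position
      log-likelihoods.  Returns each tree's observed shortfall from the
      best tree and a resampling P value for it (a simplified sketch)."""
      rng = np.random.default_rng(seed)
      site_lnl = np.asarray(site_lnl)
      ntrees, npos = site_lnl.shape
      totals = site_lnl.sum(axis=1)
      observed = totals.max() - totals          # shortfall from the best tree
      # covariances of the summed log-likelihoods, estimated across positions
      cov = np.cov(site_lnl) * npos
      # sample with these covariances and equal (zero) means -- the
      # "least favorable hypothesis" -- and measure each tree's shortfall
      draws = rng.multivariate_normal(np.zeros(ntrees), cov, size=nsamples)
      shortfall = draws.max(axis=1, keepdims=True) - draws
      pvalues = (shortfall >= observed).mean(axis=0)
      return observed, pvalues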
-
In either the KHT or the SH test the program prints out a table of the
log-likelihoods of each tree, the differences of each from the highest
one, the variance of that quantity as determined by the log-likelihood
differences at individual sites, and a conclusion as to whether that
tree is or is not significantly worse than the best one. However the
test is not available if we assume that there is autocorrelation of
rates at neighboring positions (option A) and is not done in those
cases.
-
The branch lengths printed out are scaled in terms of expected numbers
of amino acid substitutions, scaled so that the average rate of
change, averaged over all the positions analyzed, is set to 1.0 if
there are multiple categories of positions. This means that whether or
not there are multiple categories of positions, the expected fraction
of change for very small branches is equal to the branch length. Of
course, when a branch is twice as long this does not mean that there
will be twice as much net change expected along it, since some of the
changes occur in the same position and overlie or even reverse each
other. The branch length estimates here are in terms of the expected
underlying numbers of changes. That means that a branch of length 0.26
is 26 times as long as one which would show a 1% difference between
the amino acid sequences at the beginning and end of the branch. But
we would not expect the sequences at the beginning and end of the
branch to be 26% different, as there would be some overlaying of
changes.
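-
As a rough indication of the size of this effect: under a simplified
model in which all twenty amino acids are equally frequent and
interchange at equal rates (not the JTT or PAM model, but it shows the
behavior), the expected fraction of differing positions after a branch
of length d is (19/20)(1 - exp(-(20/19)d)), which for d = 0.26 is about
0.23, i.e. roughly 23% difference rather than 26%.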
-
Confidence limits on the branch lengths are also given. Of course a
negative value of the branch length is meaningless, and a confidence
limit overlapping zero simply means that the branch length is not
necessarily significantly different from zero. Because of limitations
of the numerical algorithm, branch length estimates of zero will often
print out as small numbers such as 0.00001. If you see a branch length
that small, it is really estimated to be of zero length.
-
Another possible source of confusion is the existence of negative
values for the log likelihood. This is not really a problem; the log
likelihood is not a probability but the logarithm of a probability.
When it is negative it simply means that the corresponding probability
is less than one (since we are seeing its logarithm). The log
likelihood is maximized by being made more positive: -30.23 is worse
than -29.14.
-
At the end of the output, if the R option is in effect with multiple
HMM rates, the program will print a list of what amino acid position
categories contributed the most to the final likelihood. This
combination of HMM rate categories need not have contributed a
majority of the likelihood, just a plurality. Still, it will be
helpful as a view of where the program infers that the higher and
lower rates are. Note that the use in these calculations of the prior
probabilities of different rates, and the average patch length, gives
this inference a "smoothed" appearance: some other combination of
rates might make a greater contribution to the likelihood, but be
discounted because it conflicts with this prior information. See the
example output below to see what this printout of rate categories
looks like. A second list will also be printed out, showing for each
position which rate accounted for the highest fraction of the
likelihood. If the fraction of the likelihood accounted for is less
than 95%, a dot is printed instead.
-
Option 3 in the menu controls whether the tree is printed out into the
output file. This is on by default, and usually you will want to leave
it this way. However for runs with multiple data sets such as
bootstrapping runs, you will primarily be interested in the trees
which are written onto the output tree file, rather than the trees
printed on the output file. To keep the output file from becoming too
large, it may be wisest to use option 3 to prevent trees being printed
onto the output file.
-
Option 4 in the menu controls whether the tree estimated by the
program is written onto a tree file. The default name of this output
tree file is "outtree". If the U option is in effect, all the
user-defined trees are written to the output tree file.
-
Option 5 in the menu controls whether ancestral states are estimated
at each node in the tree. If it is in effect, a table of ancestral
sequences is printed out (including the sequences in the tip species
which are the input sequences). The symbol printed out is for the
amino acid which accounts for the largest fraction of the likelihood
at that position. In that table, if a position has an amino acid which
accounts for more than 95% of the likelihood, its symbol is printed in
capital letters (W rather than w). One limitation of the current
version of the program is that when there are multiple HMM rates
(option R) the reconstructed amino acids are based on only the single
assignment of rates to positions which accounts for the largest amount
of the likelihood. Thus the assessment of 95% of the likelihood, in
tabulating the ancestral states, refers to 95% of the likelihood that
is accounted for by that particular combination of rates.
PROGRAM CONSTANTS
-
The constants defined at the beginning of the program include
"maxtrees", the maximum number of user trees that can be processed. It
is small (100) at present to save some further memory but the cost of
increasing it is not very great. Other constants include
"maxcategories", the maximum number of position categories,
"namelength", the length of species names in characters, and three
others, "smoothings", "iterations", and "epsilon", that help "tune"
the algorithm and define the compromise between execution speed and
the quality of the branch lengths found by iteratively maximizing the
likelihood. Reducing iterations and smoothings, and increasing
epsilon, will result in faster execution but a worse result. These
values will not usually have to be changed.
-
The program spends most of its time doing real arithmetic. The
algorithm, with separate and independent computations occurring for
each pattern, lends itself readily to parallel processing.
PAST AND FUTURE OF THE PROGRAM
-
This program is derived in version 3.6 by Lucas Mix from DNAML, with
which it shares many of its data structures and much of its strategy.
_________________________________________________________________
-
TEST DATA SET
(Note that although these may look like DNA sequences, they are being
treated as protein sequences consisting entirely of alanine, cysteine,
glycine, and threonine).
5 13
Alpha AACGTGGCCAAAT
Beta AAGGTCGCCAAAC
Gamma CATTTCGTCACAA
Delta GGTATTTCGGCCT
Epsilon GGGATCTCGGCCC
_________________________________________________________________
CONTENTS OF OUTPUT FILE (with all numerical options on)
(It was run with HMM rates having gamma-distributed rates approximated
by 5 rate categories, with coefficient of variation of rates 1.0, and
with patch length parameter = 1.5. Two user-defined rate categories
were used, one for the first 6 positions, the other for the last 7,
with rates 1.0 : 2.0. Weights were used, with sites 1 and 13 given
weight 0, and all others weight 1.)
Amino acid sequence Maximum Likelihood method, version 3.6a3
5 species, 13 sites
-
Site categories are:
1111112222 222
-
Sites are weighted as follows:
0111111111 111
Jones-Taylor-Thornton model of amino acid change
Name Sequences
--- ---------
Alpha AACGTGGCCA AAT
Beta ..G..C.... ..C
Gamma C.TT.C.T.. C.A
Delta GGTA.TT.GG CC.
Epsilon GGGA.CT.GG CCC
Discrete approximation to gamma distributed rates
Coefficient of variation of rates = 1.000000 (alpha = 1.000000)
States in HMM Rate of change Probability
1 0.264 0.522
2 1.413 0.399
3 3.596 0.076
4 7.086 0.0036
5 12.641 0.000023
Site category   Rate of change
    1             1.000
    2             2.000
remember: this is an unrooted tree!
Ln Likelihood = -121.49044
Between And Length Approx. Confidence Limits
------ --- ------ ------- ---------- ------
-
1 Alpha 60.18362 ( zero, 135.65380) **
1 Beta 0.00010 ( zero, infinity)
1 2 32.56292 ( zero, 96.08019) *
2 3 141.85557 ( zero, 304.10906) **
3 Epsilon 0.00010 ( zero, infinity)
3 Delta 68.68682 ( zero, 151.95402) **
2 Gamma 89.79037 ( zero, 198.93830) **
-
* = significantly positive, P < 0.05
** = significantly positive, P < 0.01
Combination of categories that contributes the most to the likelihood:
1122121111 112
Most probable category at each site if > 0.95 probability ("." otherwise)
....1..... ...
Probable sequences at interior nodes:
-
node Reconstructed sequence (caps if > 0.95)
-
1 .AGGTCGCCA AAC
Beta AAGGTCGCCA AAC
2 .AggTCGCCA CAC
3 .GGATCTCGG CCC
Epsilon GGGATCTCGG CCC
Delta GGTATTTCGG CCT
Gamma CATTTCGTCA CAA
Alpha AACGTGGCCA AAT