More docs on the ARB website.
See also index of helppages.
Last update on 19. Nov 2015.

    proml

    DISCLAIMER

    This file has been automatically converted from the original documentation for easy use inside the ARB help system. Any differences from the original documentation are unintentional and were caused by the conversion process. When in doubt, please refer to the original documentation!

     

    DOCUMENTATION

    # generated from ../../GDE/PHYLIP/doc/proml.html

    version 3.6
    ProML -- Protein Maximum Likelihood program
    (C)  Copyright  1986-2002  by  the  University of Washington. Written by
    Joseph  Felsenstein.  Permission  is  granted  to  copy  this document
    provided  that no fee is charged for it and that this copyright notice
    is not removed.
    This  program  implements  the  maximum  likelihood method for protein
    amino acid sequences. It uses either the Jones-Taylor-Thornton or
    the  Dayhoff  probability  model  of  change  between amino acids. The
    assumptions of these present models are:
    1. Each position in the sequence evolves independently.
    2. Different lineages evolve independently.
    3. Each  position undergoes substitution at an expected rate which is
         chosen  from  a  series  of  rates  (each  with  a  probability of
         occurrence) which we specify.
    4. All  relevant  positions  are  included  in the sequence, not just
         those  that  have  changed  or  those  that  are "phylogenetically
         informative".
    5. The  probabilities  of change between amino acids are given by the
      model of Jones, Taylor, and Thornton (1992) or by the PAM model of
      Dayhoff (Dayhoff and Eck, 1968; Dayhoff et al., 1979).

    Note  the  assumption  that we are looking at all positions, including
    those  that  have  not changed at all. It is important not to restrict
    attention to some positions based on whether or not they have changed;
    doing that would bias branch lengths by making them too long, and that
    in  turn  would  cause the method to misinterpret the meaning of those
    positions that had changed.
    This  program  uses  a  Hidden  Markov Model (HMM) method of inferring
    different  rates  of evolution at different amino acid positions. This
    was described in a paper by me and Gary Churchill (1996). It allows us
    to  specify  to  the  program that there will be a number of different
    possible   evolutionary   rates,   what  the  prior  probabilities  of
    occurrence of each is, and what the average length of a patch of
    positions all having the same rate is. The rates can also be chosen by
    the program to approximate a Gamma distribution of rates, or a Gamma
    distribution plus a class of invariant positions. The program computes
    the likelihood by summing it over all possible assignments of
    rates  to  positions,  weighting  each  by  its  prior  probability of
    occurrence.
    For  example, if we have used the C and A options (described below) to
    specify  that  there  are three possible rates of evolution, 1.0, 2.4,
    and 0.0, that the prior probabilities of a position having these rates
    are  0.4,  0.3,  and 0.3, and that the average patch length (number of
    consecutive positions with the same rate) is 2.0, the program will sum
    the likelihood over all possibilities, but giving less weight to those
    that  (say)  assign  all  positions  to rate 2.4, or that fail to have
    consecutive positions that have the same rate.
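    The summation over all rate assignments can be sketched in Python as
    follows. This is only an illustration, not the program's own code: the
    per-position likelihoods in site_liks are hypothetical numbers (in a real
    run they come from the tree), and the function names are invented here.
    The transition rule assumes, as described under the A option, that after
    each position a new rate is drawn from the prior mix with probability
    1 / (average patch length). The brute-force function enumerates every
    assignment explicitly and gives the same total as the much faster
    forward recursion.

```python
from itertools import product


def transition_matrix(priors, patch_length):
    """trans[i][j]: probability that the next position has rate j given rate i."""
    lam = 1.0 / patch_length  # chance of drawing a fresh rate after each position
    k = len(priors)
    return [[(1.0 - lam) * (i == j) + lam * priors[j] for j in range(k)]
            for i in range(k)]


def hmm_likelihood(site_liks, priors, patch_length):
    """Forward recursion; site_liks[s][r] is the likelihood of position s at rate r."""
    trans = transition_matrix(priors, patch_length)
    k = len(priors)
    forward = [priors[r] * site_liks[0][r] for r in range(k)]
    for liks in site_liks[1:]:
        forward = [liks[j] * sum(forward[i] * trans[i][j] for i in range(k))
                   for j in range(k)]
    return sum(forward)


def brute_force_likelihood(site_liks, priors, patch_length):
    """Same sum, written out over every assignment of rates to positions."""
    trans = transition_matrix(priors, patch_length)
    k = len(priors)
    total = 0.0
    for path in product(range(k), repeat=len(site_liks)):
        p = priors[path[0]] * site_liks[0][path[0]]
        for s in range(1, len(path)):
            p *= trans[path[s - 1]][path[s]] * site_liks[s][path[s]]
        total += p
    return total


# Priors and patch length from the example in the text (rates 1.0, 2.4, 0.0).
priors = [0.4, 0.3, 0.3]
# Hypothetical per-position likelihoods under each of the three rates.
site_liks = [[0.020, 0.012, 0.001],
             [0.015, 0.018, 0.000],
             [0.030, 0.025, 0.040],
             [0.010, 0.014, 0.000]]
print(hmm_likelihood(site_liks, priors, 2.0))
print(brute_force_likelihood(site_liks, priors, 2.0))
```

    The forward recursion takes time proportional to the number of positions,
    whereas the explicit enumeration grows exponentially with it.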
    The  Hidden  Markov Model framework for rate variation among positions
    was  independently  developed  by  Yang  (1993,  1994,  1995). We have
    implemented  a  general  scheme for a Hidden Markov Model of rates; we
    allow  the  rates  and  their  prior  probabilities  to  be  specified
    arbitrarily  by  the  user,  or by a discrete approximation to a Gamma
    distribution  of  rates  (Yang,  1995),  or  by  a  mixture of a Gamma
    distribution and a class of invariant positions.
    This  feature  effectively  removes the artificial assumption that all
    positions  have the same rate, and also means that we need not know in
    advance the identities of the positions that have a particular rate of
    evolution.
    Another layer of rate variation is also available. The user can assign
    categories of rates to each position (for example, we might want
    amino acid positions in the active site of a protein to change more
    slowly than other positions). This is done with the categories input
    file  and  the C option. We then specify (using the menu) the relative
    rates   of   evolution  of  amino  acid  positions  in  the  different
    categories. For example, we might specify that positions in the active
    site  evolve  at  relative  rates  of  0.2  compared  to  1.0 at other
    positions.  If  we are assuming that a particular position maintains a
    cysteine  bridge  to  another,  we may want to put it in a category of
    positions  (including  perhaps  the  initial  position  of the protein
    sequence which maintains methionine) which changes at a rate of 0.0.
    If  both  user-assigned  rate categories and Hidden Markov Model rates
    are allowed, the program assumes that the actual rate at a position is
    the  product  of the user-assigned category rate and the Hidden Markov
    Model  regional  rate.  (This  may  not always make perfect biological
    sense:  it  would  be  more  natural to assume some upper bound to the
    rate,  as  we  have discussed in the Felsenstein and Churchill paper).
    Nevertheless you may want to use both types of rate variation.

    INPUT FORMAT AND OPTIONS

    Subject  to  these  assumptions,  the  program  is  a  correct maximum
    likelihood method. The input is fairly standard, with one addition. As
    usual  the  first line of the file gives the number of species and the
    number of amino acid positions.
    Next  come the species data. Each sequence starts on a new line, has a
    ten-character  species  name  that  must be blank-filled to be of that
    length,  followed  immediately  by  the species data in the one-letter
    amino  acid code. The sequences must either be in the "interleaved" or
    "sequential"  formats  described  in  the  Molecular Sequence Programs
    document.  The  I  option selects between them. The sequences can have
    internal  blanks  in the sequence but there must be no extra blanks at
    the  end  of  the  terminated  line.  Note that a blank is not a valid
    symbol for a deletion.
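    As an illustration of this layout, the following Python sketch writes an
    input file with each sequence on a single line. The function name is
    invented for this example, and the width used for the two numbers on the
    first line is an arbitrary choice.

```python
def format_phylip(sequences):
    """Build input-file text from a list of (name, sequence) pairs."""
    lengths = {len(seq) for _, seq in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    # First line: number of species and number of amino acid positions.
    lines = [f"{len(sequences):5d}{lengths.pop():5d}"]
    for name, seq in sequences:
        if len(name) > 10:
            raise ValueError(f"species name longer than 10 characters: {name}")
        # The name is blank-filled to exactly ten characters; data follow at once.
        lines.append(name.ljust(10) + seq)
    return "\n".join(lines) + "\n"


print(format_phylip([("Alpha", "AACGTGGCCAAAT"),
                     ("Beta", "AAGGTCGCCAAAC")]), end="")
```

    With every sequence on one line, the file contains a single block of
    data lines.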
    The  options  are  selected  using an interactive menu. The menu looks
    like this:

    Amino acid sequence Maximum Likelihood method, version 3.6a3

    Settings for this run:
      U                 Search for best tree?  Yes
      P   JTT or PAM amino acid change model?  Jones-Taylor-Thornton model
      C                One category of sites?  Yes
      R           Rate variation among sites?  constant rate of change
      W                       Sites weighted?  No
      S        Speedier but rougher analysis?  Yes
      G                Global rearrangements?  No
      J   Randomize input order of sequences?  No. Use input order
      O                        Outgroup root?  No, use as outgroup species  1
      M           Analyze multiple data sets?  No
      I          Input sequences interleaved?  Yes
      0   Terminal type (IBM PC, ANSI, none)?  (none)
      1    Print out the data at start of run  No
      2  Print indications of progress of run  Yes
      3                        Print out tree  Yes
      4       Write out trees onto tree file?  Yes
      5   Reconstruct hypothetical sequences?  No
    Y to accept these or type the letter for one to change
    The  user either types "Y" (followed, of course, by a carriage-return)
    if  the  settings  shown  are  to  be accepted, or the letter or digit
    corresponding to an option that is to be changed.
    The  options  U,  W,  J,  O,  M,  and  0  are the usual ones. They are
    described  in the main documentation file of this package. Option I is
    the  same  as in other molecular sequence programs and is described in
    the documentation file for the sequence programs.
    The  P  option toggles between two models of amino acid change. One is
    the  Jones-Taylor-Thornton  model,  the  other  the Dayhoff PAM matrix
    model.  These  are  both based on Margaret Dayhoff's (Dayhoff and Eck,
    1968; Dayhoff et al., 1979) method of empirical tabulation of changes
    of  amino  acid  sequences,  and  conversion of these to a probability
    model  of  amino  acid  change  which  is  used  to  make a transition
    probability  matrix  which  allows  prediction  of  the probability of
    changing  from  any  one  amino  acid  to any other, and also predicts
    equilibrium amino acid composition.
    The default method is that of Jones, Taylor, and Thornton (1992). This
    is  similar  to  the  Dayhoff  PAM model, except that it is based on a
    recounting  of  the number of observed changes in amino acids, using a
    much  larger sample of protein sequences than did Dayhoff. Because its
    sample  is  so  much  larger  this  model  is to be preferred over the
    original  Dayhoff  PAM model. The Dayhoff model uses Dayhoff's PAM 001
    matrix from Dayhoff et al. (1979), page 348.
    The   R  (Hidden  Markov  Model  rates)  option  allows  the  user  to
    approximate  a Gamma distribution of rates among positions, or a Gamma
    distribution  plus  a  class of invariant positions, or to specify how
    many categories of substitution rates there will be in a Hidden Markov
    Model  of rate variation, and what are the rates and probabilities for
    each.  By  repeatedly selecting the R option one toggles among no rate
    variation, the Gamma, Gamma+I, and general HMM possibilities.
    If  you  choose  Gamma  or  Gamma+I the program will ask how many rate
    categories you want. If you have chosen Gamma+I, keep in mind that one
    rate  category  will be set aside for the invariant class and only the
    remaining  ones  used  to  approximate the Gamma distribution. For the
    approximation  we  do  not  use the quantile method of Yang (1995) but
    instead   use   a   quadrature   method   using  generalized  Laguerre
    polynomials.  This  should  give  a  good  approximation  to the Gamma
    distribution with as few as 5 or 6 categories.
    In  the  Gamma and Gamma+I cases, the user will be asked to supply the
    coefficient  of variation of the rate of substitution among positions.
    This  is  different from the parameters used by Nei and Jin (1990) but
    related to them: their parameter a is also known as "alpha", the shape
    parameter  of the Gamma distribution. It is related to the coefficient
    of variation by
    CV = 1 / a^(1/2)
    or
    a = 1 / (CV)^2
    (their  parameter  b  is absorbed here by the requirement that time is
    scaled  so  that  the mean rate of evolution is 1 per unit time, which
    means  that  a  = b). As we consider cases in which the rates are less
    variable  we  should  set  a larger and larger, as CV gets smaller and
    smaller.
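    The conversion between the two parameterizations is a one-liner in
    either direction. The Python helper names below are invented for this
    illustration.

```python
import math


def alpha_from_cv(cv):
    """Gamma shape parameter a for a given coefficient of variation."""
    if cv <= 0:
        raise ValueError("coefficient of variation must be positive")
    return 1.0 / (cv * cv)


def cv_from_alpha(alpha):
    """Coefficient of variation for a given Gamma shape parameter a."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return 1.0 / math.sqrt(alpha)


# Smaller CV (less rate variation) corresponds to larger alpha.
for cv in (2.0, 1.0, 0.5, 0.25):
    print(cv, alpha_from_cv(cv))
```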
    If  the  user  instead chooses the general Hidden Markov Model option,
    they  are  first asked how many HMM rate categories there will be (for
    the  moment  there  is  an  upper  limit  of  9,  which  should not be
    restrictive).  Then  the program asks for the rates for each category.
    These  rates are only meaningful relative to each other, so that rates
    1.0,  2.0,  and  2.4 have the exact same effect as rates 2.0, 4.0, and
    4.8. Note that an HMM rate category can have rate of change 0, so that
    this  allows  us  to take into account that there may be a category of
    amino acid positions that are invariant. Note that the run time of the
    program  will  be  proportional  to the number of HMM rate categories:
    twice  as  many  categories  means  twice  as  long a run. Finally the
    program will ask for the probabilities of a random amino acid position
    falling   into   each   of   these  regional  rate  categories.  These
    probabilities  must  be  nonnegative  and  sum  to  1. Default for the
    program  is  one category, with rate 1.0 and probability 1.0 (actually
    the rate does not matter in that case).
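    One way to see why only relative rates matter is to rescale the rates so
    that their average, weighted by the category probabilities, is 1. The
    Python sketch below does this; it assumes a rescaling by the weighted
    mean, which is how the branch-length scaling is described in the output
    section, and the function name is invented here.

```python
def normalize_rates(rates, probs):
    """Rescale rates so that the probability-weighted mean rate is 1."""
    if len(rates) != len(probs):
        raise ValueError("need one probability per rate")
    if any(r < 0 for r in rates):
        raise ValueError("rates must be nonnegative")
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must be nonnegative and sum to 1")
    mean = sum(r * p for r, p in zip(rates, probs))
    if mean == 0:
        raise ValueError("at least one category with nonzero probability needs a nonzero rate")
    return [r / mean for r in rates]


probs = [0.4, 0.3, 0.3]  # hypothetical category probabilities
print(normalize_rates([1.0, 2.0, 2.4], probs))
print(normalize_rates([2.0, 4.0, 4.8], probs))
```

    Both calls give the same rescaled rates, which is the sense in which the
    two sets of rates have the same effect.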
    If  more than one HMM rate category is specified, then another option,
    A, becomes visible in the menu. This allows us to specify that we want
    to  assume  that  positions  that  have the same HMM rate category are
    expected  to  be  clustered so that there is autocorrelation of rates.
    The program asks for the value of the average patch length. This is an
    expected  length  of  patches that have the same rate. If it is 1, the
    rates  of  successive  positions  will  be independent. If it is, say,
    10.25,  then  the chance of change to a new rate will be 1/10.25 after
    every  position. However the "new rate" is randomly drawn from the mix
    of  rates,  and  hence  could even be the same. So the actual observed
    length  of patches with the same rate will be a bit larger than 10.25.
    Note  below  that  if  you  choose  multiple patches, there will be an
    estimate in the output file as to which combination of rate categories
    contributed most to the likelihood.
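    How much larger the observed patches are can be worked out from the
    description above. If a new rate is drawn with probability
    1 / (patch length) after each position, and the draw returns the current
    rate with its prior probability p, then the chance of actually leaving
    the current rate is (1 - p) / (patch length), so runs of that rate have
    expected length (patch length) / (1 - p). The Python sketch below
    computes this; the function name is invented here.

```python
def expected_run_length(patch_length, prior_prob):
    """Expected observed run length for a rate with the given prior probability."""
    if patch_length < 1:
        raise ValueError("average patch length must be at least 1")
    if not 0 <= prior_prob < 1:
        raise ValueError("prior probability must be in [0, 1)")
    leave = (1.0 - prior_prob) / patch_length  # chance of a real change of rate
    return 1.0 / leave


# Hypothetical prior probabilities for three rate categories.
for p in (0.4, 0.3, 0.3):
    print(p, expected_run_length(10.25, p))
```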
    Note that the autocorrelation scheme we use is somewhat different from
    Yang's  (1995)  autocorrelated Gamma distribution. I am unsure whether
    this  difference  is of any importance -- our scheme is chosen for the
    ease with which it can be implemented.
    The C option allows user-defined rate categories. The user is prompted
    for the number of user-defined rates, and for the rates themselves,
    which cannot be negative but can be zero. These rates
    are defined relative to each other, so
    that  if  rates for three categories are set to 1 : 3 : 2.5 this would
    have  the same meaning as setting them to 2 : 6 : 5. The assignment of
    rates  to  amino  acid  positions is then made by reading a file whose
    default  name  is "categories". It should contain a string of digits 1
    through 9. A new line or a blank can occur after any character in this
    string. Thus the categories file might look like this:

    122231111122411155 1155333333444
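
    A reader for such a file can be sketched in a few lines of Python. This
    is an illustration of the format only, not the program's own parser, and
    the function name is invented here.

```python
def parse_categories(text, n_positions=None):
    """Return the category (1 through 9) of each position, skipping blanks and newlines."""
    categories = []
    for ch in text:
        if ch.isspace():
            continue
        if ch not in "123456789":
            raise ValueError(f"invalid category character: {ch!r}")
        categories.append(int(ch))
    if n_positions is not None and len(categories) != n_positions:
        raise ValueError("number of categories does not match number of positions")
    return categories


cats = parse_categories("122231111122411155 1155333333444")
print(len(cats), cats)
```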

    With the current options R, A, and C the program has a good ability to
    infer  different rates at different positions and estimate phylogenies
    under  a more realistic model. Note that Likelihood Ratio Tests can be
    used  to test whether one combination of rates is significantly better
    than  another,  provided  one  rate scheme represents a restriction of
    another  with  fewer  parameters.  The number of parameters needed for
    rate  variation  is  the  number of regional rate categories, plus the
    number  of  user-defined  rate  categories  less  2,  plus  one if the
    regional rate categories have a nonzero autocorrelation.
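    The parameter count and the test statistic can be written out as a small
    Python sketch. The function names are invented here, and the
    log-likelihood values are hypothetical. The statistic is twice the
    difference in log likelihood, which is compared to a chi-square
    distribution whose degrees of freedom equal the difference in the number
    of parameters.

```python
def n_rate_parameters(n_regional, n_user, autocorrelated):
    """Number of rate-variation parameters, following the rule in the text."""
    return n_regional + n_user - 2 + (1 if autocorrelated else 0)


def lrt_statistic(lnl_general, lnl_restricted):
    """Twice the log-likelihood difference between nested rate schemes."""
    return 2.0 * (lnl_general - lnl_restricted)


# Hypothetical comparison: 3 HMM rates with autocorrelation versus a single rate.
df = n_rate_parameters(3, 1, True) - n_rate_parameters(1, 1, False)
print(df, lrt_statistic(-100.0, -103.0))
```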
    The  G  (global search) option causes, after the last species is added
    to  the  tree,  each  possible  group to be removed and re-added. This
    improves   the   result,  since  the  position  of  every  species  is
    reconsidered. It approximately triples the run-time of the program.
    The  User  tree  (option  U) is read from a file whose default name is
    intree.  The trees can be multifurcating. They must be preceded in the
    file by a line giving the number of trees in the file.
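    A user tree file of this form can be produced with a short Python
    sketch like the one below. The function name and the example trees are
    hypothetical; the trees are assumed to be given as text, one complete
    tree per string.

```python
def format_intree(trees):
    """Return user tree file text: a count line, then one tree per line."""
    if not trees:
        raise ValueError("at least one tree is required")
    return "\n".join([str(len(trees))] + list(trees)) + "\n"


# Two hypothetical trees for the five species of the test data set.
text = format_intree(["(Alpha,Beta,(Gamma,(Delta,Epsilon)));",
                      "(Alpha,Gamma,(Beta,(Delta,Epsilon)));"])
print(text, end="")
```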
    If  the  U  (user tree) option is chosen another option appears in the
    menu,  the L option. If it is selected, it signals the program that it
    should  take  any  branch lengths that are in the user tree and simply
    evaluate  the  likelihood of that tree, without further altering those
    branch  lengths.  This  means  that  if some branches have lengths and
    others  do not, the program will estimate the lengths of those that do
    not  have lengths given in the user tree. Note that the program RETREE
    can be used to add and remove lengths from a tree.
    The  U  option  can read a multifurcating tree. This allows us to test
    the  hypothesis  that a certain branch has zero length (we can also do
    this  by  using RETREE to set the length of that branch to 0.0 when it
    is  present  in  the  tree).  By doing a series of runs with different
    specified  lengths for a branch we can plot a likelihood curve for its
    branch  length  while  allowing  all  other  branches  to adjust their
    lengths  to  it.  If all branches have lengths specified, none of them
    will  be  iterated. This is useful to allow a tree produced by another
    method  to  have  its likelihood evaluated. The L option has no effect
    and does not appear in the menu if the U option is not used.
    The  W (Weights) option is invoked in the usual way, with only weights
    0  and  1  allowed.  It  selects  a  set  of positions to be analyzed,
    ignoring  the  others. The positions selected are those with weight 1.
    If  the  W  option  is  not  invoked,  all positions are analyzed. The
    Weights (W) option takes the weights from a file whose default name is
    "weights".  The  weights  follow  the  format  described  in  the main
    documentation file.
    The M (multiple data sets) option will ask you whether you want to use
    multiple sets of weights (from the weights file) or multiple data sets
    from  the  input  file.  The  ability  to  use  a single data set with
    multiple weights means that much less disk space will be used for this
    input  data.  The  bootstrapping  and jackknifing tool Seqboot has the
    ability to create a weights file with multiple weights. Note also that
    when  we  use  multiple  weights  for  bootstrapping  we can also then
    maintain  different  rate  categories  for  different  positions  in a
    meaningful way. If you use the multiple data sets option
    without using multiple weights, you should not at the same time use
    the user-defined rate categories option (option C).
    The algorithm used for searching among trees uses a technique invented
    by  David  Swofford and J. S. Rogers. This involves not iterating most
    branch lengths on most trees when searching among tree topologies.
    This  is  of  necessity  a  "quick-and-dirty" search but it saves much
    time. There is a menu option (option S) which can turn off this search
    and  revert to the earlier search method which iterated branch lengths
    in  all topologies. This will be substantially slower but will also be
    a  bit more likely to find the tree topology of highest likelihood. If
    the  Swofford/Rogers  search  finds the best tree topology, the branch
    lengths  inferred  will  be almost precisely the same as they would be
    with  the more thorough search, as the maximization of likelihood with
    respect  to  branch lengths for the final tree is not different in the
    two kinds of search.

    OUTPUT FORMAT

    The  output  starts  by giving the number of species and the number of
    amino acid positions.
    If  the  R (HMM rates) option is used a table of the relative rates of
    expected  substitution  at  each  category of positions is printed, as
    well as the probabilities of each of those rates.
    There  then  follow  the  data sequences, if the user has selected the
    menu option to print them, with the sequences printed in groups of ten
    amino  acids. The trees found are printed as an unrooted tree topology
    (possibly  rooted by outgroup if so requested). The internal nodes are
    numbered  arbitrarily  for  the  sake of identification. The number of
    trees  evaluated  so  far  and the log likelihood of the tree are also
    given.  Note  that  the  trees  printed out have a trifurcation at the
    base.  The  branch  lengths in the diagram are roughly proportional to
    the  estimated  branch  lengths,  except  that very short branches are
    printed   out  at  least  three  characters  in  length  so  that  the
    connections  can  be  seen.  The unit of branch length is the expected
    fraction of amino acids changed (so that 1.0 is 100 PAMs).
    A  table  is printed showing the length of each tree segment (in units
    of  expected amino acid substitutions per position), as well as (very)
    rough  confidence  limits  on  their lengths. If a confidence limit is
    negative, this indicates that rearrangement of the tree in that region
    is  not  excluded, while if both limits are positive, rearrangement is
    still  not  necessarily  excluded  because the variance calculation on
    which  the  confidence  limits  are based results in an underestimate,
    which makes the confidence limits too narrow.
    In  addition  to  the  confidence limits, the program performs a crude
    Likelihood  Ratio  Test (LRT) for each branch of the tree. The program
    computes  the ratio of likelihoods with and without this branch length
    forced to zero length. This is done by comparing the likelihoods, changing
    only  that  branch length. A truly correct LRT would force that branch
    length  to  zero  and also allow the other branch lengths to adjust to
    that.  The  result  would be a likelihood ratio closer to 1. Therefore
    the present LRT will err on the side of being too significant. YOU ARE
    WARNED  AGAINST  TAKING  IT TOO SERIOUSLY. If you want to get a better
    likelihood  curve  for  a  branch length you can do multiple runs with
    different  prespecified lengths for that branch, as discussed above in
    the discussion of the L option.
    One   should   also   realize  that  if  you  are  looking  not  at  a
    previously-chosen branch but at all branches, you are seeing the
    results  of  multiple  tests.  With 20 tests, one is expected to reach
    significance  at  the  P  =  .05  level  purely  by chance. You should
    therefore use a much more conservative significance level, such as .05
    divided  by  the  number  of tests. The significance of these tests is
    shown  by  printing  asterisks next to the confidence interval on each
    branch  length.  It  is  important  to  keep  in  mind  that  both the
    confidence  limits  and  the tests are very rough and approximate, and
    probably  indicate  more  significance than they should. Nevertheless,
    maximum  likelihood  is  one  of the few methods that can give you any
    indication  of  its  own error; most other methods simply fail to warn
    the  user  that  there  is  any  error!  (In fact, whole philosophical
    schools  of  taxonomists exist whose main point seems to be that there
    isn't any error, that the "most parsimonious" tree is the best tree by
    definition and that's that).
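    The more conservative significance level suggested above is simple to
    compute. The Python sketch below uses an invented function name and, as
    an assumption for the example, counts one test per branch of an unrooted
    bifurcating tree, which has 2n - 3 branches for n species.

```python
def corrected_level(n_tests, overall_level=0.05):
    """Per-test significance level: the overall level divided by the number of tests."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return overall_level / n_tests


n_species = 5
n_branches = 2 * n_species - 3  # branches of an unrooted bifurcating tree
print(n_branches, corrected_level(n_branches))
```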
    The  log  likelihood  printed  out  with the final tree can be used to
    perform  various likelihood ratio tests. One can, for example, compare
    runs  with  different  values  of  the  relative rate of change in the
    active site and in the rest of the protein to determine which value is
    the  maximum  likelihood  estimate, and what is the allowable range of
    values  (using  a likelihood ratio test, which you will find described
    in  mathematical  statistics  books). One could also estimate the base
    frequencies  in  the same way. Both of these, particularly the latter,
    require  multiple  runs  of the program to evaluate different possible
    values, and this might get expensive.
    If  the  U  (User  Tree)  option  is  used  and  more than one tree is
    supplied,  and  the  program  is  not  told  to assume autocorrelation
    between  the rates at different amino acid positions, the program also
    performs  a  statistical  test  of each of these trees against the one
    with highest likelihood. If there are two user trees, the test done is
    one  which  is due to Kishino and Hasegawa (1989), a version of a test
    originally  introduced  by Templeton (1983). In this implementation it
    uses  the  mean  and  variance  of  log-likelihood differences between
    trees,  taken across amino acid positions. If the two trees' means are
    more  than  1.96  standard  deviations  different  then  the trees are
    declared  significantly  different. This use of the empirical variance
    of  log-likelihood  differences  is more robust and nonparametric than
    the classical likelihood ratio test, and may to some extent compensate
    for any lack of realism in the model underlying this program.
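    The two-tree comparison can be sketched in Python as follows. This is an
    illustration under stated assumptions, not the program's own code: the
    per-position log likelihoods are hypothetical, the function name is
    invented, and the standard deviation of the total difference is taken to
    be the square root of the number of positions times the sample variance
    of the per-position differences.

```python
import math
import statistics


def kh_test(site_lnl_a, site_lnl_b, critical=1.96):
    """Return (total difference, its standard deviation, significant?)."""
    if len(site_lnl_a) != len(site_lnl_b) or len(site_lnl_a) < 2:
        raise ValueError("need two equally long lists with at least 2 positions")
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    total = sum(diffs)
    sd_total = math.sqrt(len(diffs) * statistics.variance(diffs))
    if sd_total == 0.0:
        # All differences identical: significant only if they are nonzero.
        return total, sd_total, total != 0.0
    return total, sd_total, abs(total / sd_total) > critical


# Hypothetical per-position log likelihoods for two user trees.
tree_a = [-2.0, -2.1, -1.9, -2.0, -2.0]
tree_b = [-2.5, -2.5, -2.5, -2.5, -2.5]
print(kh_test(tree_a, tree_b))
```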
    If there are more than two trees, the test done is an extension of the
    KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that
    a  correction  for  the  number  of  trees  was  necessary,  and  they
    introduced a resampling method to make this correction. In the version
    used  here the variances and covariances of the sum of log likelihoods
    across  amino  acid  positions are computed for all pairs of trees. To
    test  whether  the  difference  between  each tree and the best one is
    larger than could have been expected if they all had the same expected
    log-likelihood,  log-likelihoods  for all trees are sampled with these
    covariances   and   equal  means  (Shimodaira  and  Hasegawa's  "least
    favorable hypothesis"), and a P value is computed from the fraction of
    times  the  difference  between  the  tree's  value  and  the  highest
    log-likelihood exceeds that actually observed. Note that this sampling
    needs  random  numbers,  and so the program will prompt the user for a
    random  number  seed  if  one  has not already been supplied. With the
    two-tree KHT test no random numbers are used.
    In either the KHT or the SH test the program prints out a table of the
    log-likelihoods of each tree, the differences of each from the highest
    one, the variance of that quantity as determined by the log-likelihood
    differences  at  individual sites, and a conclusion as to whether that
    tree  is  or is not significantly worse than the best one. However the
    test  is  not  available if we assume that there is autocorrelation of
    rates  at  neighboring  positions  (option A) and is not done in those
    cases.
    The branch lengths printed out are scaled in terms of expected numbers
    of  amino  acid  substitutions,  scaled  so  that  the average rate of
    change, averaged over all the positions analyzed, is set to 1.0 if
    there are multiple categories of positions. This means that whether or
    not  there are multiple categories of positions, the expected fraction
    of  change  for  very small branches is equal to the branch length. Of
    course,  when  a branch is twice as long this does not mean that there
    will  be twice as much net change expected along it, since some of the
    changes  occur  in  the same position and overlie or even reverse each
    other.  The  branch length estimates here are in terms of the expected
    underlying numbers of changes. That means that a branch of length 0.26
    is  26  times  as long as one which would show a 1% difference between
    the  amino  acid sequences at the beginning and end of the branch. But
    we  would  not  expect  the  sequences at the beginning and end of the
    branch  to  be  26%  different,  as  there would be some overlaying of
    changes.
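    The size of this effect can be illustrated with a deliberately
    simplified model. The Python sketch below assumes 20 amino acids that
    all change into one another at equal rates; this is not the
    Jones-Taylor-Thornton or Dayhoff model used by the program, so the
    numbers are only indicative. The function name is invented here.

```python
import math


def expected_fraction_different(branch_length, n_states=20):
    """Expected fraction of differing positions under an equal-rates model."""
    k = n_states
    return (k - 1) / k * (1.0 - math.exp(-k * branch_length / (k - 1)))


# A branch of length 0.26 shows less than 26% difference between its ends.
print(expected_fraction_different(0.26))
# For a very short branch the two quantities nearly coincide.
print(expected_fraction_different(0.001))
```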
    Confidence  limits  on  the branch lengths are also given. Of course a
    negative  value  of the branch length is meaningless, and a confidence
    limit  overlapping  zero  simply  means  that the branch length is not
    necessarily  significantly different from zero. Because of limitations
    of the numerical algorithm, branch length estimates of zero will often
    print out as small numbers such as 0.00001. If you see a branch length
    that small, it is really estimated to be of zero length.
    Another  possible  source  of  confusion  is the existence of negative
    values  for  the log likelihood. This is not really a problem; the log
    likelihood  is  not  a probability but the logarithm of a probability.
    When it is negative it simply means that the corresponding probability
    is  less  than  one  (since  we  are  seeing  its  logarithm). The log
    likelihood  is  maximized by being made more positive: -30.23 is worse
    than -29.14.
    At  the  end of the output, if the R option is in effect with multiple
    HMM  rates,  the program will print a list of what amino acid position
    categories   contributed  the  most  to  the  final  likelihood.  This
    combination  of  HMM  rate  categories  need  not  have  contributed a
    majority  of  the  likelihood,  just  a  plurality.  Still, it will be
    helpful  as  a  view  of  where the program infers that the higher and
    lower rates are. Note that the use in this calculation of the prior
    probabilities  of different rates, and the average patch length, gives
    this  inference  a  "smoothed"  appearance:  some other combination of
    rates  might  make  a  greater  contribution to the likelihood, but be
    discounted  because  it conflicts with this prior information. See the
    example  output  below  to  see  what this printout of rate categories
    looks  like.  A second list will also be printed out, showing for each
    position  which  rate  accounted  for  the  highest  fraction  of  the
    likelihood.  If  the  fraction of the likelihood accounted for is less
    than 95%, a dot is printed instead.
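    The rule for the second list can be sketched in Python as follows. The
    input fractions are hypothetical, the function name is invented, and
    numbering the rate categories from 1 is an assumption made for this
    example.

```python
def rate_summary(fractions, threshold=0.95):
    """One character per position: the best category's number, or a dot."""
    chars = []
    for per_rate in fractions:
        best = max(range(len(per_rate)), key=per_rate.__getitem__)
        # A dot marks positions where no category reaches the threshold.
        chars.append(str(best + 1) if per_rate[best] >= threshold else ".")
    return "".join(chars)


# Hypothetical fractions of the likelihood for 3 positions and 2 rate categories.
print(rate_summary([[0.97, 0.03], [0.60, 0.40], [0.01, 0.99]]))
```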
    Option 3 in the menu controls whether the tree is printed out into the
    output file. This is on by default, and usually you will want to leave
    it  this  way.  However  for  runs  with  multiple  data  sets such as
    bootstrapping  runs,  you  will  primarily  be interested in the trees
    which  are  written  onto  the output tree file, rather than the trees
    printed  on the output file. To keep the output file from becoming too
    large, it may be wisest to use option 3 to prevent trees being printed
    onto the output file.
    Option  4  in  the  menu  controls  whether  the tree estimated by the
    program  is  written onto a tree file. The default name of this output
    tree  file  is  "outtree".  If  the  U  option  is  in effect, all the
    user-defined trees are written to the output tree file.
    Option  5  in the menu controls whether ancestral states are estimated
    at  each  node  in  the tree. If it is in effect, a table of ancestral
    sequences  is  printed out (including the sequences in the tip species
    which  are  the  input  sequences).  The symbol printed out is for the
    amino  acid  which accounts for the largest fraction of the likelihood
    at that position. In that table, if a position has an amino acid which
    accounts for more than 95% of the likelihood, its symbol is printed in
    capital  letters  (W  rather  than  w).  One limitation of the current
    version  of  the  program  is  that  when there are multiple HMM rates
    (option  R) the reconstructed amino acids are based on only the single
    assignment of rates to positions which accounts for the largest amount
    of  the  likelihood.  Thus the assessment of 95% of the likelihood, in
    tabulating  the ancestral states, refers to 95% of the likelihood that
    is accounted for by that particular combination of rates.

    PROGRAM CONSTANTS

    The  constants  defined  at  the  beginning  of  the  program  include
    "maxtrees", the maximum number of user trees that can be processed. It
    is  small (100) at present to save some further memory but the cost of
    increasing   it   is   not   very   great.   Other  constants  include
    "maxcategories",   the   maximum   number   of   position  categories,
    "namelength",  the  length  of  species names in characters, and three
    others,  "smoothings",  "iterations",  and "epsilon", that help "tune"
    the  algorithm  and  define the compromise between execution speed and
    the  quality of the branch lengths found by iteratively maximizing the
    likelihood.   Reducing   iterations  and  smoothings,  and  increasing
    epsilon,  will  result  in  faster execution but a worse result. These
    values will not usually have to be changed.
    The  program  spends  most  of  its  time  doing  real arithmetic. The
    algorithm,  with  separate  and independent computations occurring for
    each pattern, lends itself readily to parallel processing.
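
    As a rough illustration of why per-pattern computation parallelizes
    well, the Python sketch below collapses alignment columns into
    distinct patterns and combines one independent score per pattern.
    It is only a sketch under stated assumptions: the names
    site_patterns, toy_pattern_score and total_score are hypothetical,
    the score is a placeholder rather than a likelihood, and site
    weights and rate categories are ignored.

```python
from collections import Counter


def site_patterns(sequences):
    """Collapse alignment columns into distinct patterns with counts."""
    columns = zip(*sequences)  # one tuple of residues per column
    return Counter("".join(column) for column in columns)


def toy_pattern_score(pattern):
    # Placeholder only: NOT a likelihood. A real program would evaluate
    # the tree likelihood for this pattern here.
    return -float(len(set(pattern)))


def total_score(patterns, score=toy_pattern_score):
    distinct = list(patterns)
    # Each pattern is scored independently of the others, so this map
    # could be replaced by a process pool's map without other changes.
    scores = list(map(score, distinct))
    return sum(patterns[p] * s for p, s in zip(distinct, scores))


sequences = [
    "AACGTGGCCAAAT",  # Alpha
    "AAGGTCGCCAAAC",  # Beta
    "CATTTCGTCACAA",  # Gamma
    "GGTATTTCGGCCT",  # Delta
    "GGGATCTCGGCCC",  # Epsilon
]
patterns = site_patterns(sequences)
print(len(patterns), "distinct patterns for", sum(patterns.values()), "sites")
print(total_score(patterns))
```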

    PAST AND FUTURE OF THE PROGRAM

    This program was derived from DNAML by Lucas Mix for version 3.6. It
    shares many of DNAML's data structures and much of its strategy.
      _________________________________________________________________
    TEST DATA SET
    (Note that although these may look like DNA sequences, they are being
    treated as protein sequences consisting entirely of alanine, cysteine,
    glycine, and threonine.)
       5   13
    Alpha     AACGTGGCCAAAT
    Beta      AAGGTCGCCAAAC
    Gamma     CATTTCGTCACAA
    Delta     GGTATTTCGGCCT
    Epsilon   GGGATCTCGGCCC
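
    For readers who want to inspect such a data set programmatically,
    here is a minimal Python sketch of a reader for this layout. It
    assumes one line per species and a fixed-width name field of 10
    characters (the usual PHYLIP convention, governed by the
    "namelength" constant); the function name parse_phylip_sequential
    is hypothetical, and interleaved input is not handled.

```python
TEST_DATA = """\
   5   13
Alpha     AACGTGGCCAAAT
Beta      AAGGTCGCCAAAC
Gamma     CATTTCGTCACAA
Delta     GGTATTTCGGCCT
Epsilon   GGGATCTCGGCCC
"""


def parse_phylip_sequential(text, name_length=10):
    """Read a one-line-per-species data set into a name -> sequence dict."""
    lines = [line for line in text.splitlines() if line.strip()]
    n_species, n_sites = (int(field) for field in lines[0].split())
    records = {}
    for line in lines[1:1 + n_species]:
        name = line[:name_length].strip()
        # Blanks inside the sequence part are ignored.
        sequence = "".join(line[name_length:].split())
        if len(sequence) != n_sites:
            raise ValueError(f"{name}: expected {n_sites} sites, got {len(sequence)}")
        records[name] = sequence
    if len(records) != n_species:
        raise ValueError(f"expected {n_species} species, got {len(records)}")
    return records


records = parse_phylip_sequential(TEST_DATA)
for name, sequence in records.items():
    print(f"{name:<10}{sequence}")
```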
         _________________________________________________________________
    CONTENTS OF OUTPUT FILE (with all numerical options on)
    (It was run with HMM rates having gamma-distributed rates approximated
    by  5 rate categories, with coefficient of variation of rates 1.0, and
    with  patch  length  parameter = 1.5. Two user-defined rate categories
    were  used,  one  for the first 6 positions, the other for the last 7,
    with  rates  1.0  :  2.0. Weights were used, with sites 1 and 13 given
    weight 0, and all others weight 1.)

    Amino acid sequence Maximum Likelihood method, version 3.6a3

    5 species,  13  sites
    Site categories are:
    1111112222 222
    Sites are weighted as follows:
    0111111111 111

    Jones-Taylor-Thornton model of amino acid change

    Name            Sequences
    ----            ---------
    Alpha        AACGTGGCCA AAT
    Beta         ..G..C.... ..C
    Gamma        C.TT.C.T.. C.A
    Delta        GGTA.TT.GG CC.
    Epsilon      GGGA.CT.GG CCC
    Discrete approximation to gamma distributed rates
     Coefficient of variation of rates = 1.000000  (alpha = 1.000000)

    States in HMM   Rate of change    Probability
           1           0.264            0.522
           2           1.413            0.399
           3           3.596            0.076
           4           7.086            0.0036
           5          12.641            0.000023

    Site category   Rate of change
           1           1.000
           2           2.000

    +Beta
    |
    |                                       +Epsilon
    |         +-----------------------------3
    1---------2                             +-------------------Delta
    |         |
    |         +--------------------------Gamma
    |
    +-----------------Alpha

    remember: this is an unrooted tree!

    Ln Likelihood =  -121.49044
    Between        And            Length      Approx. Confidence Limits
    -------        ---            ------      ------- ---------- ------
        1          Alpha            60.18362     (     zero,   135.65380) **
        1          Beta              0.00010     (     zero,    infinity)
        1             2             32.56292     (     zero,    96.08019) *
        2             3            141.85557     (     zero,   304.10906) **
        3          Epsilon           0.00010     (     zero,    infinity)
        3          Delta            68.68682     (     zero,   151.95402) **
        2          Gamma            89.79037     (     zero,   198.93830) **

    *  = significantly positive, P < 0.05
    ** = significantly positive, P < 0.01

    Combination of categories that contributes the most to the likelihood:

    1122121111 112

    Most probable category at each site if > 0.95 probability ("." otherwise)

    ....1..... ...

    Probable sequences at interior nodes:

    node       Reconstructed sequence (caps if > 0.95)
       1        .AGGTCGCCA AAC
    Beta        AAGGTCGCCA AAC
       2        .AggTCGCCA CAC
       3        .GGATCTCGG CCC
    Epsilon     GGGATCTCGG CCC
    Delta       GGTATTTCGG CCT
    Gamma       CATTTCGTCA CAA
    Alpha       AACGTGGCCA AAT
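
    The rate table printed in the output above can be checked with a
    little arithmetic. The Python sketch below uses the rates and
    probabilities exactly as printed and assumes the usual relation
    between the gamma shape parameter and the coefficient of variation,
    alpha = 1/CV^2; the helper names are hypothetical. Because the
    printed values are rounded, the weighted mean rate comes out only
    approximately equal to 1.

```python
# Values copied from the "States in HMM" table in the output above.
rates = [0.264, 1.413, 3.596, 7.086, 12.641]
probabilities = [0.522, 0.399, 0.076, 0.0036, 0.000023]


def gamma_shape_from_cv(cv):
    """Shape parameter alpha of a gamma distribution with the given CV."""
    return 1.0 / cv ** 2


def weighted_mean(values, weights):
    """Mean of values, weighted by (possibly unnormalized) weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


alpha = gamma_shape_from_cv(1.0)
mean_rate = weighted_mean(rates, probabilities)
print(f"alpha = {alpha:.6f}")
print(f"sum of probabilities = {sum(probabilities):.6f}")
print(f"weighted mean rate   = {mean_rate:.4f}")
```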