Last update on 04. May 2016.

    phyml

    DISCLAIMER

    This file has been automatically converted from the original documentation for easy use inside the ARB help system. Differences compared with the original documentation are unintentionally caused by the conversion process. In doubt please refer to the original documentation!

     

    DOCUMENTATION

    # generated from ../../GDE/PHYML/usersguide_phyliplike.html

    PHYML User's guide (PHYLIP-like interface)

    Overview

    PHYML is a program implementing a new method for building phylogenies from DNA and protein sequences using maximum likelihood. Data sets can be analysed under several models of evolution (JC69, K80, F81, F84, HKY85, TN93 and GTR for nucleotides and Dayhoff, JTT, mtREV, WAG, DCMut, RtREV, CpREV, VT, Blosum62 and MtMam for amino acids). A discrete-gamma model (Yang, 1994) is implemented to accommodate rate variation among sites. Invariable sites can also be taken into account. PHYML has been compared to several other programs using extensive simulations. The results indicate that its topological accuracy is at least as high as that of fastDNAml, and that it is much faster.

    The PHYLIP-like interface

    Download the binary files; you can execute PHYML by double-clicking on the "phyml" file or by opening a shell window and typing "phyml" without parameters. The interactive command-line interface is PHYLIP-like. You can change the default value of an option by typing its corresponding character and validate your settings by typing 'Y'. PHYML produces several results files:
      <sequence file name>_phyml_lk.txt : likelihood value(s)
      <sequence file name>_phyml_tree.txt : inferred tree(s)
      <sequence file name>_phyml_stat.txt : detailed execution stats
      <sequence file name>_phyml_boot_trees.txt : bootstrap trees (special case)
      <sequence file name>_phyml_boot_stats.txt : bootstrap statistics (special case)
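    The naming scheme of the result files can be sketched as a small Python helper. This is only an illustration: the function name expected_result_files and its bootstrap flag are hypothetical, and the sketch assumes that the suffix is appended to the complete sequence file name, as the list above suggests.

```python
from pathlib import Path

# Suffixes copied from the list of result files above.
RESULT_SUFFIXES = {
    "likelihood": "_phyml_lk.txt",
    "tree": "_phyml_tree.txt",
    "stats": "_phyml_stat.txt",
    "boot_trees": "_phyml_boot_trees.txt",
    "boot_stats": "_phyml_boot_stats.txt",
}


def expected_result_files(sequence_file, bootstrap=False):
    """Return the result paths expected next to sequence_file.

    Assumption: the suffix is appended to the full file name
    (extension included), e.g. seqs.phy -> seqs.phy_phyml_tree.txt.
    """
    seq = Path(sequence_file)
    wanted = ["likelihood", "tree", "stats"]
    if bootstrap:
        # The bootstrap files are the "special case" entries of the list.
        wanted += ["boot_trees", "boot_stats"]
    return {key: seq.with_name(seq.name + RESULT_SUFFIXES[key]) for key in wanted}


if __name__ == "__main__":
    for kind, path in expected_result_files("data/seqs.phy", bootstrap=True).items():
        print(kind, path)
```

    Such a helper can be used to check, after a run, which of the expected files exist.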
    Here are the possible uses of PHYML :
      One data set, one starting tree
    Standard analysis under a given substitution model; PHYML then returns
    the inferred tree. Moreover, a special option performs a
    non-parametric bootstrap analysis on the original data set. PHYML
    then returns the bootstrap tree with branch lengths and bootstrap
    values, using standard NEWICK format (an option gives the pseudo trees
    in a *_boot_trees.txt file).
      Several data sets, one starting tree
    Several standard analyses start from the same initial tree with
    different data sets, without the bootstrap option.
    The results are given in the order of the data sets.
    This can be used to process multiple genes in a supertree approach.
      One data set, several starting trees
    Several standard analyses of the same data set using different
    starting trees, without the bootstrap option.
    All results are given in the order of the trees. Moreover, the most
    likely tree is provided in the *_best_stat.txt and *_best_tree.txt
    files.
    This should be used to avoid being trapped in local optima and thus
    obtain better trees. Fast parsimony methods can be used to obtain a
    set of starting trees.
      Several data sets, several starting trees
    Several standard runs, where each data set is analysed with the
    corresponding starting tree, without the bootstrap option.
    The results are given in the order of the data sets.
    This can be used to compare the likelihoods of various trees
    on different data sets.

    Options

      Sequences
    The input sequence file is a standard PHYLIP file of aligned DNA or
    amino-acid sequences. It should look like this in interleaved format:
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAG
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGG
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGG
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGG
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGG

    GAAATGGTCAATATTACAAGGT
    GAAATGGTCAACATTAAAAGAT
    GAAATCGTCAATATTAAAAGGT
    GAAATGGTCAATCTTAAAAGGT
    GAAATGGTCAATATTAAAAGGT

    The same data set in sequential format:
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    On  the  first line is the number of taxa, a space, then the number of
    characters for each taxon.
    The  maximum  number of characters in species name MUST not exceed 50.
    Blanks  within  the species name are NOT allowed. However, blanks (one
    or more) MUST appear at the end of each species name.
    In a sequence, three special characters '.', '-', and '?' may be used:
    a  dot  '.'  means the same character as in the first sequence, a dash
    '-'  means  an  alignment  gap  and  a  question  mark  '?'  means  an
    undetermined nucleotide. Sites at which one or more sequences involve
    '-' are NOT excluded from the analysis. Instead, gaps are treated as
    unknown characters (like '?') on the grounds that ''we don't know what
    would be there if something were there'' (J. Felsenstein, PHYLIP
    documentation). Finally, standard ambiguity characters for nucleotides
    are accepted (Table 1). The amino-acid characters are listed in Table 2.
    CAPTION: Table 1 - Nucleotide character coding
     Character  Nucleotide
         A        Adenine
         G        Guanine
         C       Cytosine
         T        Thymine
         U        Uracil
         M        A or C
         R        A or G
         W        A or T
         S        C or G
         Y        C or T
         K        G or T
         B      C or G or T
         D      A or G or T
         H      A or C or T
         V      A or C or G
    N or X or ?   unknown
    CAPTION: Table 2 - Amino-acid character coding
    Character  Amino-acid
        A        Alanine
        R       Arginine
     N or B    Asparagine
        D     Aspartic acid
        C       Cysteine
     Q or Z     Glutamine
        E     Glutamic acid
        G        Glycine
        H       Histidine
        I      Isoleucine
        L        Leucine
        K        Lysine
        M      Methionine
        F     Phenylalanine
        P        Proline
        S        Serine
        T       Threonine
        W      Tryptophan
        Y       Tyrosine
        V        Valine
     X or ?      unknown
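    The format rules above can be sketched as a small reader for one sequential data set. This is a minimal illustration, not PHYML's own parser: the function name read_sequential_phylip is hypothetical, and the sketch assumes that each taxon occupies exactly one line.

```python
def read_sequential_phylip(text):
    """Parse one sequential-format data set into (name, sequence) pairs.

    Checks the rules stated above: the header gives the number of taxa
    and characters, names hold at most 50 characters and no blanks, and
    a dot means "same character as in the first sequence".
    Assumption: one line per taxon.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(x) for x in lines[0].split()[:2])
    rows = lines[1:]
    if len(rows) != n_taxa:
        raise ValueError(f"expected {n_taxa} taxa, found {len(rows)}")

    records = []
    for row in rows:
        # The name ends at the first blank; the rest is the sequence.
        parts = row.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"no blank between name and sequence: {row!r}")
        name, seq = parts[0], "".join(parts[1].split())
        if len(name) > 50:
            raise ValueError(f"name longer than 50 characters: {name!r}")
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")
        if records:
            # Expand dots using the first sequence as the reference.
            first = records[0][1]
            seq = "".join(first[i] if ch == "." else ch for i, ch in enumerate(seq))
        records.append((name, seq))
    return records


if __name__ == "__main__":
    example = "3 8\nTax1   ACGTAC-T\nTax2   ...T.G?.\nTax3   ACGTACGT\n"
    for name, seq in read_sequential_phylip(example):
        print(name, seq)
```

    Gaps ('-') and unknown characters ('?') are left untouched, since both are kept in the analysis.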
      Data type
    This indicates whether the sequence file contains DNA or amino-acid
    sequences. The default choice is to analyse DNA sequences.
      Sequence format
    The  input  sequences  can  be  either  in  interleaved  (default)  or
    sequential format, see "Sequences" above.
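    The relation between the two formats can be illustrated with a short conversion routine. This is a sketch under stated assumptions: the function name interleaved_to_sequential is hypothetical, and continuation blocks are assumed to carry no taxon names, as in the interleaved example shown under "Sequences".

```python
def interleaved_to_sequential(text):
    """Convert one interleaved PHYLIP data set to sequential format.

    Assumption: only the first block starts each line with the taxon
    name; later blocks contain sequence characters only.
    """
    rows = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_chars = (int(x) for x in rows[0].split()[:2])
    body = rows[1:]
    if len(body) < n_taxa or len(body) % n_taxa:
        raise ValueError("number of lines is not a multiple of the number of taxa")

    names, seqs = [], []
    for row in body[:n_taxa]:
        name, chunk = row.split(None, 1)
        names.append(name)
        seqs.append("".join(chunk.split()))

    # Append each continuation line to the taxon at the same position.
    for i, row in enumerate(body[n_taxa:]):
        seqs[i % n_taxa] += "".join(row.split())

    for name, seq in zip(names, seqs):
        if len(seq) != n_chars:
            raise ValueError(f"{name}: expected {n_chars} characters, got {len(seq)}")

    out = [f"{n_taxa} {n_chars}"]
    # At least one blank must follow the name.
    out += [name.ljust(11) + " " + seq for name, seq in zip(names, seqs)]
    return "\n".join(out)


if __name__ == "__main__":
    print(interleaved_to_sequential("2 6\nTax1   ACG\nTax2   AC-\n\nTTT\nGGG\n"))
```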
      Number of data sets
    Multiple data sets are allowed, e.g. to perform bootstrap analysis
    using SEQBOOT (from the PHYLIP package). In this case, the data sets
    are given one after the other, in the formats explained above. For
    example (with three data sets):
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
    5 60
    Tax1        CCATCTCACGGTCGGTACGATACACCTGCTTTTGGCAGGAAATGGTCAATATTACAAGGT
    Tax2        CCATCTCACGGTCAGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAACATTAAAAGAT
    Tax3        CCATCTCCCGCTCAGTAAGATACCCCTGCTGTTGGCGGGAAATCGTCAATATTAAAAGGT
    Tax4        TCATCTCATGGTCAATAAGATACTCCTGCTTTTGGCGGGAAATGGTCAATCTTAAAAGGT
    Tax5        CCATCTCACGGTCGGTAAGATACACCTGCTTTTGGCGGGAAATGGTCAATATTAAAAGGT
      Perform bootstrap and Number of pseudo data sets
    When  there  is  only  one  data  set  you  can  ask PHYML to generate
    bootstrapped  pseudo data sets from this original data set. PHYML then
    returns  the  bootstrap tree with branch lengths and bootstrap values,
    using  standard  NEWICK  format. The "Print pseudo trees" option gives
    the pseudo trees in a *_boot_trees.txt file.
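    For readers unfamiliar with the term, a pseudo data set in a non-parametric bootstrap is obtained by sampling alignment columns with replacement. The following sketch shows this general technique only; it is not PHYML's internal code, and the function names and the seed parameter are hypothetical.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Return one pseudo data set: columns resampled with replacement."""
    n_sites = len(seqs[0])
    # Draw as many column indices as there are sites in the original data.
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in seqs]


def make_pseudo_data_sets(seqs, n_replicates, seed=0):
    """Build n_replicates pseudo data sets from equal-length sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned (equal length)")
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return [bootstrap_alignment(seqs, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    alignment = ["ACGTAC", "AGGTAC", "ACCTAA"]
    for replicate in make_pseudo_data_sets(alignment, 2, seed=1):
        print(replicate)
```

    Every column of a pseudo data set is a column of the original alignment, and each pseudo data set has the same number of sites as the original.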
      Substitution model
    A  nucleotide or amino-acid substitution model. For DNA sequences, the
    default  choice  is  HKY85  (Hasegawa  et  al.,  1985).  This model is
    analogous  to  K80  (Kimura,  1980),  but  allows  for  different base
    frequencies.  The  other models are JC69 (Jukes and Cantor, 1969), K80
    (Kimura, 1980), F81 (Felsenstein, 1981), F84 (Felsenstein, 1989), TN93
    (Tamura and Nei, 1993) and GTR (e.g., Lanave et al. 1984, Tavaré 1986,
    Rodriguez et al. 1990). The rate matrices of these models are given in
    Swofford et al. (1996).
    It   is   also  possible  to  specify  a  custom  substitution  model,
    considering that six substitution rate parameters and four equilibrium
    frequencies   define  time-reversible  DNA  substitution  models.  The
    substitution rates are defined by a string of six digits :
    digit 1 digit 2 digit 3 digit 4 digit 5 digit 6
    A<->C   A<->G   A<->T   C<->G   C<->T   G<->T
    000000  defines  a  model  where  the six relative rate parameters are
    equal  :  this  corresponds  to  the  JC69  model  if  the equilibrium
    frequencies are equal (0.25), or the F81 model if they are different.
    010010  corresponds  to  a  model  where the A<->G and C<->T rates are
    optimised  independently  of  the  other  parameters : this is the K80
    model if base frequencies are equal (0.25), or the HKY85 model if they
    are different. 010020 is the TN93 model. 012345 is the GTR model. This
    notation is very concise and makes it possible to define a wide range
    of models in a comprehensive framework.
    For amino-acid sequences, the default choice is JTT (Jones, Taylor and
    Thornton, 1992). The other models are Dayhoff (Dayhoff et al., 1978),
    mtREV (as implemented in Yang's PAML), WAG (Whelan and Goldman, 2001),
    DCMut (Kosiol and Goldman, 2005), RtREV (Dimmic et al., 2002), CpREV
    (Adachi et al., 2000), VT (Muller and Vingron, 2000), Blosum62
    (Henikoff and Henikoff, 1992) and MtMam (Cao, 1998).
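    The six-digit notation for custom nucleotide models described above can be decoded with a few lines of Python. This is an illustrative sketch: the names rate_classes and model_name are hypothetical, and only the four codes named in the text are recognised.

```python
# Order of the six substitution types, as given in the text.
PAIRS = ("A<->C", "A<->G", "A<->T", "C<->G", "C<->T", "G<->T")

# Codes named in the text: (name with equal base frequencies, name otherwise).
KNOWN_CODES = {
    "000000": ("JC69", "F81"),
    "010010": ("K80", "HKY85"),
    "010020": ("TN93", "TN93"),
    "012345": ("GTR", "GTR"),
}


def rate_classes(code):
    """Group the six substitution types by the digit assigned to them."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("the model code must be a string of six digits")
    classes = {}
    for pair, digit in zip(PAIRS, code):
        # Substitution types sharing a digit share one rate parameter.
        classes.setdefault(digit, []).append(pair)
    return classes


def model_name(code, equal_frequencies):
    """Return the model name given in the text, or 'custom' otherwise."""
    rate_classes(code)  # validates the code
    if code not in KNOWN_CODES:
        return "custom"
    return KNOWN_CODES[code][0 if equal_frequencies else 1]


if __name__ == "__main__":
    print(rate_classes("010010"))
    print(model_name("010010", equal_frequencies=False))
```

    For 010010, the transitions A<->G and C<->T share one rate class and the four transversions share the other.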
      Base frequency estimates
    Under most of the nucleotide-based models (except JC69 and K80), base
    frequencies can be estimated from the data (empirical) or adjusted so
    as to maximise the likelihood (ML). The latter makes the program
    slower. Comparing the results obtained under the two options might be
    useful  when  analysing sequences that correspond to concatenations of
    several genes with different nucleotide compositions.
      Transition / transversion ratio
    With  DNA sequences, it is possible to set the transition/transversion
    ratio, except for the JC69 and F81 models, or to estimate its value by
    maximising the likelihood of the phylogeny. The latter makes the
    program  slower.  The  default  value  is  4.0.  The definition of the
    transition/transversion  ratio is the same as in PAML (Yang, 1994). In
    PHYLIP,  the  ''transition/transversion  rate ratio'' is used instead.
    4.0 in PHYML roughly corresponds to 2.0 in PHYLIP.
      Proportion of invariable sites
    The  default  is  to  consider  that  the  data  set  does not contain
    invariable  sites  (0.0).  However,  this proportion can be set to any
    value  in  the  0.0-1.0 range. This parameter can also be estimated by
    maximising the likelihood of the phylogeny. The latter makes the
    program slower.
      Number of substitution rate categories
    The  default  is having all the sites evolving at the same rate, hence
    having  one  substitution rate category. A discrete-gamma distribution
    can be used to account for variable substitution rates among sites, in
    which  case the number of categories that defines this distribution is
    supplied by the user. The higher this number, the better the
    discrete distribution fits the continuous one. When a gamma
    distribution is used, the default is four categories; in this case the
    likelihood of the phylogeny at one site is averaged over four
    conditional likelihoods corresponding to four rates, and the
    computation of the likelihood is four times slower than with a single
    rate. Fewer than four or more than eight categories are not
    recommended. In the first case, the discrete distribution is a poor
    approximation of the continuous one. In the second case, the
    computational burden becomes high and a higher number of categories is
    not likely to enhance the accuracy of phylogeny estimation.
      Gamma distribution parameter
    The  shape  of  a  gamma  distribution  is  defined  by this numerical
    parameter.   The   higher  its  value,  the  lower  the  variation  of
    substitution  rates  among sites (this option is used when having more
    than  1  substitution  rate  category).  The  default value is 1.0. It
    corresponds to a moderate variation. Values less than about 0.7
    correspond to high variation. Values between 0.7 and 1.5 correspond
    to moderate variation. Higher values correspond to low variation.
    This  value  can  be  fixed  by  the user. It can also be estimated by
    maximising the likelihood of the phylogeny.
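    To give a feeling for how the shape parameter and the number of categories interact, the sketch below approximates the rates of a discrete-gamma distribution with equally probable categories, each represented by its mean rate. It is a Monte Carlo approximation written for illustration, not the exact computation of Yang (1994) and not PHYML's code; the function name and the n_samples and seed parameters are hypothetical.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_samples=100_000, seed=0):
    """Approximate the mean rate of each equally probable category.

    The gamma distribution has shape alpha and scale 1/alpha, so its
    mean is 1. Rates are estimated by sampling, which is an
    approximation chosen here to stay within the standard library.
    """
    rng = random.Random(seed)
    size = n_samples // n_categories
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha)
                   for _ in range(size * n_categories))
    # Each consecutive slice of the sorted draws is one category.
    rates = [sum(draws[i * size:(i + 1) * size]) / size
             for i in range(n_categories)]
    # Rescale so that the average rate over categories is exactly 1.
    mean = sum(rates) / n_categories
    return [r / mean for r in rates]


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, [round(r, 3) for r in discrete_gamma_rates(alpha)])
```

    Running the example shows the behaviour described above: the smaller the shape parameter, the wider the spread between the slowest and the fastest category.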
      Starting tree(s)
    Used  as  the starting tree(s) to be refined by the maximum likelihood
    algorithm. The default is to use a BIONJ distance-based tree. It is
    also possible to supply one or several trees, one per line in the
    file, written in the standard parenthesis representation (NEWICK
    format); the branch lengths must be given, and the tree(s) must be
    unrooted. Labels on branches (such as bootstrap proportions) are
    supported. For example, a tree with four taxa named A, B, C, and D,
    with a bootstrap value equal to 90 on its internal branch, should
    look like this:
    (A:0.02,B:0.004,(C:0.1,D:0.04)90:0.05);
    If you give several trees and analyse several data sets, the number
    of trees and the number of data sets must match.
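    Some of the requirements on the starting tree file can be checked before running PHYML. The sketch below is a rough pre-flight check, not a full NEWICK parser: the function names are hypothetical, "unrooted" is taken to mean at least three subtrees at the top level, quoted labels are not handled, and the presence of branch lengths is not verified.

```python
def check_starting_tree(newick):
    """Raise ValueError if the tree string fails a basic sanity check."""
    tree = newick.strip()
    if not tree.endswith(";"):
        raise ValueError("tree must end with ';'")
    depth = 0
    top_level_commas = 0
    for ch in tree:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        elif ch == "," and depth == 1:
            top_level_commas += 1
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    # Assumption: an unrooted tree has three or more top-level subtrees.
    if top_level_commas + 1 < 3:
        raise ValueError("tree looks rooted (fewer than three top-level subtrees)")


def check_tree_file(lines, n_data_sets):
    """Check every tree (one per line) and the tree/data set counts."""
    trees = [ln for ln in lines if ln.strip()]
    for tree in trees:
        check_starting_tree(tree)
    if len(trees) > 1 and n_data_sets > 1 and len(trees) != n_data_sets:
        raise ValueError("number of trees and number of data sets differ")
    return len(trees)


if __name__ == "__main__":
    print(check_tree_file(["(A:0.02,B:0.004,(C:0.1,D:0.04)90:0.05);"], 1))
```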
      Optimise starting tree(s) options
    You can optimise the starting tree(s) in three ways:
      - You can optimise the topology, the branch lengths and rate
        parameters (transition/transversion ratio, proportion of invariant
        sites, gamma distribution parameter).
      - You can keep the topology and optimise the branch lengths and rate
        parameters (it is not possible to optimise the tree topology and
        keep the branch lengths).
      - You can ask for no optimisation; PHYML just returns the likelihood
        of the starting tree(s).

    References

    Z. Yang (1994) J. Mol. Evol. 39, 306-14.
    S. Ota & W.-H. Li (2001) Mol. Biol. Evol. 18, 1983-1992.
    N. Saitou & M. Nei (1987) Mol. Biol. Evol. 4(4), 406-425.
      W. Bruno,   N. D.   Socci,   &   A. L.  Halpern  (2000)  Mol.  Biol.
    Evol. 17, 189-197.
    J. Felsenstein (1989) Cladistics 5, 164-166.
      G. J.   Olsen,   H. Matsuda,   R. Hagstrom,   &  R. Overbeek  (1994)
    CABIOS 10, 41-48.
    N. Goldman (1993) J. Mol. Evol. 36, 182-198.
    M. Kimura (1980) J. Mol. Evol. 16, 111-120.
      T. H.  Jukes  & C. R. Cantor (1969) in Mammalian Protein Metabolism,
    ed. H. N. Munro. (Academic Press, New York) Vol. III, pp. 21-132.
      M. Hasegawa,   H. Kishino,   &   T. Yano   (1985)   J.  Mol.  Evol.
    22, 160-174.
    J. Felsenstein (1981) J. Mol. Evol. 17, 368-376.
    David L. Swofford, Gary J. Olsen, Peter J. Waddell, & David M. Hillis
    (1996) in Molecular Systematics, eds. David M. Hillis, Craig Moritz, &
    Barbara K. Mable. (Sinauer Associates, Inc., Sunderland,
    Massachusetts, USA).
    K. Tamura & M. Nei (1993) Mol. Biol. Evol. 10, 512-526.
      Lanave C, Preparata G., Saccone C. and Serio G.. (1984) A new method
    for   calculating  evolutionary  substitution  rates.  J.  Mol.  Evol.
    20, 86-93.
    Dayhoff, M. O., R. M. Schwartz, and B. C. Orcutt. (1978). A model of
    evolutionary change in proteins. In: Dayhoff, M. O. (ed.) Atlas of
    Protein Sequence and Structure, Vol. 5, Suppl. 3. National Biomedical
    Research Foundation, Washington DC, pp. 345-352.
      Jones,  D.  T.,  W.  R.  Taylor, and J. M. Thornton. 1992. The rapid
    generation of mutation data matrices from protein sequences. CABIOS 8:
    275-282.
    S. Whelan and N. Goldman. (2001). A general empirical model of
    protein evolution derived from multiple protein families using a
    maximum-likelihood approach. Mol. Biol. Evol. 18, 691-699.
    Dimmic M.W., J.S. Rest, D.P. Mindell, and D. Goldstein. 2002.
    rtREV: An amino acid substitution matrix for inference of retrovirus
    and reverse transcriptase phylogeny. Journal of Molecular Evolution
    55: 65-73.
      Adachi,  J.,  P.  Waddell, W. Martin, and M. Hasegawa. 2000. Plastid
    genome  phylogeny  and a model of amino acid substitution for proteins
    encoded by chloroplast DNA. Journal of Molecular Evolution 50:348-358.
      Muller,  T.,  and M. Vingron. 2000. Modeling amino acid replacement.
    Journal of Computational Biology 7:761-776.
      Henikoff,  S.,  and  J.  G.  Henikoff. 1992. Amino acid substitution
    matrices   from   protein  blocks.  Proc.  Natl.  Acad.  Sci.,  U.S.A.
    89:10915-10919.
      Cao,  Y.  et  al.  1998  Conflict  amongst  individual mitochondrial
    proteins  in  resolving  the phylogeny of eutherian orders. Journal of
    Molecular Evolution 15:1600-1611.