version 3.6
DNADIST -- Program to compute distance matrix
from nucleotide sequences
(C) Copyright 1986-2002 by the University of Washington. Written by
Joseph Felsenstein. Permission is granted to copy this document
provided that no fee is charged for it and that this copyright notice
is not removed.
This program uses nucleotide sequences to compute a distance matrix,
under four different models of nucleotide substitution. It can also
compute a table of similarity between the nucleotide sequences. The
distance for each pair of species estimates the total branch length
between the two species, and can be used in the distance matrix
programs FITCH, KITSCH or NEIGHBOR. This is an alternative to use of
the sequence data itself in the maximum likelihood program DNAML or
the parsimony program DNAPARS.
The program reads in nucleotide sequences and writes an output file
containing the distance matrix, or else a table of similarity between
sequences. The four models of nucleotide substitution are those of
Jukes and Cantor (1969), Kimura (1980), the F84 model (Kishino and
Hasegawa, 1989; Felsenstein and Churchill, 1996), and the model
underlying the LogDet distance (Barry and Hartigan, 1987; Lake, 1994;
Steel, 1994; Lockhart et al., 1994). All except the LogDet distance
can be made to allow for unequal rates of substitution at
different sites, as Jin and Nei (1990) did for the Jukes-Cantor model.
The program correctly takes into account a variety of sequence
ambiguities, although in cases where they exist it can be slow.
Jukes and Cantor's (1969) model assumes that there is independent
change at all sites, with equal probability. Whether a base changes is
independent of its identity, and when it changes there is an equal
probability of ending up with each of the other three bases. Thus the
transition probability matrix (this is a technical term from
probability theory and has nothing to do with transitions as opposed
to transversions) for a short period of time dt is:
                 To:    A        G        C        T
                     ---------------------------------
               A  |   1-3a       a        a        a
       From:   G  |     a      1-3a       a        a
               C  |     a        a      1-3a       a
               T  |     a        a        a      1-3a
where a is u dt, the product of the rate of substitution per unit time
(u) and the length dt of the time interval. For longer periods of time
this implies that the probability that two sequences will differ at a
given site is:
      p = \frac{3}{4} \left( 1 - e^{-\frac{4}{3} u t} \right)
and hence that if we observe p, we can compute an estimate of the
branch length ut by inverting this to get
      ut = -\frac{3}{4} \log_e \left( 1 - \frac{4}{3} p \right)
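The two Jukes-Cantor formulas above can be checked against each other
with a short sketch. This is only an illustration of the formulas, not
the program's own code; the function names are made up for this
example.

```python
import math

def jukes_cantor_distance(p):
    """Estimate the branch length ut from the observed fraction p of differing sites."""
    if not 0.0 <= p < 0.75:
        # At p >= 3/4 the argument of the logarithm is zero or negative,
        # so no finite estimate exists.
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

def expected_difference(ut):
    """Forward formula: probability that two sequences differ at a given site."""
    return 0.75 * (1.0 - math.exp(-(4.0 / 3.0) * ut))

d = jukes_cantor_distance(0.1)
print(round(d, 4))                       # a bit larger than 0.1
print(round(expected_difference(d), 4))  # recovers 0.1
```

The estimate is always at least as large as p, because some changes
overlie earlier ones and are not observed.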
The Kimura "2-parameter" model is almost as symmetric as this, but
allows for a difference between transition and transversion rates. Its
transition probability matrix for a short interval of time is:
                 To:     A          G          C          T
                     ------------------------------------------
               A  |   1-a-2b        a          b          b
       From:   G  |     a        1-a-2b        b          b
               C  |     b           b       1-a-2b        a
               T  |     b           b          a       1-a-2b
where a is u dt, the product of the rate of transitions per unit time
(u) and the length dt of the time interval, and b is v dt, the
product of half the rate of transversions (i.e., the rate v of a
specific transversion) and the length dt of the time interval.
The F84 model incorporates different rates of transition and
transversion, and also allows for different frequencies of the four
nucleotides. It is the model which is used in DNAML, the maximum
likelihood nucleotide sequence phylogenies program in this package.
You will find the model described in the document for that program.
The transition probabilities for this model are given by Kishino and
Hasegawa (1989), and further explained in a paper by me and Gary
Churchill (1996).
The LogDet distance allows a fairly general model of substitution. It
computes the distance from the determinant of the empirically observed
matrix of joint probabilities of nucleotides in the two species. An
explanation of it is available in the chapter by Swofford et al.
(1996).
The first three models are closely related. The DNAML model reduces to
Kimura's two-parameter model if we assume that the equilibrium
frequencies of the four bases are equal. The Jukes-Cantor model in
turn is a special case of the Kimura 2-parameter model where a = b.
Thus each model is a special case of the ones that follow it,
Jukes-Cantor being a special case of both of the others.
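The nesting of the Jukes-Cantor model inside the Kimura model can be
seen by building both short-interval matrices and setting a = b. This
is a minimal sketch; the function names are invented for the
illustration, and rows and columns are taken in the order A, G, C, T
used in the tables above.

```python
def kimura_matrix(a, b):
    """Short-interval Kimura 2-parameter matrix, rows and columns ordered A, G, C, T."""
    stay = 1.0 - a - 2.0 * b
    return [
        [stay, a, b, b],
        [a, stay, b, b],
        [b, b, stay, a],
        [b, b, a, stay],
    ]

def jukes_cantor_matrix(a):
    """Short-interval Jukes-Cantor matrix in the same base order."""
    stay = 1.0 - 3.0 * a
    return [[stay if i == j else a for j in range(4)] for i in range(4)]

k = kimura_matrix(0.01, 0.01)   # transitions and transversions equally likely
jc = jukes_cantor_matrix(0.01)

same = all(abs(k[i][j] - jc[i][j]) < 1e-15 for i in range(4) for j in range(4))
print(same)
```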
The Jin and Nei (1990) correction for variation in rate of evolution
from site to site can be adapted to all of the first three models. It
assumes that the rate of substitution varies from site to site
according to a gamma distribution, with a coefficient of variation
that is specified by the user. The user is asked for it when choosing
this option in the menu.
Each distance that is calculated is an estimate, from that particular
pair of species, of the divergence time between those two species. For
the Jukes-Cantor model, the estimate is computed using the formula
for ut given above, as long as the nucleotide symbols in the two
sequences are all either A, C, G, T, U, N, X, ?, or - (the latter four
indicate a deletion or an unknown nucleotide). This estimate is a
maximum likelihood estimate for that model. For the Kimura 2-parameter
model, with only these nucleotide symbols, formulas special to that
model are used instead. These also, in effect, compute the
maximum likelihood estimate for that model. In the Kimura case it
depends on the observed sequences only through the sequence length and
the observed number of transition and transversion differences between
those two sequences. The calculation in that case is a maximum
likelihood estimate and will differ somewhat from the estimate
obtained from the formulas in Kimura's original paper. That formula
was also a maximum likelihood estimate, but with the
transition/transversion ratio estimated empirically, separately for
each pair of sequences. In the present case, one overall preset
transition/transversion ratio is used, which makes the computations
harder but achieves greater consistency between different comparisons.
For the F84 model, or for any of the models where one or both
sequences contain at least one of the other ambiguity codes such as
Y, R, etc., a maximum likelihood calculation is also done using code
which was originally written for DNAML. Its disadvantage is that it is
slow. The resulting distance is in effect a maximum likelihood
estimate of the divergence time (total branch length between) the two
sequences. However the present program will be much faster than
versions earlier than 3.5, because I have speeded up the iterations.
The LogDet model computes the distance from the determinant of the
matrix of co-occurrence of nucleotides in the two species, according
to the formula
      D = -\frac{1}{4} \left( \log_e |F| - \frac{1}{2} \log_e \left( f_A^1 f_C^1 f_G^1 f_T^1 f_A^2 f_C^2 f_G^2 f_T^2 \right) \right)
where F is a matrix whose (i,j) element is the fraction of sites at
which base i occurs in one species and base j occurs in the other.
f_j^i is the fraction of sites at which species i has base j. The
LogDet distance cannot cope with ambiguity codes. It must have
completely defined sequences. One limitation of the LogDet distance is
that it may be infinite sometimes, if there are too many changes
between certain pairs of nucleotides. This can be particularly
noticeable with distances computed from bootstrapped sequences.
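The LogDet formula can be sketched directly from its definition. The
code below is an illustration only, not the program's implementation.
It assumes two aligned sequences containing only A, C, G and T, and it
reports an infinite distance whenever the determinant or any base
frequency is not positive, which is an assumption about how to treat
those cases.

```python
import math

BASES = "ACGT"

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def logdet_distance(seq1, seq2):
    """LogDet distance D; raises ValueError on any symbol other than A, C, G, T."""
    n = len(seq1)
    F = [[0.0] * 4 for _ in range(4)]
    for x, y in zip(seq1, seq2):
        F[BASES.index(x)][BASES.index(y)] += 1.0 / n
    f1 = [sum(row) for row in F]                              # species 1
    f2 = [sum(F[i][j] for i in range(4)) for j in range(4)]   # species 2
    det = determinant(F)
    if det <= 0.0 or min(f1 + f2) <= 0.0:
        return math.inf
    log_freqs = sum(math.log(f) for f in f1 + f2)
    return -0.25 * (math.log(det) - 0.5 * log_freqs)

print(logdet_distance("ACGT" * 5, "ACGT" * 4 + "ACGA"))
```

Identical sequences give a distance of zero, and a pair in which one
base never appears gives an infinite distance in this sketch.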
Note that there is an assumption that we are looking at all sites,
including those that have not changed at all. It is important not to
restrict attention to some sites based on whether or not they have
changed; doing that would bias the distances by making them too large,
and that in turn would cause the distances to misinterpret the meaning
of those sites that had changed.
For all of these distance methods, the program allows us to specify
that "third position" bases have a different rate of substitution than
first and second positions, that introns have a different rate than
exons, and so on. The Categories option which does this allows us to
make up to 9 categories of sites and specify different rates of change
for them.
In addition to the four distance calculations, the program can also
compute a table of similarities between nucleotide sequences. These
values are the fractions of sites identical between the sequences. The
diagonal values are 1.0000. No attempt is made to count similarity of
nonidentical nucleotides, so that no credit is given for having (for
example) different purines at corresponding sites in the two
sequences. This option has been requested by many users, who need it
for descriptive purposes. It is not intended that the table be used
for inferring the tree.
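As an illustration of what one entry of the similarity table
represents, here is a small sketch that counts identical sites. It
compares symbols literally, which is an assumption of the sketch; how
the program itself treats ambiguity codes or deletions when computing
similarity is not covered here. The two sequences are taken from the
test data set of this document.

```python
def similarity(seq1, seq2):
    """Fraction of sites at which the two sequences carry the identical symbol."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(seq1, seq2) if x == y)
    return matches / len(seq1)

alpha = "AACGTGGCCACAT"
beta = "AAGGTCGCCACAC"

print("%.4f" % similarity(alpha, alpha))  # diagonal value, 1.0000
print("%.4f" % similarity(alpha, beta))   # 10 of 13 sites match
```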
INPUT FORMAT AND OPTIONS
Input is fairly standard, with one addition. As usual the first line
of the file gives the number of species and the number of sites.
Next come the species data. Each sequence starts on a new line, has a
ten-character species name that must be blank-filled to be of that
length, followed immediately by the species data in the one-letter
code. The sequences must be in either the "interleaved" or
"sequential" formats described in the Molecular Sequence Programs
document. The I option selects between them. The sequences can have
internal blanks in the sequence but there must be no extra blanks at
the end of the terminated line. Note that a blank is not a valid
symbol for a deletion -- neither is dot (".").
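The layout just described can be illustrated with a small reader for
the sequential format. This is a sketch under stated assumptions, not
the program's own parser: the function name is made up, interleaved
input is not handled, and no check is made that the symbols are valid.

```python
def parse_sequential(text, name_length=10):
    """Read a sequential-format data set and return (n_sites, [(name, sequence), ...])."""
    lines = text.splitlines()
    n_species, n_sites = (int(token) for token in lines[0].split())
    records = []
    i = 1
    for _ in range(n_species):
        # The name occupies exactly the first ten characters, blank-filled.
        name = lines[i][:name_length].strip()
        # Internal blanks in the sequence are dropped.
        seq = "".join(lines[i][name_length:].split())
        i += 1
        # A sequence may continue on following lines until all sites are read.
        while len(seq) < n_sites:
            seq += "".join(lines[i].split())
            i += 1
        records.append((name, seq))
    return n_sites, records

sample = (
    "    5   13\n"
    "Alpha     AACGTGGCCACAT\n"
    "Beta      AAGGTCGCCACAC\n"
    "Gamma     CAGTTCGCCACAA\n"
    "Delta     GAGATTTCCGCCT\n"
    "Epsilon   GAGATCTCCGCCC\n"
)

n_sites, records = parse_sequential(sample)
for name, seq in records:
    print(name, seq)
```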
The options are selected using an interactive menu. The menu looks
like this:
Nucleic acid sequence Distance Matrix program, version 3.6a3

Settings for this run:
  D  Distance (F84, Kimura, Jukes-Cantor, LogDet)?  F84
  G          Gamma distributed rates across sites?  No
  T                 Transition/transversion ratio?  2.0
  C            One category of substitution rates?  Yes
  W                         Use weights for sites?  No
  F                Use empirical base frequencies?  Yes
  L                       Form of distance matrix?  Square
  M                    Analyze multiple data sets?  No
  I                   Input sequences interleaved?  Yes
  0            Terminal type (IBM PC, ANSI, none)?  (none)
  1             Print out the data at start of run  No
  2           Print indications of progress of run  Yes

  Y to accept these or type the letter for one to change
The user either types "Y" (followed, of course, by a carriage-return)
if the settings shown are to be accepted, or the letter or digit
corresponding to an option that is to be changed.
The D option selects one of the four distance methods, or the
similarity table. It toggles among the five methods. The default
method, if none is specified, is the F84 model.
If the G (Gamma distribution) option is selected, the user will be
asked to supply the coefficient of variation of the rate of
substitution among sites. This is different from the parameters used
by Jin and Nei but related to them: their parameter a is also known as
"alpha", the shape parameter of the Gamma distribution. It is related
to the coefficient of variation by
      CV = 1 / \sqrt{a}
(their parameter b is absorbed here by the requirement that time is
scaled so that the mean rate of evolution is 1 per unit time, which
means that a = b). As we consider cases in which the rates are less
variable we should set a larger and larger, as CV gets smaller and
smaller.
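The relation between the coefficient of variation and the shape
parameter a can be turned around to see which value of a a given CV
implies. The sketch below also applies a to the gamma-corrected
Jukes-Cantor distance in its commonly quoted closed form, d = (3/4) a
((1 - 4p/3)^(-1/a) - 1). That closed form is brought in here as an
illustration and is not taken from this document; the function names
are likewise made up.

```python
import math

def alpha_from_cv(cv):
    """Shape parameter a of the gamma distribution implied by CV = 1 / sqrt(a)."""
    if cv <= 0.0:
        raise ValueError("the coefficient of variation must be positive")
    return 1.0 / (cv * cv)

def gamma_jukes_cantor_distance(p, alpha):
    """Gamma-corrected Jukes-Cantor distance (assumed closed form, see the text above)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return 0.75 * alpha * ((1.0 - (4.0 / 3.0) * p) ** (-1.0 / alpha) - 1.0)

for cv in (2.0, 1.0, 0.5, 0.1):
    a = alpha_from_cv(cv)
    print(cv, a, round(gamma_jukes_cantor_distance(0.3, a), 4))
```

As CV shrinks, a grows and the corrected distance approaches the
uncorrected Jukes-Cantor value.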
The F (Frequencies) option appears when the F84 (maximum likelihood)
distance is selected. This distance requires that the program be
provided with the equilibrium frequencies of the four bases A, C, G,
and T (or U). Its default setting is one which may save users much
time. If you want to use the empirical frequencies of the bases,
observed in the input sequences, as the base frequencies, you simply
use the default setting of the F option. These empirical frequencies
are not really the maximum likelihood estimates of the base
frequencies, but they will often be close to those values (what they
are is maximum likelihood estimates under a "star" or "explosion"
phylogeny). If you change the setting of the F option you will be
prompted for the frequencies of the four bases. These must add to 1
and are to be typed on one line separated by blanks, not commas.
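For sequences without ambiguity codes, the empirical frequencies are
simply the proportions of each base over all sequences. The sketch
below counts them for the test data set of this document. It ignores
any symbol other than A, C, G, T and U, which is a simplification made
for the example and not a description of how the program treats
ambiguities.

```python
from collections import Counter

def empirical_base_frequencies(sequences):
    """Proportion of A, C, G and T (U counted as T) over all sequences."""
    counts = Counter()
    for seq in sequences:
        for base in seq.upper():
            if base == "U":
                base = "T"
            if base in "ACGT":
                counts[base] += 1
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

test_data = [
    "AACGTGGCCACAT",
    "AAGGTCGCCACAC",
    "CAGTTCGCCACAA",
    "GAGATTTCCGCCT",
    "GAGATCTCCGCCC",
]

freqs = empirical_base_frequencies(test_data)
for base in "ACGT":
    print(base, "%.5f" % freqs[base])
# A 0.24615
# C 0.36923
# G 0.21538
# T 0.16923
```

These values agree with the empirical base frequencies shown in the
sample output file.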
The T option in this program does not stand for Threshold, but instead
is the Transition/transversion option. The user is prompted for a real
number greater than 0.0, as the expected ratio of transitions to
transversions. Note that this is not the ratio of the first to the
second kinds of events, but the resulting expected ratio of
transitions to transversions. The exact relationship between these two
quantities depends on the frequencies in the base pools. The default
value of the T parameter if you do not use the T option is 2.0.
The C option allows user-defined rate categories. The user is prompted
for the number of user-defined rates, and for the rates themselves,
which cannot be negative but can be zero. These numbers are defined
relative to each other, so that if rates for three categories are set
to 1 : 3 : 2.5 this would have the same meaning as setting them to
2 : 6 : 5. The assignment of
rates to sites is then made by reading a file whose default name is
"categories". It should contain a string of digits 1 through 9. A new
line or a blank can occur after any character in this string. Thus the
categories file might look like this:
122231111122411155 1155333333444
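A sketch of how such a categories string maps sites to rates may help.
The function name and the five rate values are hypothetical, chosen
only so that the example string above (which uses categories 1 through
5) can be decoded; this is not the program's own reader.

```python
def rates_for_sites(categories_text, rates):
    """Map each site to its relative rate using a string of digits 1 through 9."""
    digits = "".join(categories_text.split())   # blanks and new lines are ignored
    if not all(ch in "123456789" for ch in digits):
        raise ValueError("categories must be digits 1 through 9")
    categories = [int(ch) for ch in digits]
    if max(categories) > len(rates):
        raise ValueError("a site refers to a category with no rate defined")
    return [rates[c - 1] for c in categories]

# Hypothetical relative rates for categories 1 to 5.
example_rates = [1.0, 3.0, 2.5, 0.5, 2.0]
site_rates = rates_for_sites("122231111122411155 1155333333444", example_rates)

print(len(site_rates))    # number of sites covered by the string
print(site_rates[:6])
```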
The L option specifies that the output file is to have the distance
matrix in lower triangular form.
The W (Weights) option is invoked in the usual way, with only weights
0 and 1 allowed. It selects a set of sites to be analyzed, ignoring
the others. The sites selected are those with weight 1. If the W
option is not invoked, all sites are analyzed. The Weights (W) option
takes the weights from a file whose default name is "weights". The
weights follow the format described in the main documentation file.
The M (multiple data sets) option will ask you whether you want to use
multiple sets of weights (from the weights file) or multiple data sets
from the input file. The ability to use a single data set with
multiple weights means that much less disk space will be used for this
input data. The bootstrapping and jackknifing tool Seqboot has the
ability to create a weights file with multiple weights. Note also that
when we use multiple weights for bootstrapping we can also then
maintain different rate categories for different sites in a meaningful
way. If you use the multiple data sets option without using
multiple weights, you should not at the same time use the user-defined
rate categories option (option C).
Option 0 is the usual one. It is described in the main
documentation file of this package. Option I is the same as in other
molecular sequence programs and is described in the documentation file
for the sequence programs.
OUTPUT FORMAT
As the distances are computed, the program prints on your screen or
terminal the names of the species in turn, followed by one dot (".")
for each other species for which the distance to that species has been
computed. Thus if there are ten species, the first species name is
printed out, followed by nine dots, then on the next line the next
species name is printed out followed by eight dots, then the next
followed by seven dots, and so on. The pattern of dots should form a
triangle. When the distance matrix has been written out to the output
file, the user is notified of that.
The output file contains on its first line the number of species. The
distance matrix is then printed in standard form, with each species
starting on a new line with the species name, followed by the
distances to the species in order. These continue onto a new line
after every nine distances. If the L option is used, the matrix of
distances is in lower triangular form, so that only the distances to
the other species that precede each species are printed. Otherwise the
distance matrix is square with zero distances on the diagonal. In
general the format of the distance matrix is such that it can serve as
input to any of the distance matrix programs.
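The two layouts can be sketched as follows. The field widths used here
(a five-character species count, ten-character names and eight
characters per distance) are assumptions made for the illustration and
may not match the program's output exactly; the function name is also
made up.

```python
def format_distance_matrix(names, dist, lower_triangular=False, per_line=9):
    """Lay out a distance matrix in square or lower triangular form."""
    lines = ["%5d" % len(names)]
    for i, name in enumerate(names):
        # Lower triangular form keeps only distances to preceding species.
        row = dist[i][:i] if lower_triangular else dist[i]
        fields = ["%8.4f" % d for d in row]
        # Continue onto a new line after every per_line distances.
        chunks = [fields[k:k + per_line] for k in range(0, len(fields), per_line)] or [[]]
        lines.append(name.ljust(10)[:10] + "".join(chunks[0]))
        lines.extend("".join(chunk) for chunk in chunks[1:])
    return "\n".join(lines)

names = ["Alpha", "Beta", "Gamma"]
dist = [
    [0.0000, 0.3039, 0.8575],
    [0.3039, 0.0000, 0.3397],
    [0.8575, 0.3397, 0.0000],
]

print(format_distance_matrix(names, dist))
print(format_distance_matrix(names, dist, lower_triangular=True))
```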
If the option to print out the data is selected, the distance matrix
in the output file is preceded by more complete information on the
input and the menu selections. The output file begins by giving the
number of
species and the number of characters, and the identity of the distance
measure that is being used.
If the C (Categories) option is used a table of the relative rates of
expected substitution at each category of sites is printed, and a
listing of the categories each site is in.
There will then follow the equilibrium frequencies of the four bases.
If the Jukes-Cantor or Kimura distances are used, these will
necessarily be 0.25 : 0.25 : 0.25 : 0.25. The output then shows the
transition/transversion ratio that was specified or used by default.
In the case of the Jukes-Cantor distance this will always be 0.5. The
transition-transversion parameter (as opposed to the ratio) is also
printed out: this is used within the program and can be ignored. There
then follow the data sequences, with the base sequences printed in
groups of ten bases along the lines of the Genbank and EMBL formats.
The distances printed out are scaled in terms of expected numbers of
substitutions, counting both transitions and transversions but not
replacements of a base by itself, and scaled so that the average rate
of change, averaged over all sites analyzed, is set to 1.0 if there
are multiple categories of sites. This means that whether or not there
are multiple categories of sites, the expected fraction of change for
very small branches is equal to the branch length. Of course, when a
branch is twice as long this does not mean that there will be twice as
much net change expected along it, since some of the changes may occur
in the same site and overlie or even reverse each other. The branch
length estimates here are in terms of the expected underlying numbers
of changes. That means that a branch of length 0.26 is 26 times as
long as one which would show a 1% difference between the nucleotide
sequences at the beginning and end of the branch. But we would not
expect the sequences at the beginning and end of the branch to be 26%
different, as there would be some overlaying of changes.
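To put a number on this, the sketch below uses the Jukes-Cantor
formula for p given earlier to convert branch lengths into expected
fractions of differing sites. Choosing the Jukes-Cantor model here is
an assumption made for the illustration; other models give different
values but the same qualitative picture.

```python
import math

def expected_fraction_different(branch_length):
    """Expected fraction of differing sites under Jukes-Cantor for a given branch length."""
    return 0.75 * (1.0 - math.exp(-(4.0 / 3.0) * branch_length))

for length in (0.01, 0.26, 1.0):
    print(length, round(expected_fraction_different(length), 4))
# A branch of length 0.26 gives roughly 0.22, not 0.26.
```

For very short branches the two quantities nearly coincide, while for
long branches the expected difference levels off below 0.75.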
One problem that can arise is that two or more of the species can be
so dissimilar that the distance between them would have to be
infinite, as the likelihood rises indefinitely as the estimated
divergence time increases. For example, with the Jukes-Cantor model,
if the two sequences differ in 75% or more of their positions then the
estimate of divergence time would be infinite. Since there is no way
to represent an infinite distance in the output file, the program
regards this as an error, issues an error message indicating which
pair of species are causing the problem, and stops. It might be that,
had it continued running, it would have also run into the same problem
with other pairs of species. If the Kimura distance is being used
there may be no error message; the program may simply give a large
distance value (it is iterating towards infinity and the value is just
where the iteration stopped). Likewise some maximum likelihood
estimates may also become large for the same reason (the sequences
showing more divergence than is expected even with infinite branch
length). I hope in the future to add more warning messages that would
alert the user to this.
If the similarity table is selected, the table that is produced is not
in a format that can be used as input to the distance matrix programs.
It has a heading, and the species names are also put at the tops of
the columns of the table (or rather, the first 8 characters of each
species name are there, the other two characters omitted to save
space). There is not an option to put the table into a format that can
be read by the distance matrix programs, nor is there one to make it
into a table of fractions of difference by subtracting the similarity
values from 1. This is done deliberately to make it more difficult for
the user to use these values to construct trees. The similarity values
are not corrected for multiple changes, and their use to construct
trees (even after converting them to fractions of difference) would be
wrong, as it would lead to severe conflict between the distant pairs
of sequences and the close pairs of sequences.
PROGRAM CONSTANTS
The constants that are available to be changed by the user at the
beginning of the program include "maxcategories", the maximum number
of site categories, "iterations", which controls the number of times
the program iterates the EM algorithm that is used to do the maximum
likelihood distance, "namelength", the length of species names in
characters, and "epsilon", a parameter which controls the accuracy of
the results of the iterations which estimate the distances. Making
"epsilon" smaller will increase run times but result in more decimal
places of accuracy. This should not be necessary.
The program spends most of its time doing real arithmetic. The
algorithm, with separate and independent computations occurring for
each pattern, lends itself readily to parallel processing.
_________________________________________________________________
TEST DATA SET
    5   13
Alpha     AACGTGGCCACAT
Beta      AAGGTCGCCACAC
Gamma     CAGTTCGCCACAA
Delta     GAGATTTCCGCCT
Epsilon   GAGATCTCCGCCC
_________________________________________________________________
CONTENTS OF OUTPUT FILE (with all numerical options on)
(Note that when the options for displaying the input data are turned
off, the output is in a form suitable for use as an input file in the
distance matrix programs).
Nucleic acid sequence Distance Matrix program, version 3.6a3

 5 species,  13  sites

  F84 Distance

Transition/transversion ratio =   2.000000

Name            Sequences
----            ---------

Alpha        AACGTGGCCA CAT
Beta         AAGGTCGCCA CAC
Gamma        CAGTTCGCCA CAA
Delta        GAGATTTCCG CCT
Epsilon      GAGATCTCCG CCC

Empirical Base Frequencies:

   A       0.24615
   C       0.36923
   G       0.21538
  T(U)     0.16923

Alpha       0.0000  0.3039  0.8575  1.1589  1.5429
Beta        0.3039  0.0000  0.3397  0.9135  0.6197
Gamma       0.8575  0.3397  0.0000  1.6317  1.2937
Delta       1.1589  0.9135  1.6317  0.0000  0.1659
Epsilon     1.5429  0.6197  1.2937  0.1659  0.0000