-
Notifications
You must be signed in to change notification settings - Fork 23
BASEML
The default control file for BASEML
is baseml.ctl
. Please consider the following when generating your control file:
- Spaces are required on both sides of the equal sign
-
Blank lines or lines beginning with
*
are treated as comments. - Options not used can be deleted from the control file.
- The order of the variables is not important.
You can find an example of a control file to execute the PAML
program BASEML
below:
seqfile = brown.nuc * path to sequence data file
outfile = mlb * path to main result file
treefile = brown.trees * path to tree file
noisy = 3 * 0,1,2,3: how much rubbish on the screen
verbose = 0 * 1: detailed output, 0: concise output
runmode = 0 * 0: user tree; 1: semi-automatic; 2: automatic
* 3: StepwiseAddition; (4,5):PerturbationNNI
model = 5 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
* 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu
Mgene = 0 * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
* ndata = 1 * number of data sets
clock = 0 * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
TipDate = 0 100 * TipDate (1) & time unit
fix_kappa = 0 * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
kappa = 2.5 * initial or fixed kappa
fix_alpha = 1 * 0: estimate alpha; 1: fix alpha at value below
alpha = 0. * initial or fixed alpha, 0:infinity (constant rate)
Malpha = 0 * 1: different alpha's for genes, 0: one alpha
ncatG = 5 * # of categories in the dG, AdG, or nparK models of rates
fix_rho = 1 * 0: estimate rho; 1: fix rho at value below
rho = 0. * initial or fixed rho, 0:no correlation
nparK = 0 * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK
nhomo = 0 * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2, 5: user
getSE = 0 * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states
Small_Diff = 1e-6 * small value used in the difference approximation of derivatives
method = 0 * 0: simultaneous; 1: one branch at a time
* cleandata = 1 * remove sites with ambiguity data (1:yes, 0:no)?
* icode = 0 * (RateAncestor=1 for coding genes, "GC" in data)
* fix_blength = 0 * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional
NOTE
This is just an example; you may have to remove some options and/or modify the settings depending on the data that you want to analyse!
The control variables are described below:
-
seqfile
,outfile
, andtreefile
: these variables take the path (absolute/relative) to the input sequence file, the main output file, and the input tree file; respectively. Note that, if the control file is located in the same folder where you have the input files, then you do not need to write an absolute/relative path: you can just type the name of the file. Note that you should not have spaces as part of the file name. If, for any reason, you are using an absolute/relative path to define these variables and you have spaces, make sure you scape them with a\
. In general, try to avoid special characters in a file name as they might have special meanings under the OS. -
noisy
: this variable controls how much output you want on the screen. If the model being fitted involves much computation, you can choose a large number for noisy to avoid loneliness. -
verbose
: this variable controls how much output you want printed in your output file. -
runmode
: when this variable is set torunmode = 0
, evaluation of the tree topologies specified in the input tree file will take place. Ifrunmode = 1
or2
, a heuristic tree search by the star-decomposition algorithm will take place. Withrunmode = 2
, the algorithm starts from the star tree, while ifrunmode = 1
, the program will read a multifurcating tree from the input tree file and try to estimate the best bifurcating tree compatible with it. When usingrunmode = 3
, stepwise addition is enabled. Ifrunmode = 4
is specified, then NNI perturbation with the starting tree obtained by a parsimony algorithm will be used. Lastly, ifrunmode = 5
, NNI perturbation with the starting tree read from the input tree file will take place. The tree search options do not work well, and so userunmode = 0
as much as you can. For relatively small data sets, the stepwise addition algorithm seems usable. -
model
: this variable specifies the model of nucleotide substitution to be used (please see the "Substitution models" markdown file for a summary of how the rate matrices are parameterised inPAML)
. By specifyingmodel = 0
,model = 1
, untilmodel = 8
, users can enable the following models:JC69
(model = 0
),K80
(model = 1
),F81
(model = 2
),F84
(model = 3
),HKY85
(model = 4
),T92
(model = 5
),TN93
(model = 6
),REV
(also known as GTR,model = 7
), andUNREST
(model = 8
), respectively. Check Yang (1994c) or Yang (2006: Table 1.1) for notation (see also the "Substitution models" markdown file). Two more user-specified models are implemented as special cases of theREV
/GTR
model (model = 9
) and theUNREST
model (model = 10
). These models are calledREVu
andUNRESTu
, respectively. The format is illustrated in the following examples:model = 10 [0] /* JC69 */ model = 10 [1 (TC CT AG GA)] /* K80 */ model = 10 [11 (TA) (TG) (CT) (CA) (CG) (AT) (AC) (AG) (GT) (GC) (GA) ] /* UNREST */ model = 10 [5 (AC CA) (AG GA) (AT TA) (CG GC) (CT TC)] /* SYM */ model = 9 [2 (TA TG CA CG) (AG)] /* TN93 */
The number within square brackets
[]
is the number of free rate parameters; this is then followed by nucleotide pairs, which should have the same rates, grouped in the same pair of parentheses. For example, the line for SYM specifies anUNRESTu
model with 5 rates, with one rate for A→C and C→A changes, one rate for A→G and G→A changes, and so on. The model has a symmetrical rate matrixQ
, so that the equilibrium frequencies are$1/4$ for each nucleotide. Note that underUNREST
andUNRESTu
, the equilibrium nucleotide frequencies are not parameters; they are given by the rate parameters: see equation 1.56 in Yang (2006). Similarly,model = 9
specifies special cases of theREV
/GTR
model, orREVu
models. In one specifies TC or CT, but not both of them, since those are assumed to have the same rate (exchangeability) parameters. -
Mgene
andMalpha
: this variable is used in combination with optionG
in the sequence data file (please see the corresponding section in the "Data formatting" markdown file for details on how to specify the format of the input sequence file), for combined analysis of data from multiple genes or multiple site partitions (such as the three codon positions) (Yang 1996b; pages 116-8 in Yang 2006). Such models are also called partition models.The description here refers to multiple genes, but applies to any strategy of partitioning sites, such as by codon position. Similar partition models are implemented in
CODEML
for analysing codon or amino acid sequences. The simplest model, which assumes complete homogeneity among genes, can be fitted by concatenating different genes into one sequence without using the optionG
(and by specifyingMgene = 0
in the control file as well as not having optionG
in the input sequence file). The most general model is equivalent to a separate analysis, and can be done by fitting the same model to each data set (each gene), but can also be done by specifyingMgene = 1
with the optionG
in the combined input sequence file. The sum of the log-likelihood values over different genes is then the log likelihood of the most general model considered here. Models accounting for some aspects of the heterogeneity of multiple genes are fitted by specifyingMgene
in combination with the optionG
in the input sequence file. When specifyingMgene = 0
, the model assumes different substitution rates but the same pattern of nucleotide substitution for different genes. WhenMgene = 2
is specified, different frequency parameters for different genes are assumed but with the same rate ratio parameters ($\kappa$ in theK80
,F84
,HKY85
models or the rate parameters in theTN93
andREV
models). When specifyingMgene = 3
, different rate ratio parameters and the same frequency parameters are assumed. IfMgene = 4
, both different rate ratio parameters and different frequency parameters for different genes are assumed. Parameters and assumptions made in these models are summarised in Table 1 (see below), with theHKY85
model used as an example. When substitution rates are assumed to vary from site to site, the control variableMalpha
specifies whether one gamma distribution will be applied across all sites (Malpha = 0
) or a different gamma distribution is used for each gene (or codon position).Table 1. Setups of partition models with the
HKY85
nucleotide substitution model as an example.Sequence file Control file Parameters across genes No G Mgene = 0
everything equal Option G Mgene = 0
the same $\kappa$ and$\pi$ , but different $c$s (proportional branch lengths)Option G Mgene = 1
different $\kappa$ , $p$s, and different (not proportional) branch lengthsOption G Mgene = 2
the same $\kappa$ , but different $\pi$s and $c$sOption G Mgene = 3
the same $\pi$ , but different $\kappa$s and $c$sOption G Mgene = 4
different $\kappa$ , $p$s, and $c$sThe different $c$s for different genes mean that branch lengths estimated for different genes are proportional. Parameters
$\pi$ represent the equilibrium nucleotide frequencies, which are estimated using the observed frequencies (nhomo = 0
). The transition/transversion rate ratio$\kappa$ in theHKY85
substitution model can be replaced by the two or five rate ratio parameters under theTN93
orREV
models, respectively. The likelihood ratio test can be used to compare these models, using an approach called the analysis of deviance (McCullagh and Nelder 1989), which is very similar to the more familiar analysis of variance. -
ndata
: this variable specifies the number of alignment blocks in the input sequence file. Note that the different alignment blocks in the input sequence file must have the same sequence names (or tags). This variable may be useful for analysing simulated replicate datasets or gene alignments for the same set of species from genomic sequencing projects. For example, you can useevolver
to generate 100 replicate datasets using an input tree file and a set of parameters, and then setndata = 100
in the control file to runBASEML
orCODEML
(depending on using nucleotide, codon, or amino acid data) to analyse them. FromPAML
v4.10, I have added amaintree
option to accommodate the situation where some species are missing for some genes when the sequence data file has alignments for many genes from the same species. The user provides a main tree for all species, andBASEML
orCODEML
will prune the main tree to generate the gene trees, for the species present in each gene alignment. There are now four cases with this option variable, as follows. Please see theREADME.txt
file in the folderexamples/ndata/
for details and examples. You can also check thepositive-selection
PAML
GitHub repository for an example of how to run these four cases with codon data (the workflow would be the same but, instead of runningCODEML
and using codon data, you would have nucleotide data and runBASEML
). Below, you can find a summary of how to specify each case in the control file:-
(case a):
ndata = 100
, one tree block used for all datasets/alignments. -
(case b):
ndata = 100 separate_trees
, each alignment has its own tree block. -
(case c, default):
ndata = 100 maintree
, this uses themaintree
file to generate thegenetrees
/subtrees
and runs the ML analysis. -
(case c, using the third argument
1
, which is the default as shown above):ndata = 100 maintree 1
, this uses themaintree
file to generate thegenetrees
/subtrees
and runs the ML analysis. -
(case d):
ndata = 100 maintree 0
, this generates thegenetrees.trees
file but does not run ML.
-
(case a):
-
clock
: this variable specifies models concerning rate constancy or variation among lineages. When specifyingclock = 0
, no clock and rates are entirely free to vary from branch to branch. An unrooted tree should be used under this model. For a binary tree with$n$ species (sequences), this model has ($2n – 3$ ) parameters (branch lengths). Ifclock = 1
is specified, the global clock is assumed, and all branches have the same rate. This model has ($n – 1$ ) parameters corresponding to the ($n – 1$ ) internal nodes in the binary tree. So a test of the molecular clock assumption, which compares those two models, should have$df = n – 2$ .Under
clock = 2
, the local clock models are enabled (Yoder and Yang 2000; Yang and Yoder 2003). Most branches in the phylogeny conform with the clock assumption and have the default rate ($r_{0} = 1$ ), but some pre-defined branches may have different rates. Rates for branches are specified using branch labels (#
and$
) in the tree file (please check the "Data formatting" markdown file for details on this format). For example, the tree in Newick format(((1,2) #1, 3), 4);
specifies rate$r_{1}$ for the branch ancestral to species 1 and 2 while all other branches have the default rate$r_{0}$ , which does not have to be specified. The user need to specify which branch has which rate, and the program estimates the unknown rates (such as$r_{1}$ in the above example;$r_{0} = 1$ is the default rate). You need to be careful when specifying rates for branches to make sure that all rates can be estimated by the model; if you specify too many rate parameters, especially for branches around the root, it may not be possible to estimate all of them and you will have a problem with identifiability. The number of parameters for a binary tree in the local clock model is ($n – 1$ ) plus the number of extra rate parameters for branches. In the above examples tree with 4 species, you have only one extra rate parameter,$r_{1}$ , and so the local clock model has$(n – 1) + 1 = n = 4$ parameters. The no-clock model has 5 parameters while the global clock has 3 parameters for that tree.The option
clock = 3
is for combined analysis of multiple-gene or multiple-partition data, allowing the branch rates to vary in different ways among the data partitions (Yang and Yoder 2003). To account for differences in the evolutionary process among data partitions, you have to use the optionG
in the input sequence file (please see the corresponding section in the "Data formatting" markdown file for details on how to specify the format of the input sequence file) as well as the control variableMgene
in the control file (i.e.,baseml.ctl
when runningBASEML
orcodeml.ctl
when runningCODEML
; see details for this variable above). You can find more details about this variable in Yang and Yoder (2003) and check the README.txt file in theexamples/MouseLemurs/
folder to duplicate the analysis published in that paper. Also, you can useclock = 5
orclock = 6
to enable the ad hoc rate smoothing procedure of Yang (2004). For details on how to enable this model, please read theREADME2.txt
file in theexamples/MouseLemurs/
folder.For
clock = 1
orclock = 2
, a rooted tree should be used. If fossil calibration information is specified in the tree file using the symbol@
or=
, the absolute rate will be calculated (please check the "Data formatting" markdown file for details on this format). Multiple calibration points can be specified this way, but only point calibrations (where a node age is assumed to be known without error) are accepted and bounds are not accepted. This kind of calibration information (point calibration) is not used by the Bayesian dating programMCMCtree
. Please check theMCMCtree
documentation to learn about when you should use this dating program and what calibrations can be used. -
TipDate
: this variable is used to estimate ages of node ages on the rooted tree when the sequences at the tips having sampling dates, as in the case of sequentially sampled viral sequences. The sample dates are the last field in the sequence name. The time unit is specified by the user on this line. Please check theREADME.txt
file in theexamples/TipDate.HIV2/
folder for more details on when to use this variable and how to use it. -
fix_kappa
andkappa
: this variable specifies whether$\kappa$ inK80
,F84
, orHKY85
is given at a fixed value or is to be estimated by iteration from the data. Iffix_kappa = 1
, the value of another variable,kappa
, is the given value. Otherwise, iffix_kappa = 0
is used in the control file, then the value specified in variablekappa
is used as the initial estimate for iteration. The variablesfix_kappa
andkappa
have no effect withJC69
orF81
which do not involve such a parameter, or withTN93
andREV
which have two and five rate parameters, respectively, when all of them are estimated from the data. The optionfix_kappa = 2
is used together withnhomo = 5
to specify nonstationary models inBASEML
in which the whole rate matrixQ
(both parameter$\kappa$ or the exchangeability parameters and the base frequency parameters) vary among branches. See the description of variablenhomo
below. -
fix_alpa
,alpha
, andncatG
: these variables work in a similar way asfix_kappa
andalpha
. In this case,alpha
refers to the shape parameter$\alpha$ of the gamma distribution for variable substitution rates across sites (Yang 1994b). The model of a single rate for all sites is specified by settingfix_alpha = 1
andalpha = 0
in the control file (i.e., using0
inalpha = 0
means infinity), while the (discrete-) gamma model is specified by a positive value foralpha
. VariablencatG
is then the number of categories for the discrete-gamma model (BASEML
). Values such as 5, 4, 8, or 10 are reasonable. Note that the discrete gamma model has one parameter ($\alpha$ ), like the continuous gamma model, and the number of categories is not a parameter.
Important
The model of invariable sites is not implemented in PAML
programs. I don’t like the model as it generates a strong correlation between the proportion of invariable sites and the gamma shape parameter. To infer rates at individual sites, use RateAncestor = 1
(see more details about this variable below). Using a large number of categories (say, ncatG = 40
) may be helpful if you are interested in calculating such rates. For detailed descriptions of those models, see Yang (1996a), Chapter 1 of Yang (2006), and chapters 13 and 16 of Felsenstein (2004).
-
fix_rho
andrho
: these variables work in a similar way and concern independence or correlation of rates at adjacent sites, where$\rho$ (i.e., variablerho
in the control file) is the correlation parameter of the auto-discrete-gamma model (Yang 1995). The model of independent rates for sites is specified by settingfix_rho = 1
andrho = 0
in the control file, if choosing to specify alsoalpha = 0
further means a constant rate for all sites. The auto-discretegamma model is specified by positive values for bothalpha
andrho
variables in the control file. The model of a constant rate for sites is a special case of the (discrete) gamma model, which can be enabled by specifying$\alpha=\infty$ (i.e., variablealpha = 0
), and the model of independent rates for sites is a special case of the auto-discrete-gamma model, which can be enabled by setting$\rho = 0$ (i.e., variablerho = 0
). -
nparK
: this variable specifies nonparametric models for variable and Markov-dependent rates across sites. Specifically, when settingnparK = 1
ornaprK = 2
, several categories of independent rates for sites are enabled (i.e., specified with variablencatG
in the control file), whilenparK = 3
ornparK = 4
means the rates are Markov-dependent at adjacent sites. Note thatnparK = 1
andnparK= 3
have the restriction that each rate category has equal probability, whilenparK = 2
andnparK = 4
do not have this restriction (Yang 1995). Please note that variablenparK
takes precedence over variablesalpha
orrho
. -
nhomo
: this variable is only available forBASEML
, and it is used to implement nonstationary models in which the base compositions change along branches on the tree. The basic substitution model (i.e. which can be enabled via optionmodel
in the control file) should be one that has frequency parameters (i.e.,F81
,F84
,HKY85
,T92
,TN93
,REV
/GTR
). When specifyingnhomo = 1
, a homogeneous model is fitted, but the frequency parameters$\pi_{T}$ ,$\pi_{C}$ , and$\pi_{A}$ (i.e., note that$\pi_{G}$ is not a free parameter as the frequencies need to sum up to 1) are estimated by maximum likelihood iteration. This case applies toF81
,F84
,HKY85
,T92
(in which case only$\pi_{GC}$ is a parameter),TN93
, orREV
models. The default option in the control file isnhomo = 0
, which means that the frequency parameters are estimated by the averages of the observed frequencies. For bothnhomo = 0
andnhomo = 1
, you should count 3 (or 1 forT92
)free parameters
for the base frequencies (see above).Note that, when setting
nhomo = 2
, then one transition/transversion rate ratio ($\kappa$ ) is used for each branch in the input tree file for theK80
,F84
, andHKY85
models (Yang 1994c; Yang and Yoder 1999).Options
nhomo = 3
,nhomo = 4
, andnhomo = 5
are used to fit nonhomogeneous models in which base compositions drift on the tree (Yang and Roberts 1995; Galtier and Gouy 1998). The nucleotide substitution is specified by the variablemodel
and is one ofF84
,HKY85
,T92
,TN93
, andREV
/GTR
. Different frequency parameters are used in the rate matrix for different branches in the tree (please check the rate matrices implemented inPAML
programs in the "Substitution models" markdown file). Those options may be used together withfix_kappa = 2
to allow the exchangeability parameters in the substitution rate matrix ($\kappa$ inF84
,HKY85
, andT92
;$\kappa_{1}$ and$\kappa_{2}$ inTN93
; anda
,b
,c
,d
,e
inREV
/GTR
) to differ among branches of the tree as well. Thus the wholeQ
matrix can be different for different branches (seeQ
matrices in the "Substitution models" markdown file).IMPORTANT
The position of the root may make a difference to the likelihood, so that in theory it may be necessary to use rooted trees. Nevertheless, parameters around the root, including the two branch lengths around the root, are expected to be poorly estimated since the model is only weakly identifiable. In short, you need to decide whether it is appropriate to use a rooted tree or unrooted tree (you can help inform your decision by reading section "Rooted versus unrooted trees in the supplementary material in Álvarez-Carretero et al. 2023).Because of the parameter richness, the models may only work if you have very long sequences. Yang and Roberts (1995) used the
HKY85
orF84
models, and so three independent frequency parameters are used to describe the substitution pattern, while Galtier and Gouy (1998) used theT92
substitution model and uses theGC
content$\pi_{GC}$ only, with the base frequencies give as$\pi_{T} = \pi_{A} = (1 – \pi_{GC})/2$ and$\pi_{C} = \pi_{G} = \pi_{GC}/2$ . The optionnhomo = 4
assigns one set of frequency parameters for the root, which are the initial base frequencies at the root, and one set for each branch in the tree. This is modelN2
in Yang and Roberts (1995) if the substitution model isF84
orHKY85
or the model of Galtier and Gouy (1998) if the substitution model isT92
. When specifying optionnhomo = 3
, the following are used: one set of base frequencies for each tip branch, one set for all internal branches in the tree, and one set for the root. This is modelN1
in Yang and Roberts (1995). If using optionnhomo = 5
, the user needs to specify how many sets of frequency parameters should be used and which branch should use which set. The set for the root specifies the initial base frequencies at the root, while the set for any other node is for parameters in the substitution matrix along the branch leading to the node. Node (branch) labels in the tree file are to be used to tell the program which set should use (please see subsection “Tree file format and representations of tree topology” in the "Data formatting" markdown file for more details). There is no need to specify the default set with any label (i.e., set 0). For instance, when specifyingnhomo = 5
and the following tree in the tree file, there are four sets that label four tip branches (i.e.,#1
,#2
,#3
, and#4
), set 5 is the root (i.e., label#5
), and all the internal branches (nodes) will be the default set 0 (i.e., no labels required). This is equivalent to specifyingnhomo = 3
.4 1 (((1 #1, 2 #2), 3 #3), 4 #4) #5;
The output for
nhomo = 3
,nhomo = 4
, andnhomo = 5
is under the heading(frequency parameters for branches) [frequencies at nodes] (see Yang & Roberts 1995 fig 1)
in the output file. Two sets of frequencies are listed for each node. The first set are the parameters (used in the substitution rate matrix for the branch leading to the node), and the second set are the expected base frequencies at the node, calculated from the model (Yang and Roberts 1995; page 456 column top). If the node is the root, the same set of frequencies are printed twice.Note that the use of the variable
fix_kappa
together withnhomo = 3
,nhomo = 4
, ornhomo = 5
in the control file is somewhat unusual. When settingfix_kappa = 1
, one common$\kappa$ is assumed and estimated for all branches, whilefix_kappa = 0
means that one$\kappa$ is estimated for each branch. When specifyingfix_kappa = 2
, the same branch/node labels are used to specify the exchangeability parameters (i.e.,$\kappa$ inF84
,HKY85
, andT92
;$\kappa_{1}$ and$\kappa_{2}$ inTN93
, and$a$ ,$b$ ,$c$ ,$d$ , and$e$ inREV
/GTR
) for branches. As an example, suppose we have the following options in the control file:model = 7 * REV/GTR model nhomo = 5 fix_kappa = 2
In addition, suppose the input tree file has the following node/branch labels:
4 1 (((1 #1, 2 #2), 3 #3), 4 #4) #5;
In this case, there will be 49 parameters: 6 branch lengths in the rooted tree, 5 sets of exchangeability parameters (i.e.,
$a$ ,$b$ ,$c$ ,$d$ , and$e$ ), and 6 sets of base frequency parameters, with$6 + 5 \times 5 + 6 \times 3 = 49$ . In other words, the model will assume differentGTR
rate matrices for the four tip branches (with 8 free parameters in eachQ
matrix), and oneQ
matrix for the two internal branches, plus the initial base frequencies at the root, plus the 6 branch lengths, with$5 \times 8 + 3 + 6 = 49$ .IMPORTANT
Note that the same labels in the input tree file are used to label both the nodes (for the frequencies) and the branches (for the exchangeability parameters in the rate matrix). An awkward issue is that we need a set of base frequencies at the root node, but we do not need a set of exchangeability parameters for the root branch. Thus, with a rooted tree of$s$ species, there are$2s – 1$ sets of frequencies and$2s – 2$ sets of exchangeability parameters. If the root has a separate frequency parameter, please use the last (largest) label for the root. Following the example above, the root node has been identified with label#5
as it will have a separate set of frequency parameters. In that way, there are six sets of frequency parameters identified by labels#0
,#1
,#2
,#3
,#4
, and#5
(i.e., these labels identify nodes), while the first five labels (i.e.,#0
,#1
,#2
,#3
, and#4
) are reused to identify the five sets of branches for the exchangeability parameters.As another example, the following two Newick trees with labels to identify nodes/branches would be equivalent:
(((1 #1, 2 #1), 3), 4); (((1 , 2) #1, 3 #1) #1, 4 #1) #1;
Either of them will fit the same model, which assumes that the base frequencies have been stationary until the ancestor of species 1 and 2, at which point the two lineages 1 and 2 have been drifting away from the old frequencies. Note that all branches and nodes are labelled, with the default to be
#0
. In particular, the root node in the first tree has the label#0
.When setting options
nhomo = 3
,nhomo = 4
, ornhomo = 5
, the Expected Markov counting (EMC) method of Matsumoto et al. (2015) generates printouts under the headingExpected numbers of nucleotide changes on branches expected
in the output file. This is available for the model of one rate for all sites only whenfix_alpha = 1
andalpha = 0
are specified in the control file. -
getSE
: this variable can be set togetSE = 0
,getSE = 1
, orgetSE = 2
. These three options tell whether we want estimates of the standard errors of estimated parameters. These are crude estimates, calculated by the curvature method, i.e., by inverting the matrix of second derivatives of the log-likelihood with respect to parameters. The second derivatives are calculated by the difference method and are not always reliable. Even if this approximation is reliable, tests relying on the SE's should be taken with caution, as such tests rely on the normal approximation to the maximum likelihood estimates. The likelihood ratio test should always be preferred. Please specifygetSE = 0
when tree-search is performed as this option is not available under this scenario. -
RateAncestor
: this variable can be set to0
or1
. IfRateAncestor = 1
, the program is forced to do two additional analyses, which you can ignore if you do not need the results. First, under a model of variable rates across sites such as the gamma,RateAncestor = 1
forces the program to calculate rates for individual sites along the sequence (output that will be saved in the outputrates
file), using the empirical Bayes procedure (Yang and Wang 1995). Secondly,RateAncestor = 1
forces the program to perform the empirical Bayesian reconstruction of ancestral sequences (Yang et al. 1995b; Koshi and Goldstein 1996; [Yang 2006 pages 119-124(http://abacus.gene.ucl.ac.uk/CME/)]). Ancestral state reconstruction by parsimony (Fitch 1971; Hartigan 1973) is well known (i.e., it is implemented in thePAML
programpamp
). It can also be achieved using the likelihood/empirical Bayes approach. Often, the two approaches produce similar results, but the likelihood-based reconstruction has two advantages over parsimony: this method uses information from the branch lengths and the relative substitution rates between characters (nucleotides) and provides a measure of uncertainties in the form of posterior probabilities for reconstructed ancestral states.The results are listed, by site, in the output file
rst
. You can also use the variableverbose
to control how much information you want to be written in this output file. Ifverbose = 0
, only the best nucleotide (the one with the highest posterior probability) at each node at each site is listed, while withverbose = 1
(try2
if1
does not work), the full posterior probability distribution from the marginal reconstruction is listed. If the model is homogenous (i.e., ifnhomo = 0
ornhomo = 1
have been specified in the control file) and assumes one rate for all sites, both the joint and marginal ancestral reconstructions will be calculated. If the model assumes variable rates among sites like the gamma model, only the marginal reconstructions are calculated. More details about ancestral sequence reconstruction in the section below -
Small_Diff
: this variable is a small value used in the difference approximation of derivatives. Use a value between$1e-8$ and$1e-5$ and check that the results are not sensitive to the value used by changing the variableSmall_Diff
in the control file, which will show whether the results are stable regardless of the value used. -
cleandata
: this variable controls whether ambiguity characters and gaps are to be removed from the input sequence file. Ifcleandata = 1
, then sites involving ambiguity characters (undetermined nucleotides such asN
,?
,W
,R
,Y
, etc., anything other than the four nucleotides) or alignment gaps are removed from all sequences. Ifcleandata = 0
(i.e., the default option), those sites are kept. Please not that we recommend that sequences in the input sequence file are already aligned and columns or regions that are predominantly gaps or are hard to align have been dealt with prior to runningBASEML
; then usecleandata = 0
to preserve information in the data. -
method
: this variable controls the iteration algorithm for estimating branch lengths under a model of no clock. Ifmethod = 0
, the old algorithm implemented inPAML
is used, which updates all parameters including branch lengths simultaneously. The later algorithm implemented inPAML
can be enabled by settingmethod = 1
, an algorithm that updates branch lengths one by one. Please note thatmethod = 1
does not work under the clock models (i.e., when specifyingclock = 1
,clock = 2
, orclock = 3
). -
icode
: this variable specifies the genetic code to be used for ancestral reconstruction of protein-coding DNA sequences. This is implemented to compare results of ancestral reconstruction with codon-based analysis. For example the "F3×4" codon model of Goldman and Yang (1994) is very similar to the nucleotide modelHKY85
with different substitution rates and base frequencies for the three codon positions. The latter is implemented by using use optionsGC
in the input sequence file (please check the "Data formatting" markdown file for details on how to specify this option in the input sequence file) together by including optionsmodel = 4
andMgene = 4
in the control file. To use the optionicode
, you have to enable optionRateAncestor = 1
. Note that there the following genetic codes have been implemented inPAML
, which are enabled by specifying values the setting theicode
variable to the following options:-
0
: universal, -
1
: mammalian mt. -
2
: yeast mt. -
3
: mold mt. -
4
: invertebrate mt. -
5
: ciliate nuclear -
6
: echinoderm mt. -
7
: euplotid mt. -
8
: alternative yeast nu. -
9
: ascidian mt. -
10
: blepharisma nu.
NOTE
These codes correspond totransl_table
1 to 11 of GENEBANK. -
-
fix_blength
: this variable tells the program what to do if the tree has branch lengths. When settingfix_blength = 0
(i.e., "ignore"), branch lengths in the input tree file (if they have been incorporated by the user) will be ignored and initial branch lengths are generated using pairwise distances and random numbers. Iffix_blength = –1
(i.e., "random"), then random initial branch lengths will be used. This option generates very poor branch lengths and might be useful if there are multiple local optima on the likelihood surface. When specifying,fix_blength = 1
(i.e., "initial2), branch lengths in the tree file will be used as initial values for ML iteration. Please try to avoid incorporating in your input tree file branch lengths from a parsimony analysis, as those are numbers of changes for the entire sequence (rather than per site) and are very poor initial values. When settingfix_blength = 2
(i.e., "fixed"), branch lengths will be fixed at those values given in the tree file and not estimated by ML. In this case, you should make sure that the branch lengths are sensible; for example, if two sequences in the data file are different, but the branch lengths connecting the two sequences in the tree are all zero, the input sequence and the input tree will be in conflict and the program will abort. Lastly, when specifyingfix_blength = 3
(i.e., "proportional"), branch lengths will be proportional to those given in the tree file, and the proportionality factor is estimated by ML.
The output should be self-explanatory. Descriptive statistics are always listed. The observed site patterns and their frequencies are listed, together with the proportions of constant patterns. Nucleotide frequencies for each species (and for each gene in case of multiple gene data) are counted and listed. Note that lmax = ln(Lmax)
is the upper limit of the log likelihood and may be compared with the likelihood for the best (or true, if simulations) tree under the substitution model to test the model's goodness of fit to data (Goldman 1993; Yang et al. 1995a). You can ignore this if you do not know what it means.
The pairwise sequence distances are included in the output as well, and also in a separate file called 2base.t
. Note that this matrix is a lower-diagonal distance matrix, readable by the NEIGHBOR
program in Felsenstein's PHYLIP package (Felsenstein 2005). For models JC69
, K80
, F81
, and F84
, the appropriate distance formulas are used, while for more complex models, the TN93
formula is used. BASEML
is mainly a maximum likelihood program, and the distance matrix is printed out for convenience and really has nothing to do with the later likelihood calculation.
If you enable variable getSE = 1
in the control file, the standard errors (S.E.s) are calculated as the square roots of the large sample variances and listed exactly below the parameter estimates. Zeros on this line mean errors, either caused by divergence of the algorithm or zero branch lengths. The S.E.s of the common parameters measure the reliability of the estimates. For example, K80
, can be compared with a normal distribution to see whether there is real difference between K80
and JC69
. The test can be more reliably performed by comparing the log-likelihood values under the two models, using the likelihood ratio test. It has to be stressed that the S.E.’s of the estimated branch lengths should not be misinterpreted as an evaluation of the reliability of the estimated tree topology (Yang 1994a).
If the tree file has more than one "tree block" (i.e., there are more than one tree in Newick format specified in the input tree file), the programs BASEML
and CODEML
will calculate the bootstrap proportions using the RELL method (Kishino and Hasegawa 1989), as well as the method of Shimodaira and Hasegawa (1999) with a correction for multiple comparison. The bootstrap resampling accounts for possible data partitions (option G
in the input sequence file, please check the "Data formatting" markdown file for details on how to specify this option in the input sequence file).
A good use of ancestral sequence reconstruction is to synthesise the inferred ancestral proteins and measure their biochemical properties in the lab (Pauling and Zuckerkandl 1963; Chang and Donoghue 2000; Thornton 2004). It is also very popular to use reconstructed ancestral sequences as if they were real observed data to perform further analysis. You should resist this irresistible temptation and use full likelihood methods if they are available (e.g., Yang 2002). See section 4.4.4 in Yang (2006) for a discussion of systematic biases in ancestral reconstruction.
For nucleotide-based analysis of protein coding DNA sequences (i.e., using BASEML
with option GC
in the input sequence file, see corresponding section in "Data formatting" markdown file for details on this format), the program also calculates the posterior probabilities of ancestral amino acids. In this analysis, branch lengths and other parameters are estimated under a nucleotide substitution model, but the reconstructed nucleotide triplets are treated as a codon to infer the most likely amino acid encoded. Posterior probabilities for stop codons are small and reset to zero to scale the posterior probabilities for amino acids. To use this option, you should add the variable icode
in the control file baseml.ctl
. Note that variable icode
can take a value out that ranges from 0
to 11
, which correspond to the 12 genetic codes included in PAML
(see details for variable icode
above).
A nucleotide substitution model that is very close to a codon-substitution model can be specified as follows. You add the option characters GC
at the end of the first line in the input sequence file and set model = 4
(HKY85
) and Mgene = 4
in the control file. The model then assumes different substitution rates, different base frequencies, and different transition/transversion rate ratio (£\kappa$) for the three codon positions. Ancestral reconstruction from such a nucleotide substitution should be very similar to codon-based reconstruction.
I have added the option of joint ancestral reconstruction under the nonhomogeneous models. The variables that enable this feature are nhomo = 3
(N1
in YR1995), nhomo = 4
(N2
in YR1995), and nhomo = 5
(user-specified branch types) under the nucleotide substitution models enabled when specifying model = 3
(F84
), model = 4
(HKY85
), model = 5
(T92
), model = 6
(TN93
), or model = 7
(REV
/GTR
). This approach works only for the model of one rate for all sites, and does not work for the model of gamma rates for sites or the partition models (i..e, option G
). Only joint reconstruction is available as the algorithm I used for marginal reconstruction does not work for nonhomogeneous models.
Marginal reconstruction considers character assignments to one single interior node and the character with the highest posterior probability is the best reconstruction (eq. 4 in Yang et al. 1995b; or Eq. 4.15 in Yang 2006). The algorithm for marginal reconstruction implemented in PAML
works under both the model of a constant rate for all sites and the gamma model of rates at sites. Joint reconstruction considers all ancestral nodes at the same time and the reconstruction (the set of characters at a site assigned to all interior nodes) with the highest posterior probability is the best reconstruction (eq. 2 in Yang et al. 1995b; or Eq. 4.16 in Yang 2006). The algorithm for joint reconstruction implemented in PAML
is based on that of Pupko et al. (2000), which gives the best reconstruction at each site and its posterior probability. This method works under the model of one rate for all sites only (note that this approach works under the partition models).
The marginal and joint approaches use slightly different criteria, and expected to produce consistent results; that is, the most probable joint reconstruction for a site should almost always consist of characters that are also the best in the marginal reconstruction. Conflicts may arise in borderline cases where two or more reconstructions have similar posterior probabilities.
© Copyright 1993-2023 by Ziheng Yang
The software package is provided "as is" without warranty of any kind. In no event shall the author or their employer be held responsible for any damage resulting from the use of this software, including but not limited to the frustration that you may experience in using the package. The program package, including source codes, example data sets, executables, and this documentation is maintained by Ziheng Yang and distributed under the GNU GPL v3.
Ziheng Yang
Department of Genetics, Evolution, and Environment
University College London
Gower Street
WC1E 6BT, London, United Kingdom