Skip to content
sabifo4 edited this page Nov 13, 2024 · 5 revisions

BASEML

Control file

The default control file for BASEML is baseml.ctl. Please consider the following when generating your control file:

  • Spaces are required on both sides of the equal sign
  • Blank lines or lines beginning with * are treated as comments.
  • Options not used can be deleted from the control file.
  • The order of the variables is not important.

You can find an example of a control file to execute the PAML program BASEML below:

seqfile      = brown.nuc   * path to sequence data file
outfile      = mlb         * path to main result file
treefile     = brown.trees * path to tree file

noisy        = 3 * 0,1,2,3: how much rubbish on the screen
verbose      = 0 * 1: detailed output, 0: concise output

runmode      = 0 * 0: user tree; 1: semi-automatic; 2: automatic
                 * 3: StepwiseAddition; (4,5):PerturbationNNI

model        = 5     * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
                     * 5:T92, 6:TN93, 7:REV, 8:UNREST, 9:REVu; 10:UNRESTu
Mgene        = 0     * 0:rates, 1:separate; 2:diff pi, 3:diff kapa, 4:all diff
* ndata        = 1   * number of data sets
clock        = 0     * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
TipDate      = 0 100 * TipDate (1) & time unit
fix_kappa    = 0     * 0: estimate kappa; 1: fix kappa at value below; 2: kappa for branches
kappa        = 2.5   * initial or fixed kappa
fix_alpha    = 1     * 0: estimate alpha; 1: fix alpha at value below
alpha        = 0.    * initial or fixed alpha, 0:infinity (constant rate)
Malpha       = 0     * 1: different alpha's for genes, 0: one alpha
ncatG        = 5     * # of categories in the dG, AdG, or nparK models of rates
fix_rho      = 1     * 0: estimate rho; 1: fix rho at value below
rho          = 0.    * initial or fixed rho, 0:no correlation
nparK        = 0     * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK

nhomo        = 0     * 0 & 1: homogeneous, 2: kappa for branches, 3: N1, 4: N2, 5: user
getSE        = 0     * 0: don't want them, 1: want S.E.s of estimates
RateAncestor = 0     * (0,1,2): rates (alpha>0) or ancestral states
Small_Diff   = 1e-6  * small value used in the difference approximation of derivatives
method       = 0     * 0: simultaneous; 1: one branch at a time
* cleandata    = 1 * remove sites with ambiguity data (1:yes, 0:no)?
* icode        = 0 * (RateAncestor=1 for coding genes, "GC" in data)
* fix_blength  = 0 * 0: ignore, -1: random, 1: initial, 2: fixed, 3: proportional

NOTE
This is just an example; you may have to remove some options and/or modify the settings depending on the data that you want to analyse!

The control variables are described below:

  • seqfile, outfile, and treefile: these variables take the path (absolute/relative) to the input sequence file, the main output file, and the input tree file; respectively. Note that, if the control file is located in the same folder where you have the input files, then you do not need to write an absolute/relative path: you can just type the name of the file. Note that you should not have spaces as part of the file name. If, for any reason, you are using an absolute/relative path to define these variables and you have spaces, make sure you scape them with a \. In general, try to avoid special characters in a file name as they might have special meanings under the OS.

  • noisy: this variable controls how much output you want on the screen. If the model being fitted involves much computation, you can choose a large number for noisy to avoid loneliness.

  • verbose: this variable controls how much output you want printed in your output file.

  • runmode: when this variable is set to runmode = 0, evaluation of the tree topologies specified in the input tree file will take place. If runmode = 1 or 2, a heuristic tree search by the star-decomposition algorithm will take place. With runmode = 2, the algorithm starts from the star tree, while if runmode = 1, the program will read a multifurcating tree from the input tree file and try to estimate the best bifurcating tree compatible with it. When using runmode = 3, stepwise addition is enabled. If runmode = 4 is specified, then NNI perturbation with the starting tree obtained by a parsimony algorithm will be used. Lastly, if runmode = 5, NNI perturbation with the starting tree read from the input tree file will take place. The tree search options do not work well, and so use runmode = 0 as much as you can. For relatively small data sets, the stepwise addition algorithm seems usable.

  • model: this variable specifies the model of nucleotide substitution to be used (please see the "Substitution models" markdown file for a summary of how the rate matrices are parameterised in PAML). By specifying model = 0, model = 1, until model = 8, users can enable the following models: JC69 (model = 0), K80 (model = 1), F81 (model = 2), F84 (model = 3), HKY85 (model = 4), T92 (model = 5), TN93 (model = 6), REV (also known as GTR, model = 7), and UNREST (model = 8), respectively. Check Yang (1994c) or Yang (2006: Table 1.1) for notation (see also the "Substitution models" markdown file). Two more user-specified models are implemented as special cases of the REV/GTR model (model = 9) and the UNREST model (model = 10). These models are called REVu and UNRESTu, respectively. The format is illustrated in the following examples:

    model = 10 [0] /* JC69 */
    model = 10 [1 (TC CT AG GA)] /* K80 */
    model = 10 [11 (TA) (TG) (CT) (CA) (CG) (AT) (AC) (AG) (GT) (GC) (GA) ] /* UNREST */
    model = 10 [5 (AC CA) (AG GA) (AT TA) (CG GC) (CT TC)] /* SYM */
    model = 9 [2 (TA TG CA CG) (AG)] /* TN93 */
    

    The number within square brackets [] is the number of free rate parameters; this is then followed by nucleotide pairs, which should have the same rates, grouped in the same pair of parentheses. For example, the line for SYM specifies an UNRESTu model with 5 rates, with one rate for A→C and C→A changes, one rate for A→G and G→A changes, and so on. The model has a symmetrical rate matrix Q, so that the equilibrium frequencies are $1/4$ for each nucleotide. Note that under UNREST and UNRESTu, the equilibrium nucleotide frequencies are not parameters; they are given by the rate parameters: see equation 1.56 in Yang (2006). Similarly, model = 9 specifies special cases of the REV/GTR model, or REVu models. In one specifies TC or CT, but not both of them, since those are assumed to have the same rate (exchangeability) parameters.

  • Mgene and Malpha: this variable is used in combination with option G in the sequence data file (please see the corresponding section in the "Data formatting" markdown file for details on how to specify the format of the input sequence file), for combined analysis of data from multiple genes or multiple site partitions (such as the three codon positions) (Yang 1996b; pages 116-8 in Yang 2006). Such models are also called partition models.

    The description here refers to multiple genes, but applies to any strategy of partitioning sites, such as by codon position. Similar partition models are implemented in CODEML for analysing codon or amino acid sequences. The simplest model, which assumes complete homogeneity among genes, can be fitted by concatenating different genes into one sequence without using the option G (and by specifying Mgene = 0 in the control file as well as not having option G in the input sequence file). The most general model is equivalent to a separate analysis, and can be done by fitting the same model to each data set (each gene), but can also be done by specifying Mgene = 1 with the option G in the combined input sequence file. The sum of the log-likelihood values over different genes is then the log likelihood of the most general model considered here. Models accounting for some aspects of the heterogeneity of multiple genes are fitted by specifying Mgene in combination with the option G in the input sequence file. When specifying Mgene = 0, the model assumes different substitution rates but the same pattern of nucleotide substitution for different genes. When Mgene = 2 is specified, different frequency parameters for different genes are assumed but with the same rate ratio parameters ($\kappa$ in the K80, F84, HKY85 models or the rate parameters in the TN93 and REV models). When specifying Mgene = 3, different rate ratio parameters and the same frequency parameters are assumed. If Mgene = 4, both different rate ratio parameters and different frequency parameters for different genes are assumed. Parameters and assumptions made in these models are summarised in Table 1 (see below), with the HKY85 model used as an example. When substitution rates are assumed to vary from site to site, the control variable Malpha specifies whether one gamma distribution will be applied across all sites (Malpha = 0) or a different gamma distribution is used for each gene (or codon position).

    Table 1. Setups of partition models with the HKY85 nucleotide substitution model as an example.

    Sequence file Control file Parameters across genes
    No G Mgene = 0 everything equal
    Option G Mgene = 0 the same $\kappa$ and $\pi$, but different $c$s (proportional branch lengths)
    Option G Mgene = 1 different $\kappa$, $p$s, and different (not proportional) branch lengths
    Option G Mgene = 2 the same $\kappa$, but different $\pi$s and $c$s
    Option G Mgene = 3 the same $\pi$, but different $\kappa$s and $c$s
    Option G Mgene = 4 different $\kappa$, $p$s, and $c$s

    The different $c$s for different genes mean that branch lengths estimated for different genes are proportional. Parameters $\pi$ represent the equilibrium nucleotide frequencies, which are estimated using the observed frequencies (nhomo = 0). The transition/transversion rate ratio $\kappa$ in the HKY85 substitution model can be replaced by the two or five rate ratio parameters under the TN93 or REV models, respectively. The likelihood ratio test can be used to compare these models, using an approach called the analysis of deviance (McCullagh and Nelder 1989), which is very similar to the more familiar analysis of variance.

  • ndata: this variable specifies the number of alignment blocks in the input sequence file. Note that the different alignment blocks in the input sequence file must have the same sequence names (or tags). This variable may be useful for analysing simulated replicate datasets or gene alignments for the same set of species from genomic sequencing projects. For example, you can use evolver to generate 100 replicate datasets using an input tree file and a set of parameters, and then set ndata = 100 in the control file to run BASEML or CODEML (depending on using nucleotide, codon, or amino acid data) to analyse them. From PAML v4.10, I have added a maintree option to accommodate the situation where some species are missing for some genes when the sequence data file has alignments for many genes from the same species. The user provides a main tree for all species, and BASEML or CODEML will prune the main tree to generate the gene trees, for the species present in each gene alignment. There are now four cases with this option variable, as follows. Please see the README.txt file in the folder examples/ndata/ for details and examples. You can also check the positive-selection PAML GitHub repository for an example of how to run these four cases with codon data (the workflow would be the same but, instead of running CODEML and using codon data, you would have nucleotide data and run BASEML). Below, you can find a summary of how to specify each case in the control file:

    • (case a): ndata = 100, one tree block used for all datasets/alignments.
    • (case b): ndata = 100 separate_trees, each alignment has its own tree block.
    • (case c, default): ndata = 100 maintree, this uses the maintree file to generate the genetrees/subtrees and runs the ML analysis.
    • (case c, using the third argument 1, which is the default as shown above): ndata = 100 maintree 1, this uses the maintree file to generate the genetrees/subtrees and runs the ML analysis.
    • (case d): ndata = 100 maintree 0, this generates the genetrees.trees file but does not run ML.
  • clock: this variable specifies models concerning rate constancy or variation among lineages. When specifying clock = 0, no clock and rates are entirely free to vary from branch to branch. An unrooted tree should be used under this model. For a binary tree with $n$ species (sequences), this model has ($2n – 3$) parameters (branch lengths). If clock = 1 is specified, the global clock is assumed, and all branches have the same rate. This model has ($n – 1$) parameters corresponding to the ($n – 1$) internal nodes in the binary tree. So a test of the molecular clock assumption, which compares those two models, should have $df = n – 2$.

    Under clock = 2, the local clock models are enabled (Yoder and Yang 2000; Yang and Yoder 2003). Most branches in the phylogeny conform with the clock assumption and have the default rate ($r_{0} = 1$), but some pre-defined branches may have different rates. Rates for branches are specified using branch labels (# and $) in the tree file (please check the "Data formatting" markdown file for details on this format). For example, the tree in Newick format (((1,2) #1, 3), 4); specifies rate $r_{1}$ for the branch ancestral to species 1 and 2 while all other branches have the default rate $r_{0}$, which does not have to be specified. The user need to specify which branch has which rate, and the program estimates the unknown rates (such as $r_{1}$ in the above example; $r_{0} = 1$ is the default rate). You need to be careful when specifying rates for branches to make sure that all rates can be estimated by the model; if you specify too many rate parameters, especially for branches around the root, it may not be possible to estimate all of them and you will have a problem with identifiability. The number of parameters for a binary tree in the local clock model is ($n – 1$) plus the number of extra rate parameters for branches. In the above examples tree with 4 species, you have only one extra rate parameter, $r_{1}$, and so the local clock model has $(n – 1) + 1 = n = 4$ parameters. The no-clock model has 5 parameters while the global clock has 3 parameters for that tree.

    The option clock = 3 is for combined analysis of multiple-gene or multiple-partition data, allowing the branch rates to vary in different ways among the data partitions (Yang and Yoder 2003). To account for differences in the evolutionary process among data partitions, you have to use the option G in the input sequence file (please see the corresponding section in the "Data formatting" markdown file for details on how to specify the format of the input sequence file) as well as the control variable Mgene in the control file (i.e., baseml.ctl when running BASEML or codeml.ctl when running CODEML; see details for this variable above). You can find more details about this variable in Yang and Yoder (2003) and check the README.txt file in the examples/MouseLemurs/ folder to duplicate the analysis published in that paper. Also, you can use clock = 5 or clock = 6 to enable the ad hoc rate smoothing procedure of Yang (2004). For details on how to enable this model, please read the README2.txt file in the examples/MouseLemurs/ folder.

    For clock = 1 or clock = 2, a rooted tree should be used. If fossil calibration information is specified in the tree file using the symbol @ or =, the absolute rate will be calculated (please check the "Data formatting" markdown file for details on this format). Multiple calibration points can be specified this way, but only point calibrations (where a node age is assumed to be known without error) are accepted and bounds are not accepted. This kind of calibration information (point calibration) is not used by the Bayesian dating program MCMCtree. Please check the MCMCtree documentation to learn about when you should use this dating program and what calibrations can be used.

  • TipDate: this variable is used to estimate ages of node ages on the rooted tree when the sequences at the tips having sampling dates, as in the case of sequentially sampled viral sequences. The sample dates are the last field in the sequence name. The time unit is specified by the user on this line. Please check the README.txt file in the examples/TipDate.HIV2/ folder for more details on when to use this variable and how to use it.

  • fix_kappa and kappa: this variable specifies whether $\kappa$ in K80, F84, or HKY85 is given at a fixed value or is to be estimated by iteration from the data. If fix_kappa = 1, the value of another variable, kappa, is the given value. Otherwise, if fix_kappa = 0 is used in the control file, then the value specified in variable kappa is used as the initial estimate for iteration. The variables fix_kappa and kappa have no effect with JC69 or F81 which do not involve such a parameter, or with TN93 and REV which have two and five rate parameters, respectively, when all of them are estimated from the data. The option fix_kappa = 2 is used together with nhomo = 5 to specify nonstationary models in BASEML in which the whole rate matrix Q (both parameter $\kappa$ or the exchangeability parameters and the base frequency parameters) vary among branches. See the description of variable nhomo below.

  • fix_alpa, alpha, and ncatG: these variables work in a similar way as fix_kappa and alpha. In this case, alpha refers to the shape parameter $\alpha$ of the gamma distribution for variable substitution rates across sites (Yang 1994b). The model of a single rate for all sites is specified by setting fix_alpha = 1 and alpha = 0 in the control file (i.e., using 0 in alpha = 0 means infinity), while the (discrete-) gamma model is specified by a positive value for alpha. Variable ncatG is then the number of categories for the discrete-gamma model (BASEML). Values such as 5, 4, 8, or 10 are reasonable. Note that the discrete gamma model has one parameter ($\alpha$), like the continuous gamma model, and the number of categories is not a parameter.

Important

The model of invariable sites is not implemented in PAML programs. I don’t like the model as it generates a strong correlation between the proportion of invariable sites and the gamma shape parameter. To infer rates at individual sites, use RateAncestor = 1 (see more details about this variable below). Using a large number of categories (say, ncatG = 40) may be helpful if you are interested in calculating such rates. For detailed descriptions of those models, see Yang (1996a), Chapter 1 of Yang (2006), and chapters 13 and 16 of Felsenstein (2004).

  • fix_rho and rho: these variables work in a similar way and concern independence or correlation of rates at adjacent sites, where $\rho$ (i.e., variable rho in the control file) is the correlation parameter of the auto-discrete-gamma model (Yang 1995). The model of independent rates for sites is specified by setting fix_rho = 1 and rho = 0 in the control file, if choosing to specify also alpha = 0 further means a constant rate for all sites. The auto-discretegamma model is specified by positive values for both alpha and rho variables in the control file. The model of a constant rate for sites is a special case of the (discrete) gamma model, which can be enabled by specifying $\alpha=\infty$ (i.e., variable alpha = 0), and the model of independent rates for sites is a special case of the auto-discrete-gamma model, which can be enabled by setting $\rho = 0$ (i.e., variable rho = 0).

  • nparK: this variable specifies nonparametric models for variable and Markov-dependent rates across sites. Specifically, when setting nparK = 1 or naprK = 2, several categories of independent rates for sites are enabled (i.e., specified with variable ncatG in the control file), while nparK = 3 or nparK = 4 means the rates are Markov-dependent at adjacent sites. Note that nparK = 1 and nparK= 3 have the restriction that each rate category has equal probability, while nparK = 2 and nparK = 4 do not have this restriction (Yang 1995). Please note that variable nparK takes precedence over variables alpha or rho.

  • nhomo: this variable is only available for BASEML, and it is used to implement nonstationary models in which the base compositions change along branches on the tree. The basic substitution model (i.e. which can be enabled via option model in the control file) should be one that has frequency parameters (i.e., F81, F84, HKY85, T92, TN93, REV/GTR). When specifying nhomo = 1, a homogeneous model is fitted, but the frequency parameters $\pi_{T}$, $\pi_{C}$, and $\pi_{A}$ (i.e., note that $\pi_{G}$ is not a free parameter as the frequencies need to sum up to 1) are estimated by maximum likelihood iteration. This case applies to F81, F84, HKY85, T92 (in which case only $\pi_{GC}$ is a parameter), TN93, or REV models. The default option in the control file is nhomo = 0, which means that the frequency parameters are estimated by the averages of the observed frequencies. For both nhomo = 0 and nhomo = 1, you should count 3 (or 1 for T92) free parameters for the base frequencies (see above).

    Note that, when setting nhomo = 2, then one transition/transversion rate ratio ($\kappa$) is used for each branch in the input tree file for the K80, F84, and HKY85 models (Yang 1994c; Yang and Yoder 1999).

    Options nhomo = 3, nhomo = 4, and nhomo = 5 are used to fit nonhomogeneous models in which base compositions drift on the tree (Yang and Roberts 1995; Galtier and Gouy 1998). The nucleotide substitution is specified by the variable model and is one of F84, HKY85, T92, TN93, and REV/GTR. Different frequency parameters are used in the rate matrix for different branches in the tree (please check the rate matrices implemented in PAML programs in the "Substitution models" markdown file). Those options may be used together with fix_kappa = 2 to allow the exchangeability parameters in the substitution rate matrix ($\kappa$ in F84, HKY85, and T92; $\kappa_{1}$ and $\kappa_{2}$ in TN93; and a, b, c, d, e in REV/GTR) to differ among branches of the tree as well. Thus the whole Q matrix can be different for different branches (see Q matrices in the "Substitution models" markdown file).

    IMPORTANT
    The position of the root may make a difference to the likelihood, so that in theory it may be necessary to use rooted trees. Nevertheless, parameters around the root, including the two branch lengths around the root, are expected to be poorly estimated since the model is only weakly identifiable. In short, you need to decide whether it is appropriate to use a rooted tree or unrooted tree (you can help inform your decision by reading section "Rooted versus unrooted trees in the supplementary material in Álvarez-Carretero et al. 2023).

    Because of the parameter richness, the models may only work if you have very long sequences. Yang and Roberts (1995) used the HKY85 or F84 models, and so three independent frequency parameters are used to describe the substitution pattern, while Galtier and Gouy (1998) used the T92 substitution model and uses the GC content $\pi_{GC}$ only, with the base frequencies give as $\pi_{T} = \pi_{A} = (1 – \pi_{GC})/2$ and $\pi_{C} = \pi_{G} = \pi_{GC}/2$. The option nhomo = 4 assigns one set of frequency parameters for the root, which are the initial base frequencies at the root, and one set for each branch in the tree. This is model N2 in Yang and Roberts (1995) if the substitution model is F84 or HKY85 or the model of Galtier and Gouy (1998) if the substitution model is T92. When specifying option nhomo = 3, the following are used: one set of base frequencies for each tip branch, one set for all internal branches in the tree, and one set for the root. This is model N1 in Yang and Roberts (1995). If using option nhomo = 5, the user needs to specify how many sets of frequency parameters should be used and which branch should use which set. The set for the root specifies the initial base frequencies at the root, while the set for any other node is for parameters in the substitution matrix along the branch leading to the node. Node (branch) labels in the tree file are to be used to tell the program which set should use (please see subsection “Tree file format and representations of tree topology” in the "Data formatting" markdown file for more details). There is no need to specify the default set with any label (i.e., set 0). For instance, when specifying nhomo = 5 and the following tree in the tree file, there are four sets that label four tip branches (i.e., #1, #2, #3, and #4), set 5 is the root (i.e., label #5), and all the internal branches (nodes) will be the default set 0 (i.e., no labels required). This is equivalent to specifying nhomo = 3.

    4 1
    (((1 #1, 2 #2), 3 #3), 4 #4) #5;
    

    The output for nhomo = 3, nhomo = 4, and nhomo = 5 is under the heading (frequency parameters for branches) [frequencies at nodes] (see Yang & Roberts 1995 fig 1) in the output file. Two sets of frequencies are listed for each node. The first set are the parameters (used in the substitution rate matrix for the branch leading to the node), and the second set are the expected base frequencies at the node, calculated from the model (Yang and Roberts 1995; page 456 column top). If the node is the root, the same set of frequencies are printed twice.

    Note that the use of the variable fix_kappa together with nhomo = 3, nhomo = 4, or nhomo = 5 in the control file is somewhat unusual. When setting fix_kappa = 1, one common $\kappa$ is assumed and estimated for all branches, while fix_kappa = 0 means that one $\kappa$ is estimated for each branch. When specifying fix_kappa = 2, the same branch/node labels are used to specify the exchangeability parameters (i.e., $\kappa$ in F84, HKY85, and T92; $\kappa_{1}$ and $\kappa_{2}$ in TN93, and $a$, $b$, $c$, $d$, and $e$ in REV/GTR) for branches. As an example, suppose we have the following options in the control file:

      model     = 7 * REV/GTR model
      nhomo     = 5
      fix_kappa = 2
    

    In addition, suppose the input tree file has the following node/branch labels:

    4 1
    (((1 #1, 2 #2), 3 #3), 4 #4) #5;
    

    In this case, there will be 49 parameters: 6 branch lengths in the rooted tree, 5 sets of exchangeability parameters (i.e., $a$, $b$, $c$, $d$, and $e$), and 6 sets of base frequency parameters, with $6 + 5 \times 5 + 6 \times 3 = 49$. In other words, the model will assume different GTR rate matrices for the four tip branches (with 8 free parameters in each Q matrix), and one Q matrix for the two internal branches, plus the initial base frequencies at the root, plus the 6 branch lengths, with $5 \times 8 + 3 + 6 = 49$.

    IMPORTANT
    Note that the same labels in the input tree file are used to label both the nodes (for the frequencies) and the branches (for the exchangeability parameters in the rate matrix). An awkward issue is that we need a set of base frequencies at the root node, but we do not need a set of exchangeability parameters for the root branch. Thus, with a rooted tree of $s$ species, there are $2s – 1$ sets of frequencies and $2s – 2$ sets of exchangeability parameters. If the root has a separate frequency parameter, please use the last (largest) label for the root. Following the example above, the root node has been identified with label #5 as it will have a separate set of frequency parameters. In that way, there are six sets of frequency parameters identified by labels #0, #1, #2, #3, #4, and #5 (i.e., these labels identify nodes), while the first five labels (i.e., #0, #1, #2, #3, and #4) are reused to identify the five sets of branches for the exchangeability parameters.

    As another example, the following two Newick trees with labels to identify nodes/branches would be equivalent:

    (((1 #1, 2 #1), 3), 4);
    (((1 , 2) #1, 3 #1) #1, 4 #1) #1;
    

    Either of them will fit the same model, which assumes that the base frequencies have been stationary until the ancestor of species 1 and 2, at which point the two lineages 1 and 2 have been drifting away from the old frequencies. Note that all branches and nodes are labelled, with the default to be #0. In particular, the root node in the first tree has the label #0.

    When setting options nhomo = 3, nhomo = 4, or nhomo = 5, the Expected Markov counting (EMC) method of Matsumoto et al. (2015) generates printouts under the heading Expected numbers of nucleotide changes on branches expected in the output file. This is available for the model of one rate for all sites only when fix_alpha = 1 and alpha = 0 are specified in the control file.

  • getSE: this variable can be set to getSE = 0, getSE = 1, or getSE = 2. These three options tell whether we want estimates of the standard errors of estimated parameters. These are crude estimates, calculated by the curvature method, i.e., by inverting the matrix of second derivatives of the log-likelihood with respect to parameters. The second derivatives are calculated by the difference method and are not always reliable. Even if this approximation is reliable, tests relying on the SE's should be taken with caution, as such tests rely on the normal approximation to the maximum likelihood estimates. The likelihood ratio test should always be preferred. Please specify getSE = 0 when tree-search is performed as this option is not available under this scenario.

  • RateAncestor: this variable can be set to 0 or 1. If RateAncestor = 1, the program is forced to do two additional analyses, which you can ignore if you do not need the results. First, under a model of variable rates across sites such as the gamma, RateAncestor = 1 forces the program to calculate rates for individual sites along the sequence (output that will be saved in the output rates file), using the empirical Bayes procedure (Yang and Wang 1995). Secondly, RateAncestor = 1 forces the program to perform the empirical Bayesian reconstruction of ancestral sequences (Yang et al. 1995b; Koshi and Goldstein 1996; [Yang 2006 pages 119-124(http://abacus.gene.ucl.ac.uk/CME/)]). Ancestral state reconstruction by parsimony (Fitch 1971; Hartigan 1973) is well known (i.e., it is implemented in the PAML program pamp). It can also be achieved using the likelihood/empirical Bayes approach. Often, the two approaches produce similar results, but the likelihood-based reconstruction has two advantages over parsimony: this method uses information from the branch lengths and the relative substitution rates between characters (nucleotides) and provides a measure of uncertainties in the form of posterior probabilities for reconstructed ancestral states.

    The results are listed, by site, in the output file rst. You can also use the variable verbose to control how much information you want to be written in this output file. If verbose = 0, only the best nucleotide (the one with the highest posterior probability) at each node at each site is listed, while with verbose = 1 (try 2 if 1 does not work), the full posterior probability distribution from the marginal reconstruction is listed. If the model is homogenous (i.e., if nhomo = 0 or nhomo = 1 have been specified in the control file) and assumes one rate for all sites, both the joint and marginal ancestral reconstructions will be calculated. If the model assumes variable rates among sites like the gamma model, only the marginal reconstructions are calculated. More details about ancestral sequence reconstruction in the section below

  • Small_Diff: this variable is a small value used in the difference approximation of derivatives. Use a value between $1e-8$ and $1e-5$ and check that the results are not sensitive to the value used by changing the variable Small_Diff in the control file, which will show whether the results are stable regardless of the value used.

  • cleandata: this variable controls whether ambiguity characters and gaps are to be removed from the input sequence file. If cleandata = 1, then sites involving ambiguity characters (undetermined nucleotides such as N, ?, W, R, Y, etc., anything other than the four nucleotides) or alignment gaps are removed from all sequences. If cleandata = 0 (i.e., the default option), those sites are kept. Please not that we recommend that sequences in the input sequence file are already aligned and columns or regions that are predominantly gaps or are hard to align have been dealt with prior to running BASEML; then use cleandata = 0 to preserve information in the data.

  • method: this variable controls the iteration algorithm for estimating branch lengths under a model of no clock. If method = 0, the old algorithm implemented in PAML is used, which updates all parameters including branch lengths simultaneously. The later algorithm implemented in PAML can be enabled by setting method = 1, an algorithm that updates branch lengths one by one. Please note that method = 1 does not work under the clock models (i.e., when specifying clock = 1, clock = 2, or clock = 3).

  • icode: this variable specifies the genetic code to be used for ancestral reconstruction of protein-coding DNA sequences. This is implemented to compare results of ancestral reconstruction with codon-based analysis. For example the "F3×4" codon model of Goldman and Yang (1994) is very similar to the nucleotide model HKY85 with different substitution rates and base frequencies for the three codon positions. The latter is implemented by using use options GC in the input sequence file (please check the "Data formatting" markdown file for details on how to specify this option in the input sequence file) together by including options model = 4 and Mgene = 4 in the control file. To use the option icode, you have to enable option RateAncestor = 1. Note that there the following genetic codes have been implemented in PAML, which are enabled by specifying values the setting the icode variable to the following options:

    • 0: universal,
    • 1: mammalian mt.
    • 2: yeast mt.
    • 3: mold mt.
    • 4: invertebrate mt.
    • 5: ciliate nuclear
    • 6: echinoderm mt.
    • 7: euplotid mt.
    • 8: alternative yeast nu.
    • 9: ascidian mt.
    • 10: blepharisma nu.


    NOTE
    These codes correspond to transl_table 1 to 11 of GENEBANK.

  • fix_blength: this variable tells the program what to do if the tree has branch lengths. When setting fix_blength = 0 (i.e., "ignore"), branch lengths in the input tree file (if they have been incorporated by the user) will be ignored and initial branch lengths are generated using pairwise distances and random numbers. If fix_blength = –1 (i.e., "random"), then random initial branch lengths will be used. This option generates very poor branch lengths and might be useful if there are multiple local optima on the likelihood surface. When specifying, fix_blength = 1 (i.e., "initial2), branch lengths in the tree file will be used as initial values for ML iteration. Please try to avoid incorporating in your input tree file branch lengths from a parsimony analysis, as those are numbers of changes for the entire sequence (rather than per site) and are very poor initial values. When setting fix_blength = 2 (i.e., "fixed"), branch lengths will be fixed at those values given in the tree file and not estimated by ML. In this case, you should make sure that the branch lengths are sensible; for example, if two sequences in the data file are different, but the branch lengths connecting the two sequences in the tree are all zero, the input sequence and the input tree will be in conflict and the program will abort. Lastly, when specifying fix_blength = 3 (i.e., "proportional"), branch lengths will be proportional to those given in the tree file, and the proportionality factor is estimated by ML.

Output file

The output should be self-explanatory. Descriptive statistics are always listed. The observed site patterns and their frequencies are listed, together with the proportions of constant patterns. Nucleotide frequencies for each species (and for each gene in case of multiple gene data) are counted and listed. Note that lmax = ln(Lmax) is the upper limit of the log likelihood and may be compared with the likelihood for the best (or true, if simulations) tree under the substitution model to test the model's goodness of fit to data (Goldman 1993; Yang et al. 1995a). You can ignore this if you do not know what it means.

The pairwise sequence distances are included in the output as well, and also in a separate file called 2base.t. Note that this matrix is a lower-diagonal distance matrix, readable by the NEIGHBOR program in Felsenstein's PHYLIP package (Felsenstein 2005). For models JC69, K80, F81, and F84, the appropriate distance formulas are used, while for more complex models, the TN93 formula is used. BASEML is mainly a maximum likelihood program, and the distance matrix is printed out for convenience and really has nothing to do with the later likelihood calculation.

If you enable variable getSE = 1 in the control file, the standard errors (S.E.s) are calculated as the square roots of the large sample variances and listed exactly below the parameter estimates. Zeros on this line mean errors, either caused by divergence of the algorithm or zero branch lengths. The S.E.s of the common parameters measure the reliability of the estimates. For example, $(\kappa − 1)/\textrm{SE}(\kappa)$, when $\kappa$ is estimated under K80, can be compared with a normal distribution to see whether there is real difference between K80 and JC69. The test can be more reliably performed by comparing the log-likelihood values under the two models, using the likelihood ratio test. It has to be stressed that the S.E.’s of the estimated branch lengths should not be misinterpreted as an evaluation of the reliability of the estimated tree topology (Yang 1994a).

If the tree file has more than one "tree block" (i.e., there are more than one tree in Newick format specified in the input tree file), the programs BASEML and CODEML will calculate the bootstrap proportions using the RELL method (Kishino and Hasegawa 1989), as well as the method of Shimodaira and Hasegawa (1999) with a correction for multiple comparison. The bootstrap resampling accounts for possible data partitions (option G in the input sequence file, please check the "Data formatting" markdown file for details on how to specify this option in the input sequence file).

Notes on ancestral sequence reconstruction (RateAncestor)

A good use of ancestral sequence reconstruction is to synthesise the inferred ancestral proteins and measure their biochemical properties in the lab (Pauling and Zuckerkandl 1963; Chang and Donoghue 2000; Thornton 2004). It is also very popular to use reconstructed ancestral sequences as if they were real observed data to perform further analysis. You should resist this irresistible temptation and use full likelihood methods if they are available (e.g., Yang 2002). See section 4.4.4 in Yang (2006) for a discussion of systematic biases in ancestral reconstruction.

For nucleotide-based analysis of protein coding DNA sequences (i.e., using BASEML with option GC in the input sequence file, see corresponding section in "Data formatting" markdown file for details on this format), the program also calculates the posterior probabilities of ancestral amino acids. In this analysis, branch lengths and other parameters are estimated under a nucleotide substitution model, but the reconstructed nucleotide triplets are treated as a codon to infer the most likely amino acid encoded. Posterior probabilities for stop codons are small and reset to zero to scale the posterior probabilities for amino acids. To use this option, you should add the variable icode in the control file baseml.ctl. Note that variable icode can take a value out that ranges from 0 to 11, which correspond to the 12 genetic codes included in PAML (see details for variable icode above).

A nucleotide substitution model that is very close to a codon-substitution model can be specified as follows. You add the option characters GC at the end of the first line in the input sequence file and set model = 4 (HKY85) and Mgene = 4 in the control file. The model then assumes different substitution rates, different base frequencies, and different transition/transversion rate ratio (£\kappa$) for the three codon positions. Ancestral reconstruction from such a nucleotide substitution should be very similar to codon-based reconstruction.

Ancestral reconstruction under nonhomogeneous models

I have added the option of joint ancestral reconstruction under the nonhomogeneous models. The variables that enable this feature are nhomo = 3 (N1 in YR1995), nhomo = 4 (N2 in YR1995), and nhomo = 5 (user-specified branch types) under the nucleotide substitution models enabled when specifying model = 3 (F84), model = 4 (HKY85), model = 5 (T92), model = 6 (TN93), or model = 7 (REV/GTR). This approach works only for the model of one rate for all sites, and does not work for the model of gamma rates for sites or the partition models (i..e, option G). Only joint reconstruction is available as the algorithm I used for marginal reconstruction does not work for nonhomogeneous models.

Marginal and joint reconstructions

Marginal reconstruction considers character assignments to one single interior node and the character with the highest posterior probability is the best reconstruction (eq. 4 in Yang et al. 1995b; or Eq. 4.15 in Yang 2006). The algorithm for marginal reconstruction implemented in PAML works under both the model of a constant rate for all sites and the gamma model of rates at sites. Joint reconstruction considers all ancestral nodes at the same time and the reconstruction (the set of characters at a site assigned to all interior nodes) with the highest posterior probability is the best reconstruction (eq. 2 in Yang et al. 1995b; or Eq. 4.16 in Yang 2006). The algorithm for joint reconstruction implemented in PAML is based on that of Pupko et al. (2000), which gives the best reconstruction at each site and its posterior probability. This method works under the model of one rate for all sites only (note that this approach works under the partition models).

The marginal and joint approaches use slightly different criteria, and expected to produce consistent results; that is, the most probable joint reconstruction for a site should almost always consist of characters that are also the best in the marginal reconstruction. Conflicts may arise in borderline cases where two or more reconstructions have similar posterior probabilities.