# Substitution models

## Nucleotide substitution models

For detailed descriptions of Markov models of nucleotide substitution, see [Whelan et al. (2001)](https://pubmed.ncbi.nlm.nih.gov/11335036/), [Felsenstein (2004)](https://www.journals.uchicago.edu/doi/abs/10.1086/423061?journalCode=qrb). or [Yang (2006: Chapter 1)](http://abacus.gene.ucl.ac.uk/CME/).

Models used in `PAML` include **JC69** ([Jukes and Cantor 1969](https://www.scirp.org/reference/ReferencesPapers?ReferenceID=1870036)), **K80** ([Kimura 1980](https://pubmed.ncbi.nlm.nih.gov/7463489/)), **F81** ([Felsenstein 1981](https://pubmed.ncbi.nlm.nih.gov/7288891/)), **F84** (Felsenstein 2005, [DNAML program since 1984, PHYLIP Version 2.6](https://phylipweb.github.io/phylip/index.html)), **HKY85** ([Hasegawa et al. 1984](https://www.jstage.jst.go.jp/article/pjab1977/60/4/60_4_95/_pdf); [Hasegawa et al. 1985](https://link.springer.com/article/10.1007/BF02101694)), **T92** ([Tamura 1992](https://pubmed.ncbi.nlm.nih.gov/1630306/)), **TN93** ([Tamura and Nei 1993](https://pubmed.ncbi.nlm.nih.gov/8336541/)), and **REV**, also known as the GTR model for general-time-reversible ([Yang 1994c](https://pubmed.ncbi.nlm.nih.gov/8064867/); [Zharkikh 1994](https://pubmed.ncbi.nlm.nih.gov/7932793/)). The rate matrices are parameterised as follows:

> **JC69**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & 1 & 1 & 1\\
1 & . & 1 & 1\\
1 & 1 & . & 1\\
1 & 1 & 1 & .
\end{array}\right)$$

> **K80**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & \kappa & 1 & 1\\
\kappa & . & 1 & 1\\
1 & 1 & . & \kappa\\
1 & 1 & \kappa & .
\end{array}\right)$$

> **F81**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & \pi_{C} & \pi_{A} & \pi_{G}\\
\pi_{T} & . & \pi_{A} & \pi_{G}\\
\pi_{T} & \pi_{C} & . & \pi_{G}\\
\pi_{T} & \pi_{C} & \pi_{A} & .
\end{array}\right)$$

> **F84**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & (1+\kappa/\pi_{Y})/\pi_{C} & \pi_{A} & \pi_{G}\\
(1+\kappa/\pi_{Y})/\pi_{T} & . & \pi_{A} & \pi_{G}\\
\pi_{T} & \pi_{C} & . & (1+\kappa/\pi_{R})/\pi_{G}\\
\pi_{T} & \pi_{C} & (1+\kappa/\pi_{R})/\pi_{A} & .
\end{array}\right)$$
$$\textrm{with }\pi_{Y}=\pi_{T}+\pi_{C}\textrm{and}\pi_{R}=\pi_{A}+\pi_{G}.$$

> **HKY85**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & \kappa\pi_{C} & \pi_{A} & \pi_{G}\\
\kappa\pi_{T} & . & \pi_{A} & \pi_{G}\\
\pi_{T} & \pi_{C} & . & \kappa\pi_{G}\\
\pi_{T} & \pi_{C} & \kappa\pi_{A} & .
\end{array}\right)$$

> **T92**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & \kappa(1-\pi_{GC})/2 & \pi_{GC}/2 & \pi_{GC}/2\\
\kappa(1-\pi_{GC})/2 & . & \pi_{GC}/2 & \pi_{GC}/2\\
(1-\pi_{GC})/2 & (1-\pi_{GC})/2 & . & \kappa\pi_{GC}/2\\
(1-\pi_{GC})/2 & (1-\pi_{GC})/2 & \kappa\pi_{GC}/2 & .
\end{array}\right)$$

> **TN93**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & \kappa_{1}\pi_{C} & \pi_{A} & \pi_{G}\\
\kappa_{1}\pi_{T} & . & \pi_{A} & \pi_{G}\\
\pi_{T} & \pi_{C} & . & \kappa_{2}\pi_{G}\\
\pi_{T} & \pi_{C} & \kappa_{2}\pi_{A} & .
\end{array}\right)$$

> **REV (GTR)**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & a\pi_{C} & b\pi_{A} & c\pi_{G}\\
a\pi_{T} & . & d\pi_{A} & e\pi_{G}\\
b\pi_{T} & d\pi_{C} & . & \pi_{G}\\
c\pi_{T} & e\pi_{C} & \pi_{A} & .
\end{array}\right)$$

> **UNREST**

$$\mathrm{Q=}\left(\begin{array}{cccc}
. & q_{TC} & q_{TA} & q_{TG}\\
q_{CT} & . & q_{CA} & q_{CG}\\
q_{AT} & q_{AC} & . & q_{AG}\\
q_{GT} & q_{GC} & q_{GA} & .
\end{array}\right)=\left(\begin{array}{cccc}
. & a & b & c\\
d & . & e & f\\
g & h & . & i\\
j & k & l & .
\end{array}\right)$$

## Codon substitution models

There is now a large collection of codon substitution models, please see [Yang and Bielawski (2000)](https://pubmed.ncbi.nlm.nih.gov/11114436/), [Yang (2002)](https://pubmed.ncbi.nlm.nih.gov/12433583/) and [Yang (2006: Chapter 8)](http://abacus.gene.ucl.ac.uk/CME/) for detailed discussions. The basic model is a simplified version of the model of [Goldman and Yang (1994)](https://pubmed.ncbi.nlm.nih.gov/7968486/) and specifies the substitution rate from codon $i$ to codon $j$ as

$$q_{ij}=\begin{cases}
0, & \textrm{if the two codons differ at more than one position,}\\
\pi_{j}, & \textrm{for synonymous transversion,}\\
\kappa\pi_{j}, & \textrm{for synonymous transition,}\\
\omega\pi_{j}, & \textrm{for nonsynonymous transversion,}\\
\omega\kappa\pi_{j}, & \textrm{for nonsynonymous transition;}
\end{cases}$$

<!-- Alternative way in case the first does not render properly
```math
q_{ij}=\begin{cases}
0, & \textrm{if the two codons differ at more than one position,}\\
\pi_{j}, & \textrm{for synonymous transversion,}\\
\kappa\pi_{j}, & \textrm{for synonymous transition,}\\
\omega\pi_{j}, & \textrm{for nonsynonymous transversion,}\\
\omega\kappa\pi_{j}, & \textrm{for nonsynonymous transition,}
\end{cases}
```
-->

([Yang et al. 1998](http://abacus.gene.ucl.ac.uk/ziheng/pdf/1998YangNielsenHasegawaMBEv15p1600.pdf)). The equilibrium frequency of codon $j$ ($\pi_{j}$) can be considered a free parameter, but can also be calculated from the nucleotide frequencies at the three codon positions (control variable `CodonFreq`). Under this model, the relationship holds that $\omega=d_{N}/d_{S}$, the ratio of nonsynonymous/synonymous substitution rates. This basic model is fitted by specifying `model = 0` and `NSsites = 0`, in the control file to run `CODEML` (e.g., `codeml.ctl`).

The $\omega$ ratio is a measure of natural selection acting on the protein. Simplistically, values of $\omega < 1, = 1, \textrm{and} > 1$ means negative purifying selection, neutral evolution, and positive selection, respectively. However, the ratio averaged over all sites and all lineages is almost never $> 1$, since positive selection is unlikely to affect all sites over prolonged time. Thus, interest has been focused on detecting positive selection that affects only some lineages or some sites.

### Branch models

The **branch models** allow the **$\omega$ ratio to vary among branches in the phylogeny** and are **useful for detecting positive selection acting on particular lineages** ([Yang 1998](https://pubmed.ncbi.nlm.nih.gov/9580986/);[Yang and Nielsen 1998](https://pubmed.ncbi.nlm.nih.gov/9541535)). They are specified using option `model`. Specifically, when setting `model = 1` the so-called free-ratios model is enabled, which assumes an independent $\omega$ ratio for each branch. This model is very parameter-rich and its use is discouraged. On the other hand, `model = 2` allows you to have several $\omega$ ratios. Note that you have to specify how many ratios and which branches should have which rates in the tree file by using branch labels. To better understand the notation you need to use for this purpose, please see [“Branch or node labels”](Data-formatting.md#branch-or-node-labels) under section [“Tree file format”](Data-formatting.md#tree-files-produced-by-paup-and-macclade) when you navigate through the ["Data formatting" markdown document](Data-formatting.md). The lysozyme example data files are included in the [`examples/lysozyme/` folder](https://github.com/abacus-gene/paml/blob/master/examples/lysozyme/); check [the `README.txt` file](https://github.com/abacus-gene/paml/blob/master/examples/lysozyme/README.txt).

### Site models

The **site models** allow the **$\omega$ ratio to vary among sites** (among codons or amino acids in the protein) ([Nielsen and Yang 1998](https://pubmed.ncbi.nlm.nih.gov/9539414/); [Yang et al. 2000b](https://pubmed.ncbi.nlm.nih.gov/10790415)). A number of such models are implemented in `codeml` using the variable `Nssites` together with option `model = 0`. You can run several `Nssites` models in one go (i.e., batch run), by specifying several values for `NSsites`. For example, `NSsites = 0 1 2 7 8` will fit 5 site models to the same data in one go.

Site models have been used in real data analyses and evaluated in computer simulation studies ([Anisimova et al. 2001](https://pubmed.ncbi.nlm.nih.gov/11470850/), [2002](https://academic.oup.com/mbe/article/19/6/950/1094968); [Anisimova et al. 2003](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1462615/); [Wong et al. 2004](https://academic.oup.com/genetics/article/168/2/1041/6059620)). **Two pairs of models appear to be particularly useful**, forming **two likelihood ratio tests of positive selection**. The first one compares `M1a` (NearlyNeutral) and `M2a` (PositiveSelection), while the
second one compares `M7` (beta) and `M8` (beta&$\omega$).

> [!IMPORTANT]
> Please note that models `M1a` (NearlyNeutral) and `M2a` (PositiveSelection) are slight modifications of models `M1` (neutral) and `M2` (selection) in Nielsen and [Yang (1998)](https://pubmed.ncbi.nlm.nih.gov/9539414). The old models `M1` and `M2` fix $\omega_{0}=1$ and $\omega_{1}=1$, and are unrealistic as they do not account for sites with $0<\omega < 1$. In the new models `M1a` and `M2a`, described in [Wong et al. (2004)](https://academic.oup.com/genetics/article/168/2/1041/6059620) and [Yang et al. (2005)](https://academic.oup.com/mbe/article/22/4/1107/1083468) and implemented in `PAML` since v3.14 (i.e., since 2004), $0< \omega_{0} < 1$ is estimated from the data, while $\omega_{1}=1$ is fixed.

The naïve empirical Bayes (NEB) ([Nielsen and Yang 1998](https://pubmed.ncbi.nlm.nih.gov/9539414); [Yang et al. 2000b](https://pubmed.ncbi.nlm.nih.gov/10790415)) and the Bayes empirical Bayes (BEB) ([Yang et al. 2005](https://academic.oup.com/mbe/article/22/4/1107/1083468), implemented since v3.14) are available for calculating the posterior probabilities for site classes, and can be used to identify sites under positive selection if the likelihood ratio test is significant. NEB uses the MLEs of parameters (such as the proportions and $\omega$ ratios) but do not account for their sampling errors, while **BEB deals with the sampling errors by applying a Bayesian prior**. In both tests, comparing `M2a` against `M1a` or `M8` against `M7`, $df = 2$ may be used (i.e., the degree of freedom, $df$, in the chi-square test is the difference in the number of free parameters between the two models being compared). The `M1a`-`M2a` comparison appears to be more stringent than the `M7`-`M8` comparison. **BEB is implemented under models `M2a` and `M8` only**.

> [!IMPORTANT]
> Please note that the NEB results you shall see in the output files generated by the `CODEML` program are to be disregarded; you should only consider the BEB results (see [Yang et al. 2005](https://academic.oup.com/mbe/article/22/4/1107/1083468) and [Álvarez-Carretero et al. 2023](https://pubmed.ncbi.nlm.nih.gov/37096789/)).

----

> The BEB output has the following format:

```text
Prob(w>1) mean w
135 K 0.983* 4.615 +- 1.329
```

> Interpretation: 4.615 is the approximate mean of the posterior distribution for $\omega$, and 1.329 is the square root of the variance in the posterior distribution for $\omega$. The program prints out an `*` if the posterior probability is >95%, and `**` if the probability is > 99%.

----

A third test compares the null hypothesis `M8a` (beta&$\omega_{s}$ =1) and `M8` ([Swanson et al. 2003](https://pubmed.ncbi.nlm.nih.gov/12519901/); [Wong et al. 2004](https://academic.oup.com/genetics/article/168/2/1041/6059620)). `M8a` is specified by setting the following options in the control file: `NSsites = 8`, `fix_omega = 1`, `omega = 1`. The null distribution is the
50:50 mixture of point mass 0 and $\chi_{1}^{2}$ ([Self and Liang 1987](https://www.tandfonline.com/doi/abs/10.1080/01621459.1987.10478472)). The critical values are 2.71 at 5% and 5.41 at 1% (as opposed to 3.84 for 5% and 6.63 for 1% for $\chi_{1}^{2}$). Note that the $p$ value for a 50:50 mixture of $\chi_{j}^{2}$ and $\chi_{k}^{2}$ is just the average of the two $p$ values from the two distributions. Un the case of `M8a`-`M8` comparison, you get the $p$ value from $\chi_{1}^{2}$ and then half it to get the $p$ value for the mixture distribution. You can also use $\chi_{1}^{2}$ ([Wong et al. 2004](https://academic.oup.com/genetics/article/168/2/1041/6059620)).

We suggest that the **`M0`-`M3` comparison** should be used as a **test of variable $\omega$ among sites rather than a test of positive selection**. However, the model of a single $\omega$ for all sites is probably wrong in every functional protein, so there is little point of testing.

For more details on the site models, please see the table below:

> **Table 2**. Parameters in the site models.

|     Model              |     `NSsites`    |     #p    |     Parameters                                                        |     Note                                                                     |
|------------------------|----------------|-----------|-----------------------------------------------------------------------|------------------------------------------------------------------------------|
|     `M0` (one ratio)     |     0          |     1     |     $\omega$                                                                 | ([Goldman and Yang 1994](https://pubmed.ncbi.nlm.nih.gov/7968486/); [Yang and Nielsen 1998](https://pubmed.ncbi.nlm.nih.gov/9541535))                               |
|     `M1a` (neutral)      |     1          |     2     |     $p_{0}$ $(p_{1}= 1 – p_{0})$, $\omega_{0}< 1$, $\omega_{1}=1$                         | ([Nielsen and Yang 1998](https://pubmed.ncbi.nlm.nih.gov/9539414/); [Yang et al. 2005](https://academic.oup.com/mbe/article/22/4/1107/1083468))                                    |
|     `M2a` (selection)    |     2          |     4     |     $p_{0}$, $p_{1}$ $(p_{2} = 1 – p_{0} – p_{1})$, $\omega_{0}< 1$, $w_{1}=1$, $\omega_{2} > 1$    | ([Nielsen and Yang 1998](https://pubmed.ncbi.nlm.nih.gov/9539414/);  [Yang et al. 2005](https://academic.oup.com/mbe/article/22/4/1107/1083468))                                    |
|     `M2a_rel`            |     22         |     4     |     $p_{0}$, $p_{1}$ $(p_{2} = 1 – p_{0} – p_{1})$, $\omega_{0}< 1$, $w_{1}=1$, $\omega_{2} > 0$    | $\omega_{2}>0$ for use as null for testing the clade model ([Weadick and Chang 2012](https://pubmed.ncbi.nlm.nih.gov/22319160/)) |
|     `M3` (discrete)      |     3          |     5     |     $p_{0}$, $p_{1}$ $(p_{2} = 1 – p_{0} – p_{1})$ $\omega_0$, $\omega{1}$, $\omega_{2}$                      | ([Yang et al. 2000b](https://pubmed.ncbi.nlm.nih.gov/10790415))
|     `M7` (beta)          |     7          |     2     |     $p$, $q$                                                              | ([Yang et al. 2000b](https://pubmed.ncbi.nlm.nih.gov/10790415))                                                          |
|     `M8` (beta&$\omega$)        |     8          |     4     |     $p_{0}$ $(p_{1}= 1 – p_{0})$, $p$, $q$, $\omega_{s} > 1$                              | ([Yang et al. 2000b](https://pubmed.ncbi.nlm.nih.gov/10790415))                                                          |

>> NOTE⎯ #p is the number of free parameters in the $\omega$ distribution. **Parameters in parentheses are not free** and should not be counted. For example, in `M1a`, $p_{1}$ is not a free parameter as $p_{1} = 1 – p_{0}$. In both likelihood ratio tests comparing `M1a` against `M2a` and `M7` against `M8`, **$df = 2$**. The site models are specified via option `NSsites` in the control file that executes `CODEML`.

#### Suzuki and Gojobori’s (1999) method for detecting sites under positive selection.

In the terminology used here, the `SG99` method tests whether the $\omega$ ratio for each site is $>1$ or $<1$. This approach uses parsimony to reconstruct ancestral sequences and then, for each site, counts the numbers of synonymous and nonsynonymous differences and the numbers of synonymous and nonsynonymous sits. Afterwards, the method tests whether the $\omega$ ratio at the site is significantly different from 1. Errors in the ancestral sequence reconstruction are ignored. Suzuki has a program called `AdaptSite` that implements the test. This is an intuitive test, and is known to lack power.

In `CODEML`, a test of this kind is implemented as a by-product of ancestral sequence reconstruction in both `CODEML` and `BASEML`, `PAML` programs that use ML to reconstruct ancestral sequences. If you want to enable that feature, please use option `RateAncestor = 1`. The choice of `BASEML` versus `CODEML` and also the choice of substitution model for each program affects ancestral sequence reconstruction only. The later steps are the same, and follow Suzuki & Gojobori (1999). For `CODEML`, you can use `M0` (i.e., specifying options `NSsites = 0` and `model = 0` in the control file). If you want, you can try some other models, such as `NSsites = 2` or `8`. The models for ancestral reconstruction typically make little difference. For `BASEML`, you should have `GC` on the first line of the sequence data file to indicate that the sequences are protein coding (i.e., please check section ["Option G in the "Data formatting" markdown file](Data-formatting.md#option-g) to familiarise with the notation). Use option `icode` (set to `0` for the universal code or `1` for vertebrate mitochondrial code) in the control file to specify the genetic code, as in `CODEML`. The following “multiple-gene” model is close to `M0`: you need to specify `model = 4` and `Mgene = 4` in the control file (see [Yang 1996b](https://pubmed.ncbi.nlm.nih.gov/8662011/)).

By specifying `NSsites = 22`, the site model `M2a_rel` of [Weadick & Chang (2012)](https://pubmed.ncbi.nlm.nih.gov/22319160/). This is the same model as `M2a` except for the fact that $\omega_{2} > 0$ (as mentioned above, model `M2a` is enabled by setting `NSsites = 2` and has $\omega_{2} > 1$). Note that **`M2a_rel` is the null model for the likelihood ratio test based on clade model C** ([Weadick and Chang 2012](https://pubmed.ncbi.nlm.nih.gov/22319160/)).

### Branch-site models

The branch-site models allow $\omega$ to vary both among sites in the protein and across branches on the tree and aim to detect positive selection affecting a few sites along particular lineages (called **_foreground branches_**). Initially, [Yang and Nielsen (2002)](https://academic.oup.com/mbe/article/19/6/908/1094851) implemented the **branch-site model A** (i.e., enabled by specifying in the control file options `model = 2` and `NSsites = 2`) and the **branch-site model B** (i.e., specifying in the control file options `model = 2` and `NSsites = 3`). The tests did not work well in simulations ([Zhang 2004](https://academic.oup.com/mbe/article/21/7/1332/1080342)), so **a change was introduced to model A** (see Table 3 below, also [Yang et al. 2005](https://academic.oup.com/mbe/article/22/4/1107/1083468); [Zhang et al. 2005](https://academic.oup.com/mbe/article/22/12/2472/1009544)), with two tests constructed. **Test 2**, also known the **branch-site test of positive selection**, is the **recommended test**. Specifically, tis test compares the modified model A with the corresponding null model with $\omega_{2}=1$ fixed (i.e., you can enable the null model by specifying in the control file options `fix_omega = 1` and `omega = 1`). The null distribution is the 50:50 mixture of point mass 0 and $\chi_{1}^{2}$, with critical values 2.71 at 5% and 5.41 at 1%. To calculate the $p$ value based on this mixture distribution, you calculate the $p$ value using $\chi_{1}^{2}$, and then divide it by 2. Suppose your $2\Delta\ell=2.71$, you will get 0.10 from  $\chi_{1}^{2}$, the the $p$ value for the mixture is 0.10/2 = 5%. **We recommend that you use  $\chi_{1}^{2}$ (with critical values 3.84 and 5.99) instead of the mixture to guide against violations of model assumptions**. Similarly, both the NEB and BEB methods for calculating posterior probabilities for site classes are implemented for the modified branch-site model A (not for model B).

> [!IMPORTANT]
> **You should use the branch-site model A in combination with the BEB procedure and ignore the NEB output**.

> **Table 3**. Parameters in branch-site model A ($np = 4$).

|     Site class    |     Proportion                    |     Background    |     Foreground      |
|-------------------|-----------------------------------|-------------------|---------------------|
|     0             |     $p_{0}$                            |     $0 < \omega_{0} < 1$    |     $0 < \omega_{0}   < 1$    |
|     1             |     $p_{1}$                            |     $\omega_{1} = 1$        |     $\omega_{1} = 1$          |
|     2a            |     $(1 – p_{0} – p_{1}) p_{0}/(p_{0} + p_{1})$    |     $0 < \omega_{0} < 1$    |     $\omega_{2}\geq1$          |
|     2b            |     $(1 – p_{0} – p_{1}) p_{1}/(p_{0} + p_{1})$    |     $\omega_{1} = 1$        |     $\omega_{2}\geq1$          |

>> NOTE⎯ Branch-site model A is specified using options `model = 2` and `NSsites = 2` in the control file. This is the **alternative model** in the **branch-site test of positive selection**. The **null model fixes $\omega_{2} = 1$**. The likelihood ratio test has **df = 1** (see text).

### Clade models

**Clade model C** is specified by setting in the control file options `model = 3` and `Nssites = 2`, while **clade model D** is specified by setting options `model = 3` and `NSsites = 3` ([Bielawski and Yang 2004](https://pubmed.ncbi.nlm.nih.gov/15383915/); see also [Forsberg and Christiansen 2003](https://pubmed.ncbi.nlm.nih.gov/12777510/)). In both models, the number of site classes (option `ncatG` in the control file) is fixed at `3` (i.e., `ncatG = 3`).

**Clade model C went through a modification**, in a similar way to branch-site model A. The **new model C** replaces $\omega_{0} = 0$ by **$0 < \omega{0} < 1$** and has **5 parameters in the $\omega$ distribution** (if there are two clades or branch types): $p_{0}$, $p_{1}$, $\omega_{0}$, $\omega_{2}$, and $\omega_{3}$. The new model C can be compared with the new `M1a` (NearlyNeutral), which has 2 parameters, with $df=3$. A more powerful test is to use the site model **`M2a_rel`** of [Weadick & Chang (2012)](https://pubmed.ncbi.nlm.nih.gov/22319160/) as the **null model**, with $df = 1$.

> **Table 4**. Parameters in clade models C or D, with more than two clades (branch types)

|     Site class    |     Proportion          |     Clade 1    |     Clade 2    |     Clade 3    |     Clade 4    |
|-------------------|-------------------------|----------------|----------------|----------------|----------------|
|     0             |     $p_{0}$                  |     $\omega_{0}$         |     $\omega_{0}$         |     $\omega_{0}$         |    $\omega_{0}$         |
|     1             |     $p_{1}$                  |     $\omega_{1}$         |     $\omega_{1}$         |     $\omega_{1}$         |     $\omega_{1}$         |
|     2             |     $p_{2} = 1 – p_{0} – p_{1}$    |     $\omega_{2}$         |     $\omega_{3}$         |     $\omega_{4}$         |     $\omega_{5}$         |

In **clade model C**, $\omega_{1} = 1$ is fixed, while in **clade model D**, $\omega_{1}$ is estimated as a free parameter. In both models, $\omega_{0}$ is estimated under the constraint $0 < \omega_{0} < 1$. The **BEB procedure** is implemented for **clade models C and D**. As aforementioned, please **ignore the results output by the NEB, only those output by the BEB are to be used**. An **extension** has been made to **clade models C and D to allow for more than two clades or branch types**. The **branch types** are specified using **labels** in the **tree file**. For instance, if you have four branch types, you would use labels `#0`, `#1`, `#2`, `#3` to identify each branch type.

When there are more than two clades, you can see that site class 2 (Table 4) has $\omega_{2}$, $\omega_{3}$, $\omega_{4}$, and $\omega_{5}$ as independent parameters, optimised in the range $(0, \infty)$. The BEB calculation under clade model C is very expensive (with each additional branch type increasing the computation by 10 folds), so that the model should be used with just a few branch types. **The maximum number of clades is set by the variable `NBTYPE` near the beginning of the source file `codeml.c`**. Note that the BEB calculation may be too expensive if you have more than 5 clades. As of February 2024, this variable is set to 8 so, if you want to reduce/increase this number, **you can edit the source file `codeml.c` and recompile again**.

### Mutation-selection model

The **mutation-selection models** of [Yang and Nielsen (2008)](https://pubmed.ncbi.nlm.nih.gov/18178545/) are implemented using the following control variables

```text
CodonFreq = 7    * 0:1/61 each, 1:F1X4, 2:F3X4, 3:codon table
                 * 4:F1x4MG, 5:F3x4MG, 6:FMutSel0, 7:FMutSel
estFreq = 0      * 0: use observed freqs; 1: estimate freqs by ML
```

where **`CodonFreq = 6` specifies the "FMutSel0"** model and **`CodonFreq = 7` specifies the "FmutSel"** model ([Yang and Nielsen 2008](https://pubmed.ncbi.nlm.nih.gov/18178545/)). The **"FmutSel"** model assigns a **fitness to every codon with 60** (= 61 – 1) **codon fitness parameters for the universal genetic code**. The **"FmutSel0" model** is a special case of "FmutSel" and assigns the **same fitness value for all synonymous codons**, so that only *19* (= 20 – 1) **amino acid fitness parameters are used**. This model assumes that the amino acid frequencies are determined by the functional requirements of the protein, and given the amino acid frequencies, the relative frequencies of synonymous codons are determined solely by the mutational-bias parameters.

If **`estFreq = 1`** is specified in the control file, then the **frequency/fitness parameters are estimated by ML from the data**, while if `estFreq = 0` is specified, they are calculated using the **observed frequencies in the data**. See [Yang and Nielsen (2008)](https://pubmed.ncbi.nlm.nih.gov/18178545/) for details of the model. You can also take a look at the [`README.txt` file](https://github.com/abacus-gene/paml/blob/master/examples/mtCDNA/README.txt) in the [`examples/mtCDNAape/` folder](https://github.com/abacus-gene/paml/tree/master/examples/mtCDNA) to see how to duplicate the results reported in Table 1 in [Yang and Nielsen (2008)](https://pubmed.ncbi.nlm.nih.gov/18178545/).

## Amino acid substitution models

I made a distinction between empirical and mechanistic models of amino acid substitution ([(Yang et al. 1998](https://pubmed.ncbi.nlm.nih.gov/9580986/); [2006: Chapter 2](http://abacus.gene.ucl.ac.uk/CME/)). Empirical models implemented in codeml include `Dayhoff` ([Dayhoff et al. 1978](https://chagall.med.cornell.edu/BioinfoCourse/PDFs/Lecture2/Dayhoff1978.pdf)), `JTT` ([Jones et al. 1992](https://pubmed.ncbi.nlm.nih.gov/1633570/)), `WAG` ([Whelan and Goldman 2001](https://academic.oup.com/mbe/article/18/5/691/1018653)), `mtMam` ([Yang et al. 1998](http://abacus.gene.ucl.ac.uk/ziheng/pdf/1998YangNielsenHasegawaMBEv15p1600.pdf)), `mtREV` ([Adachi and Hasegawa 1996b](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=162bf7c3da199f2020dde00b510c9f0bc857c757)), etc. These are estimates of substitution rate parameters under the general time reversible model from real datasets.

Mechanistic models are formulated by considering the mutational distance between the amino acids as determined by the locations of their encoding codons in the genetic code, and the filtering of
mutations by natural selection operating on the protein level ([Yang et al. 1998](http://abacus.gene.ucl.ac.uk/ziheng/pdf/1998YangNielsenHasegawaMBEv15p1600.pdf)). The program `aaml` (i.e., which is now launched with `CODEML` when selecting `seqtype = 2`) implements a few such models, specified by the variable `aaDist`.

When specifying options `seqtype = 2` and `model = 6` (`FromCodon1`), the mechanistic amino acid substitution model of Yang et al. ([1998](http://abacus.gene.ucl.ac.uk/ziheng/pdf/1998YangNielsenHasegawaMBEv15p1600.pdf); Table 3) is enabled, which uses the Markov chain for codons and aggregates the synonymous codons into one state (the coded amino acid) to construct a model of amino acid substitution. This is an approximate formulation since the process after state aggregation is not Markovian. The model involves the parameter $\kappa$, but $\omega$ is not estimable. Branch lengths are in the expected number of amino acid substitutions per amino acid.

When specifying options `seqtype = 2` and `model = 5` (`FromCodon0`), another codon-based amino acid substitution model is enabled, which treats amino acids as ambiguities codons. This is an exact formulation. This model involves both parameters $\kappa$ and $\omega$, but because $\omega$ is weakly identifiable, it cannot be reliably estimated. As a result, branch lengths are not reliably estimated either. Branch lengths are in the expected number of nucleotide substitutions per codon.