Overview

sabifo4 edited this page Nov 13, 2024 · 3 revisions

General overview

PAML documentation

Besides this manual, please note that you can always consult the following additional resources:

  • Ziheng Yang Lab's website: this website has information about downloading and compiling PAML programs too.
  • PAML FAQ page: document that compiles various FAQs since PAML 4 was released. Last update: 2005/01/05.
  • PAML discussion group: if you have any questions about using PAML programs, please post them on this Google discussion group rather than opening new issues on this GitHub repository. Issues should be used strictly for technical problems with PAML programs.

What PAML programs can do

The PAML package currently includes the following programs: BASEML, basemlg, CODEML, evolver, pamp, yn00, MCMCtree, and chi2. A brief overview of the most commonly used models and methods implemented in PAML is provided by Yang (2007). The book Yang (2006) describes the statistical and computational details. Examples of analyses that can be performed using the package include the following:

  • Comparison and tests of phylogenetic trees (BASEML and CODEML).
  • Estimation of parameters in sophisticated substitution models, including models of variable rates among sites and models for combined analysis of multiple genes or site partitions (BASEML and CODEML).
  • Likelihood ratio tests (LRTs) of hypotheses through comparison of implemented models (BASEML, CODEML, chi2).
  • Estimation of divergence times under global and local clock models (BASEML and CODEML).
  • Likelihood (Empirical Bayes) reconstruction of ancestral sequences using nucleotide, amino acid, and codon models (BASEML and CODEML).
  • Generation of datasets of nucleotide, codon, and amino acid sequences by Monte Carlo simulation (evolver).
  • Estimation of synonymous and nonsynonymous substitution rates and detection of positive selection in protein-coding DNA sequences (yn00 and CODEML).
  • Bayesian estimation of species divergence times incorporating uncertainties in fossil calibrations (MCMCtree).

The strength of PAML is its collection of sophisticated substitution models. The tree search algorithms implemented in BASEML and CODEML are rather primitive, so except for very small datasets (say, fewer than 10 species), you are better off using other software such as raxml-ng, IQ-TREE, PhyloBayes, or MrBayes to infer the tree topology or topologies, which you can then supply to BASEML or CODEML as input trees for evaluation.

  • BASEML and CODEML: The program BASEML is for maximum likelihood analysis of nucleotide sequences. The program CODEML is formed by merging two old programs: codonml, which implements the codon substitution model of Goldman and Yang (1994) for protein-coding DNA sequences, and aaml, which implements models for amino acid sequences. These two are now distinguished by the variable seqtype in the control file codeml.ctl, with 1 for codon sequences and 2 for amino acid sequences. In this document, I use codonml and aaml to refer to CODEML with seqtype = 1 and seqtype = 2, respectively. The programs BASEML and CODEML use similar algorithms to fit models by maximum likelihood. The main difference is the unit of evolution in the Markov model, referred to as a "site" in the sequence: it is a nucleotide in BASEML, a codon in codonml, and an amino acid in aaml. Markov process models are used to describe substitutions between nucleotides, codons, or amino acids, with substitution rates assumed to be either constant or variable among sites.
  • evolver: This program can be used to simulate sequences under nucleotide, codon, and amino acid substitution models. It also has some other options such as generating random trees and calculating the partition distances (Robinson and Foulds 1981) between trees.
  • basemlg: This program implements the (continuous) gamma model of Yang (1993). It is very slow and infeasible for data sets of more than 6 or 7 species. The discrete-gamma model in BASEML, described in Yang (1994), should be used instead.
  • MCMCtree: This program implements the Bayesian MCMC algorithm of Yang and Rannala (2006) and Rannala and Yang (2007) for estimating species divergence times.
  • pamp: This program implements the parsimony-based analysis of Yang and Kumar (1996).
  • yn00: This program implements the method of Yang and Nielsen (2000) for estimating synonymous and nonsynonymous substitution rates (dS and dN) in pairwise comparisons of protein-coding DNA sequences.
  • chi2: This program calculates the $\chi^{2}$ critical value and p-value for conducting the likelihood ratio test. Run the program by typing its name: chi2. It then prints the critical values for different degrees of freedom (d.f.); for example, the 5% critical value with d.f. = 1 is 3.84. If you run the program with one command-line argument, it enters a loop that asks you to input the d.f. and the test statistic and then calculates the p-value. A third way of running the program is to give both the d.f. and the test statistic as command-line arguments. For instance:
chi2
chi2 p
chi2 1 3.84
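
To make the arithmetic behind the likelihood ratio test concrete, the following is a minimal Python sketch, not the chi2 source code. It computes the test statistic as twice the log-likelihood difference between two nested models and converts it to a p-value with the upper tail of the $\chi^{2}$ distribution. The function names and the log-likelihood values are hypothetical and chosen only for illustration.

```python
import math


def chi2_sf(statistic: float, df: int) -> float:
    """Upper-tail probability P(X >= statistic) for a chi-square variable with df d.f."""
    if df < 1 or statistic < 0:
        raise ValueError("df must be >= 1 and statistic must be >= 0")
    z = statistic / 2.0
    # Closed-form starting points of the regularized upper incomplete gamma
    # function: Q(1/2, z) = erfc(sqrt(z)) for odd d.f., Q(1, z) = exp(-z) for even d.f.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1),
    # applied until a reaches df / 2.
    while a < df / 2.0:
        if z > 0.0:
            q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q


def lrt_p_value(lnl_null: float, lnl_alt: float, df: int) -> float:
    """P-value of the LRT comparing a null model nested within an alternative model."""
    statistic = 2.0 * (lnl_alt - lnl_null)
    # Numerical noise can make the statistic slightly negative; treat it as 0.
    return chi2_sf(max(statistic, 0.0), df)


# Hypothetical log-likelihoods (not taken from any PAML example).
p = lrt_p_value(lnl_null=-100.00, lnl_alt=-98.08, df=1)
print(f"2*delta(lnL) = 3.84, d.f. = 1, p = {p:.4f}")
```

With a statistic of 3.84 and d.f. = 1, the p-value is approximately 0.05, in line with the 5% critical value quoted above.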

What PAML programs cannot do

There are many things that you might well expect a phylogenetics package to do, but that PAML cannot. Below, you can find a partial list of such limitations, provided in the hope that it might help you avoid wasting time.

  • Sequence alignment: You should use another program such as Muscle5, mafft, or BAli-Phy (to name just a few; there are many more) to align the sequences automatically. Manual adjustment does not seem to have reached a stage mature enough to be entirely trustworthy, so you should always do it with care. If you are constructing thousands of alignments in a genome-wide analysis, you should implement some quality control and, say, calculate some measure of sequence divergence as an indication of the unreliability of the alignment. For coding sequences, you might align the protein sequences and construct the DNA alignment based on the protein alignment. Note that, if cleandata = 0, both ambiguity characters and alignment gaps are treated as ambiguity characters in BASEML and CODEML. If cleandata = 1, all sites with ambiguity characters or alignment gaps are removed from all sequences before analysis.
  • Gene prediction: The codon-based analysis implemented in CODEML (seqtype = 1) assumes that the sequences are pre-aligned exons, the sequence length is an exact multiple of 3, and the first nucleotide in the sequence is codon position 1. Introns, spacers, and other non-coding regions must be removed and the coding sequences must be aligned before running the program. The program cannot process sequences downloaded directly from GenBank, even though the CDS information is there, nor predict coding regions.
  • Tree search in large data sets: As mentioned earlier, you should use another program to get a tree or some candidate trees and use them as user trees to fit models that might not be available in other packages.
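
The requirements listed above for codon data (aligned sequences whose length is an exact multiple of 3) and the suggestion to compute a divergence measure as quality control can be checked before running CODEML. The following Python sketch is one possible way to do so and is not part of PAML. The function names and the toy sequences are hypothetical, and the proportion of differing sites is used here only as one simple choice of divergence measure.

```python
from typing import Dict, List


def check_codon_alignment(sequences: Dict[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if not sequences:
        return ["no sequences given"]
    lengths = {len(seq) for seq in sequences.values()}
    if len(lengths) > 1:
        problems.append(f"sequences differ in length: {sorted(lengths)}")
    for name, seq in sequences.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
    return problems


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites among sites where both sequences have A, C, G, or T."""
    compared = different = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            different += x != y
    if compared == 0:
        raise ValueError("no unambiguous sites to compare")
    return different / compared


# Toy alignment, for illustration only.
alignment = {
    "sp1": "ATGAAACCC",
    "sp2": "ATGAAGCCC",
    "sp3": "ATG---CC",  # one nucleotide short
}
for problem in check_codon_alignment(alignment):
    print("problem:", problem)
print("p-distance sp1 vs sp2:", round(p_distance(alignment["sp1"], alignment["sp2"]), 3))
```

An unusually large pairwise distance in one alignment out of thousands does not prove that the alignment is wrong, but it is a cheap flag for cases that deserve a closer look.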

Running PAML programs

Before running a PAML program, please make sure that you have followed the installation instructions for your operating system. Once the PAML programs have been added to the system's path, you can run a program by typing its name on the command line. If your working directory is not the one that holds your sequence file, tree file, and control file, you need to know the relative or absolute path to that folder. If you are inexperienced or have trouble exporting paths (see Installation.md for tips on how to do this on different operating systems), you may copy the relevant executable file to the folder containing your data files and run the PAML program from there.

Note

When running CODEML, please note that you may need a data file such as grantham.dat, dayhoff.dat, jones.dat, wag.dat, mtREV24.dat, mtmam.dat, etc. You should therefore copy the file you need to the directory that holds your input files and control file, and set the variable aaRatefile in the control file to its name. You can find these files in the dat directory, which is available on your file system once you clone the repository or download the latest release. Alternatively, you can set aaRatefile to the relative path of the file you want to use.
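
As an illustration of how these variables fit together, the following Python sketch writes a fragment of a control file for an amino acid analysis. It is a sketch under assumptions, not a complete or recommended setup: seqtype, aaRatefile, and cleandata are the variables discussed in this document, while seqfile, treefile, and outfile are assumed here to be the control variables naming the input and output files, and all file names and paths are hypothetical. A real analysis needs further variables, so start from the control files in the examples/ folder.

```python
from typing import Dict, Union


def render_control_file(settings: Dict[str, Union[str, int]]) -> str:
    """Render settings as 'name = value' lines, one control variable per line."""
    return "".join(f"{name} = {value}\n" for name, value in settings.items())


# Hypothetical file names; adjust them to your own data.
settings = {
    "seqfile": "alignment.phy",       # assumed variable name for the sequence file
    "treefile": "tree.nwk",           # assumed variable name for the tree file
    "outfile": "results.txt",         # assumed variable name for the main output file
    "seqtype": 2,                     # 2 = amino acid sequences (aaml)
    "aaRatefile": "../dat/wag.dat",   # relative path to a rate file in the dat directory
    "cleandata": 0,                   # 0 = keep sites with gaps or ambiguity characters
}

text = render_control_file(settings)
print(text, end="")
# To save the fragment: open("codeml.ctl", "w").write(text)
```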

Important

Some PAML programs produce result files with names such as rub, lnf, rst, or rates. You should not use these names (or any other names that PAML programs use for their output files) for your own files. Otherwise, your files will be overwritten!
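
A quick check before the first run in a folder can help avoid this. The Python sketch below lists files whose names clash with the output names mentioned above. The function name is hypothetical, and the list of reserved names covers only the four examples given here, not every output file that PAML programs may create.

```python
from pathlib import Path
from typing import Iterable, List

# Only the names mentioned in this document; PAML programs write other files too.
RESERVED_NAMES = ("rub", "lnf", "rst", "rates")


def find_name_clashes(folder: str, reserved: Iterable[str] = RESERVED_NAMES) -> List[str]:
    """Return the sorted names of files in folder that match a reserved output name."""
    reserved_set = set(reserved)
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.name in reserved_set
    )


clashes = find_name_clashes(".")
if clashes:
    print("These files may be overwritten by a PAML run:", ", ".join(clashes))
else:
    print("No clashes with the checked output names.")
```

Note that after a PAML run these files are expected to exist, so the check is only meaningful in a folder that has not yet been used for an analysis.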

Example data sets

The examples/ folder contains many example data sets. They were used in the original papers to test the new methods, and I included them so that you could duplicate our results in the papers. Sequence alignments, control files, and detailed readme files are included. They are intended to help you get familiar with the input data formats and with interpretation of the results, and also to help you discover bugs in the program. If you are interested in a particular analysis, get a copy of the paper that described the method and analyse the example dataset to duplicate the published results. This is particularly important because the manual, as it is written, describes the meanings of the control variables used by the programs but does not clearly explain how to set up the control file to conduct a particular analysis.

  • examples/HIVNSsites/: This folder contains example data files for the HIV-1 env V3 region analysed in Yang et al. (2000b). The data set is for demonstrating the NSsites models described in that paper, that is, models of variable $\omega$ ratios among amino acid sites. Those models are called the “random-sites” models by Yang & Swanson (2002) since a priori we do not know which sites might be highly conserved and which under positive selection. They are also known as “fishing-expedition” models. The included data set is the 10th data set analysed by Yang et al. (2000b), and the results are in table 12 of that paper. Look at the README.txt file in that folder.
  • examples/lysin/: This folder contains the sperm lysin genes from 25 abalone species analysed by Yang, Swanson & Vacquier (2000a) and Yang and Swanson (2002). The data set is for demonstrating both the “random-sites” models (as in Yang, Swanson & Vacquier (2000a)) and the “fixed-sites” models (as in Yang and Swanson (2002)). In the latter paper, we used structural information to partition amino acid sites in the lysin into the “buried” and “exposed” classes and assigned and estimated different $\omega$ ratios for the two partitions. The hypothesis is that the sites exposed on the surface are likely to be under positive selection. Look at the README.txt file in that folder.
  • examples/lysozyme/: This folder contains the primate lysozyme c genes of Messier and Stewart (1997), re-analysed by Yang (1998). This is for demonstrating codon models that assign different $\omega$ ratios for different branches in the tree, useful for testing positive selection along lineages. Those models are sometimes called branch models or branch-specific models. Both the “large” and the “small” data sets in Yang (1998) are included. Those models require the user to label branches in the tree, and the readme file and included tree file explain the format in great detail. See also the section “Tree file and representations of tree topology” later about specifying branch/node labels. The lysozyme data set was also used by Yang and Nielsen (2002) to implement the so-called “branch-site” models, which allow the $\omega$ ratio to vary both among lineages and among sites. Look at the README.txt file to learn how to run those models.
  • examples/MouseLemurs/: This folder includes the mtDNA alignment that Yang and Yoder (2003) analysed to estimate divergence dates in mouse lemurs. The data set is for demonstrating maximum likelihood estimation of divergence dates under models of global and local clocks. The most sophisticated model described in that paper uses multiple calibration nodes simultaneously, analyses multiple genes (or site partitions) while accounting for their differences, and also accounts for variable rates among branch groups. The README.txt file explains the input data format as well as model specification in detail. The README2.txt file explains the ad hoc rate smoothing procedure of Yang (2004).
  • examples/mtCDNA/: This folder includes the alignment of 12 protein-coding genes on the same strand of the mitochondrial genome from seven ape species analysed by Yang, Nielsen, & Hasegawa (1998) under a number of codon and amino acid substitution models. The data set is the “small” data set referred to in that paper, and was used to fit both the “mechanistic” and empirical models of amino acid substitution as well as the “mechanistic” models of codon substitution. The model can be used, for example, to test whether the rates of conserved and radical amino acid substitutions are equal. See the README.txt file for details.
  • examples/TipDate.HIV2/: This folder includes the alignment of 33 SIV/HIV-2 sequences, compiled and analysed by Lemey et al. (2003) and re-analysed by Stadler and Yang (2013). The README.txt file explains how to duplicate the ML and Bayesian results published in that paper. Note that the sample date is the last field in the sequence name.

Some other data files are included in the package as well. The details follow:
