# `evolver` This program generates a naïve menu such as the one shown below: ```text (1) Get random UNROOTED trees? (2) Get random ROOTED trees? (3) List all UNROOTED trees into file trees? (4) List all ROOTED trees into file trees? (5) Simulate nucleotide data sets (use MCbase.dat)? (6) Simulate codon data sets (use MCcodon.dat)? (7) Simulate amino acid data sets (use MCaa.dat)? (8) Calculate identical bi-partitions between trees? (9) Calculate clade support values (evolver 9 treefile maintreefile )? (0) Quit? ``` * **Option 1, 2, 3, 4**: the program can be used to generate a random tree, either unrooted or rooted, either with or without branch lengths. It can also list all the trees for a fixed number of species. Of course, you should do this for a small number of species only as otherwise your hard drive will be filled by useless trees. * **Options 5, 6, 7 (Simulations)**: the program simulates nucleotide, codon, and amino acid sequences with user-specified tree topology and branch lengths. Parameters for simulation are specified by the user in three control files: [`MCbase.txt`](https://github.com/abacus-gene/paml/blob/master/examples/MCbase.txt) (or [`MCbase.dat`](https://github.com/abacus-gene/paml/blob/master/examples/MCbase.dat)), [`MCcodon.txt`](https://github.com/abacus-gene/paml/blob/master/examples/MCcodon.txt) (or [`MCcodon.dat`](https://github.com/abacus-gene/paml/blob/master/examples/MCcodon.dat)), and [`MCaa.txt`](https://github.com/abacus-gene/paml/blob/master/examples/MCaa.txt) (or [`MCaa.dat`](https://github.com/abacus-gene/paml/blob/master/examples/MCaa.dat)) for simulating nucleotide, codon, and amino acid sequences, respectively. The `MCbase.dat` file is shown below, but **please have a look at the other files depending on what character data you want to simulate**. ```text 1 * 0,1:seqs or patterns in PAML format (mc.paml); 2:PAUP format (mc.nex); 3:PAUP JC69 format -123 * random number seed (odd number) 5 895 100 * <# seqs> <# nucleotide sites> <# replicates> -1 * ((Human: 0.056647, Chimpanzee: 0.071554): 0.028080, Gorilla: 0.075518, (Orangutan: 0.259783, Gibbon: 0.400727): 0.100618); * this is your input tree file used for simulating data! 7 * model: 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85, 5:T92, 6:TN93, 7:REV 0.88892 0.03190 0.00001 0.07102 0.02418 * kappa or rate parameters in model 0.2500 5 * <#categories for discrete gamma> 0.25318 0.32894 0.31196 0.10592 * base frequencies T C A G ================================================== The rest of this data file are notes, ignored by the program evolver. Change the values of the parameters, but do not delete them. evolver simulates nucleotide sequences under the REV+Gamma model and its simpler forms. More notes: Parameter kappa or rate parameters in the substituton model: For TN93, two kappa values are required, while for REV, 5 values (a,b,c,d,e) are required (see Yang 1994 for the definition of these parameters). The kappa parameter is defined differently under HKY85 (when k=1 means no transition bias) and under F84 (when k=0 means no bias). JC69 and F81 are considered species cases of HKY85, so use 1 for kappa for those two models. Notation is from my two papers in JME in 1994. Use equal base frequencies (0.25) for JC69 and K80. Use 0 for alpha to have the same rate for all sites. Use 0 for <#categories for discrete gamma> to use the continuous gamma =========!! Check screen output carefully !! ===== ``` > **[[ IMPORTANT ]]**
> Please note that we only show an example to simulate alignments with nucleotide data, check the [`examples` folder](https://github.com/abacus-gene/paml/blob/master/examples/) and explore the other control files to simulate codon and amino acid data as aforementioned. The program generates multiple data sets in one file in either `PAML` (i.e., either simulation nucleotide sequences if the option is `0` or simulate site patterns if option `1`) or PAUP\* (i.e., variables `2` for PAUP format or `3` for PAUP JC69 format) format. If you choose the PAUP\* format, the program will look for files with the following names: `paupstart` (which the program copies to the start of the data file), `paupblock` (which the program copies to the end of each simulated data set), and `paupend` (which the program incorporates at the end of the file). This makes it possible to use PAUP\* to analyse all data sets in one run. Run the default options while watching out for screen output. Note that the first block of the file defines the parameters to simulate the data with `evolver`, while the rest are notes. The tree length is the expected number of substitutions per site along all branches in the phylogeny, calculated as the sum of the branch lengths. This variable was introduced when I was doing simulations to evaluate the effect of sequence divergence while keeping the shape of the tree fixed. `evolver` will scale the tree so that the branch lengths sum up to the specified tree length. If you use option `–1` for the tree length (as shown in the example above), the program will use the branch lengths given in the specified tree without re-scaling. Also, note that the base frequencies have to be **in a fixed order**; this is the same for the amino acid and codon frequencies in `MCaa.dat` and `MCcodon.dat`. It may be useful to run `BASEML` or `CODEML` with some empirical data to extract the estimated values for these parameters so that you can use them in `evolver` to simulate your data. You need to write them in the `evolver` control file in the same order they are output in `BASEML` or `CODEML`. * **Option 8**: this option is used for reading many trees from a tree file and then calculating bipartition distances either between the first and all the remaining trees or between every pair. * **Option 9 (clade support values)**: this option can be used to summarise bootstrap or Bayesian analyses. This option can be used to read two tree files. The first file should include one tree, say, the maximum likelihood tree. You should have the number of species and the number of tree (should be 1) at the beginning of this file (i.e., PHYLIP format, ` 1`). The second tree file should include a collection of trees, such as 1000 maximum likelihood trees estimated from 1000 bootstrap pseudo-samples. This option will then calculate the bootstrap support value for each clade on the ML tree in the first tree file, that is, the proportion of trees in the second file that contain the node or clade in the tree in the first file. The second tree file does not have to have the numbers of species and trees on the first line. If you run other phylogenetic software (e.g., `MrBayes`, `RAxML`, `IQTREE`, etc.), you can move the maximum likelihood tree or maximum a posteriori tree into the first file. In case of `MrBayes`, the second tree file can be the `.t` file generated by `MrBayes`, with no change necessary. You can use the output files from other software with the bootstrap pseudo-samples and input as the second tree file. Right now species are represented by numbers only in the tree files, I think. You can choose this option by running `evolver`, then option 9. The program will then ask you to input two file names. An alternative way, which bypasses the naïve menu, is to put the option and two file names at the command line: ```sh evolver 9 ``` > [!NOTE] > In the command shown above, please replace `` with the path to your main input tree file and `` with the path to your second tree file. ## Simulating phylogenies under relaxed-clock models If you want to **simulate phylogenies under relaxed-clock models**, you should use the **[`simclock` R package](https://github.com/dosreislab/simclock)** developed by Mario dos Reis. All the details regarding installation and usage can be found in the [`simclock` GitHub repository](https://github.com/dosreislab/simclock) as well as in the help manual available in the R package. ## Notes on possible warning/error messages when simulating large genomic datasets If you are trying to simulate large genomic datasets that have more than 1000000 sites (i.e., simulating datasets with $\geq 1000001$ sites), you may encounter one of these two errors or similar: ```text1 # Error 1 *** buffer overflow detected ***: terminated` or a such as ``` ```text # Error 2 (showing just relevant lines, not all the messages printed out on the screen) evolv*** Error in evolver: realloc(): invalid old size: 0x00002b14f12de010 *** [...] requested site pattern counts as output for site-class model. [..]: Aborted ``` In order to fix this problem, you will have to edit the source code [`evolver.c`](https://github.com/abacus-gene/paml/blob/master/src/evolver.c) and recompile. At the time of writing (January 2024), `evolver` assumes a maximum of 1M sites. For site-rates model (such as the dG model), the program uses the space to store the site rates. One double is 8 bytes, so 8M is for 1M sites. That is the part of the code that will need to be updated! You can open [`evolver.c`](https://github.com/abacus-gene/paml/blob/master/src/evolver.c) with your favourite text editor and find the following two lines: ```text sspace = max2(sspace, 8000000); space = (double*)malloc(sspace); ``` Once you have located them, please replace them with the following three lines (i.e., there is a new line in the middle, the first and the third line are the same as above): ```text sspace = max2(sspace, 8000000); sspace = max2(sspace, sizeof(double) * com.ls); space = (double*)malloc(sspace); ``` Once you have made these changes, please save the file and recompile the program. > [!NOTE] > If what you want to simulate site patterns (i.e., you specified option `1` in the first line of the control file), please note that you safely disregard the warning message `requested site pattern counts as output for site-class model`.