Input data

Alexey Kozlov edited this page Dec 4, 2018 · 57 revisions

Analysis type

RAxML-NG supports several types of analysis, which can be selected by specifying a corresponding command:

| Command | RAxML 8.x equivalent | Meaning |
|---|---|---|
| --search | -f d | Run topology search to find the best-scoring ML tree (default) |
| --evaluate | -f e | Optimize model parameters and/or branch lengths on a fixed tree topology |
| --loglh | N/A | Compute log-likelihood of a given tree without any optimization |
| --bootstrap | -b | Run non-parametric bootstrap analysis (equivalent to 'slow' bootstrapping in RAxML). Number of bootstrap replicates and other parameters can be changed with respective options. |
| --all | -f a* | Combined tree search and bootstrapping analysis; bootstrap support values will be plotted onto the best-scoring ML tree. |
| --support | -f b | Compute bipartition support for a given reference tree (e.g., best ML tree) using an existing set of replicate trees (e.g., bootstrap trees obtained with --bootstrap option above). Usage: raxml-ng --support --tree bestML.tree --bs-trees bootstraps.tree |
| --bsconverge | -I | A posteriori bootstrap convergence test. Usage: raxml-ng --bsconverge --bs-trees bootstraps.tree --bs-cutoff 0.03 |
| --check | -f c | Check alignment file and remove any columns consisting entirely of gaps |
| --parse | N/A | Parse alignment, compress patterns and create binary MSA file |
| --start | -y | Generate parsimony/random starting trees and exit |
| --terrace | N/A | Check whether a tree lies on a phylogenetic terrace. Usage: raxml-ng --terrace --tree best.tre --msa ali.fa --model partition.txt |

* Unlike in RAxML 8.x, this command performs the 'slow' bootstrapping procedure.
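
As a rough illustration of how a command from the table combines with its options, the sketch below assembles argument lists in Python for the two usage lines shown for --support and --bsconverge. The helper name `raxml_ng_args` is hypothetical, only flags that appear on this page are used, and nothing is executed.

```python
def raxml_ng_args(command, **options):
    """Build an argv list such as ['raxml-ng', '--support', '--tree', 'bestML.tree'].

    `command` is an analysis command from the table, given without dashes.
    Keyword options are turned into long flags, with '_' replaced by '-'.
    """
    args = ["raxml-ng", f"--{command}"]
    for name, value in options.items():
        args += [f"--{name.replace('_', '-')}", str(value)]
    return args


# File names mirror the usage lines in the table above.
support = raxml_ng_args("support", tree="bestML.tree", bs_trees="bootstraps.tree")
converge = raxml_ng_args("bsconverge", bs_trees="bootstraps.tree", bs_cutoff=0.03)

print(" ".join(support))
print(" ".join(converge))
```

A list built this way could be handed to `subprocess.run` once the raxml-ng binary is on the PATH; keeping it as a list avoids shell quoting issues with file names.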

Multiple sequence alignment

Option: --msa FILE (mandatory)

RAxML-NG supports alignments in FASTA, non-interleaved PHYLIP and CATG formats.

By default, RAxML-NG tries to detect the alignment format automatically from the file contents. This usually works, but you can also specify the format explicitly with the --msa-format option.
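
The detection itself is internal to RAxML-NG. Purely to illustrate what content-based detection can look like, the following sketch tells FASTA from PHYLIP by inspecting the first non-empty line. The function `guess_msa_format` is hypothetical, does not cover CATG, and is not RAxML-NG's actual logic.

```python
def guess_msa_format(text):
    """Guess 'fasta' or 'phylip' from the first non-empty line, else None."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"      # FASTA records begin with '>' and a sequence name
        fields = line.split()
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            return "phylip"     # PHYLIP header: <number of taxa> <number of sites>
        return None             # unrecognized: state the format explicitly instead
    return None
```

When a heuristic like this returns `None`, the safer route is the one described above: pass the format explicitly rather than rely on guessing.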

Evolutionary model

Option: --model STRING | FILE (mandatory)

The evolutionary model can be specified globally (i.e., for the whole alignment), or different models can be assigned to different subsets of alignment columns (a so-called partitioned analysis).

Single model

A global, per-alignment evolutionary model can be given as a string on the command line. A model specification always starts with a substitution matrix name, e.g., GTR for DNA data or LG for protein data. Several optional modifiers can be appended, separated by + and in arbitrary order. This notation is inspired by, and mostly compatible with, the model specification used in the IQ-Tree program (Nguyen et al. 2015).

NOTE: all per-state values (e.g., base frequencies) must be given in the order listed in the "State encoding & order" section below.

All substitution matrices and modifiers are summarized in the following table:

| Modifier | Possible values |
|---|---|
| Substitution matrix | DNA data: JC, K80, F81, HKY, TN93ef, TN93, K81, K81uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1, TIM1uf, TIM2, TIM2uf, TIM3, TIM3uf, TVMef, TVM, SYM, GTR |
| | Protein data*: Dayhoff, LG, DCMut, JTT, mtREV, WAG, RtREV, CpREV, VT, Blosum62, MtMam, MtArt, MtZoa, PMB, HIVb, HIVw, JTT-DCMut, FLU, StmtREV, LG4M (implies +G4), LG4X (implies +R4), PROTGTR |
| | Binary data (0/1): BIN |
| | Morphological/multistate: MULTIx_MK, MULTIx_GTR (where x = number of states, e.g., MULTI8_MK for an 8-state model with equal rates); see "State encoding & order" |
| | Unphased diploid genotypes (10 states): GTJC, GTHKY4, GTGTR4, GTGTR |
| | Fixed user-defined rates: e.g., HKY{1.0/2.5} or GTR{0.5/2.0/1.0/1.2/0.1/1.0} |
| Stationary frequencies | +F or +FC (empirical) |
| | +FO (ML estimate) |
| | +FE (equal) |
| | +FU{f1/f2/../fn} (user-defined: f1 f2 ... fn) |
| Proportion of invariant sites | +I or +IO (ML estimate) |
| | +IC (empirical) |
| | +IU{p} (user-defined: p) |
| Among-site rate heterogeneity model | +G (discrete GAMMA with 4 categories, mean category rates, ML estimate of alpha) |
| | +GA (as above, but with median category rates) |
| | +Gn (discrete GAMMA with n categories, ML estimate of alpha) |
| | +Gn{a} (discrete GAMMA with n categories and user-defined alpha a) |
| | +Rn (FreeRate with n categories, ML estimate of rates and weights) |
| | +Rn{r1/r2/../rn}{w1/w2/../wn} (FreeRate with n categories, user-defined rates r1 r2 ... rn and weights w1 w2 ... wn) |
| Ascertainment bias correction | +ASC_LEWIS (Lewis' method) |
| | +ASC_FELS{w} (Felsenstein's method with total number of invariable sites w) |
| | +ASC_STAM{w1/w2/../wn} (Stamatakis' method with per-state invariable site numbers w1 w2 ... wn) |

* see libpll wiki for details & references
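
To make the "matrix name first, then + separated modifiers" structure concrete, here is a minimal Python sketch that splits a model string into its parts. The function `split_model` is a hypothetical helper, not part of RAxML-NG; it only separates the pieces and does not check them against the table above.

```python
def split_model(spec):
    """Split 'GTR+G4{0.5}+I' into ('GTR', ['G4{0.5}', 'I']).

    A '+' inside curly brackets (e.g. a value written as 1e+2) is not
    treated as a separator.
    """
    parts, current, depth = [], "", 0
    for ch in spec:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "+" and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    matrix, modifiers = parts[0], parts[1:]
    if not matrix:
        raise ValueError("a model must start with a substitution matrix name")
    return matrix, modifiers


print(split_model("HKY{1.0/2.5}+FU{0.3/0.2/0.2/0.3}+G4{0.5}"))
```

Because modifiers may appear in arbitrary order, a caller would typically look them up by prefix (F, I, G, R, ASC_) rather than by position.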

Multiple models

Multiple models can be defined in a RAxML-style partition file. Example:

JC+G, p1 = 1-100, 252-400
HKY+F, p2 = 101-180, 251
GTR+I, p3 = 181-250

Here, each line defines a partition and consists of three elements:

  • model specification (see above)
  • partition name
  • range of alignment columns
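
The three elements can be pulled apart with a few lines of Python. This is a minimal sketch under the assumption that each line contains only plain ranges and single columns, as in the example above; `parse_partition_line` is a hypothetical helper, not RAxML-NG's own parser.

```python
def parse_partition_line(line):
    """Parse 'JC+G, p1 = 1-100, 252-400' into (model, name, [(start, end), ...])."""
    model, rest = line.split(",", 1)        # model specification comes first
    name, columns = rest.split("=", 1)      # partition name, then column ranges
    ranges = []
    for chunk in columns.split(","):
        start, _, end = chunk.strip().partition("-")
        ranges.append((int(start), int(end or start)))  # '251' is a single column
    return model.strip(), name.strip(), ranges


example = [
    "JC+G, p1 = 1-100, 252-400",
    "HKY+F, p2 = 101-180, 251",
    "GTR+I, p3 = 181-250",
]
partitions = [parse_partition_line(line) for line in example]

# Count the alignment columns assigned across all partitions.
total_columns = sum(end - start + 1 for _, _, ranges in partitions for start, end in ranges)
print(partitions)
print(total_columns)
```

Counting the assigned columns, as in the last step, is a quick way to spot gaps or overlaps before starting a long run.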

NOTE: In RAxML, certain model modifiers were global (e.g., the GAMMA model of rate heterogeneity), so they were specified on the command line and not in the partition file. In RAxML-NG, this limitation has been lifted: it is now possible to combine partitions with and without GAMMA, proportion of invariant sites, etc. (as in the example above). However, this means that RAxML partition files might need to be adjusted for RAxML-NG (e.g., by adding +G for the partitions where the GAMMA model of rate heterogeneity should be used).
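
The adjustment mentioned in the note can be scripted. The sketch below appends +G to a partition line unless its model already carries a +G or +R modifier. It assumes the model field already holds a RAxML-NG substitution matrix name, and it does not know about matrices that imply a rate model (LG4M, LG4X); `add_gamma` is a hypothetical helper.

```python
def add_gamma(partition_line):
    """Append +G to the model of one partition line if it has no rate model yet."""
    model, rest = partition_line.split(",", 1)
    model = model.strip()
    modifiers = model.split("+")[1:]                     # everything after the matrix name
    has_rate_model = any(m[:1] in ("G", "R") for m in modifiers)
    if not has_rate_model:
        model += "+G"
    return f"{model},{rest}"


print(add_gamma("GTR+I, p3 = 181-250"))         # model becomes GTR+I+G
print(add_gamma("JC+G, p1 = 1-100, 252-400"))   # left unchanged
```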

Branch length linkage

For a partitioned analysis, three branch length estimation modes are available:

| Command | Meaning |
|---|---|
| --brlen linked | Branch lengths are identical for all partitions (default) |
| --brlen scaled | Joint branch length estimation with individual per-partition scalers (i.e., branch lengths are proportional) |
| --brlen unlinked | Branch lengths are estimated independently for each partition (cf. RAxML -M option) |
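
To show what "proportional" means in the scaled mode, the following sketch derives per-partition branch lengths from one shared set of lengths and a scaler per partition. The numbers and the function `scaled_branch_lengths` are made up for illustration; how RAxML-NG estimates or normalizes the scalers is not covered here.

```python
def scaled_branch_lengths(joint_lengths, scalers):
    """Per-partition branch lengths under proportional ('scaled') linkage."""
    return {
        name: [scaler * length for length in joint_lengths]
        for name, scaler in scalers.items()
    }


joint = [0.10, 0.20, 0.05]          # one shared set of branch lengths
scalers = {"p1": 1.0, "p2": 2.0}    # hypothetical per-partition scalers
per_partition = scaled_branch_lengths(joint, scalers)
print(per_partition)
```

In these terms, the linked mode corresponds to every partition using the shared lengths unchanged, while the unlinked mode has no shared lengths at all: each partition gets its own independent list.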

Starting tree(s)

Option: --tree rand{N} | pars{N} | FILE

RAxML-NG supports three types of starting trees:

  • rand(om): start from a random topology
  • pars(imony): start from a tree generated by the parsimony-based randomized stepwise addition algorithm
  • user-defined: load a custom starting tree from a NEWICK file

For random and parsimony starting trees, you can specify the number of trees to generate in curly brackets (e.g., pars{10} or rand{20}). In this case, RAxML-NG will perform multiple tree searches (one per starting tree) and pick the best-scoring topology as the final ML tree. You can also combine parsimony and random starting trees in one run, e.g. --tree pars{10},rand{10}.
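
The syntax of the --tree value can be illustrated with a small parser. This is a sketch, not RAxML-NG's own option handling: `parse_tree_option` is hypothetical, it accepts only the short forms rand and pars, and it assumes that a type without curly brackets means a single tree.

```python
import re


def parse_tree_option(value):
    """Parse a --tree value such as 'pars{10},rand{10}' into (type, count) pairs."""
    result = []
    for item in value.split(","):
        item = item.strip()
        match = re.fullmatch(r"(rand|pars)(?:\{(\d+)\})?", item)
        if match:
            count = int(match.group(2) or 1)     # assumption: no brackets = 1 tree
            result.append((match.group(1), count))
        else:
            result.append(("file", item))        # anything else: a NEWICK file name
    return result


spec = parse_tree_option("pars{10},rand{10}")
searches = sum(count for kind, count in spec if kind != "file")
print(spec, searches)
```

For the combined example, the sum over both entries gives the number of tree searches that would be performed, one per starting tree.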

The default number of starting trees depends on the RAxML-NG version and command:

RAxML-NG v0.7.0b

| Command | Default starting trees |
|---|---|
| --search | 1 random |
| --all | 10 random + 10 parsimony |

RAxML-NG v0.7.0git >= 13.11.2018

| Command | Default starting trees |
|---|---|
| --search | 10 random + 10 parsimony |
| --search1 | 1 random |
| --all | 10 random + 10 parsimony |

Topological constraint

Option: --constraint-tree FILE

You can specify a constraint tree, e.g. to enforce monophyly of certain groups (equivalent to the -g option in RAxML 8.x). If the constraint tree is comprehensive (i.e., it includes all taxa found in the MSA), then RAxML-NG will simply resolve polytomies in the way that maximizes the likelihood. Conversely, if some taxa are missing from the constraint, they will be placed freely in the resulting ML tree.
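
Whether a constraint is comprehensive can be checked before the run by comparing its leaf labels with the taxa in the alignment. The sketch below assumes unquoted NEWICK labels and a list of MSA taxon names obtained elsewhere; `newick_leaves` and `free_taxa` are hypothetical helpers.

```python
import re


def newick_leaves(newick):
    """Return the set of unquoted leaf labels in a NEWICK string.

    A leaf label is a token that directly follows '(' or ','. Inner node
    labels follow ')' and branch lengths follow ':', so both are skipped.
    """
    return set(re.findall(r"(?<=[(,])\s*([^():,;\s]+)", newick))


def free_taxa(constraint_newick, msa_taxa):
    """Taxa present in the MSA but absent from the constraint tree."""
    return sorted(set(msa_taxa) - newick_leaves(constraint_newick))


missing = free_taxa("((A,B),C);", ["A", "B", "C", "D", "E"])
print(missing)
```

An empty result means the constraint is comprehensive; any names returned are the taxa that would be placed freely in the resulting tree.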

State encoding & order

Data type Order
DNA A C G T
PROTEIN A R N D C Q E G H I L K M F P S T W Y V
MULTISTATE 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ! " # $ % & ' ( ) * + , / : ; < = > @ [ \ ] ^ _ { | } ~
GENOTYPE (diploid unphased) A C G T M R W S Y K
(Meaning: A/A C/C G/G T/T A/C A/G A/T C/G C/T G/T)
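
Since per-state values must follow this order, it can help to generate user-defined modifiers from a name-to-value mapping instead of typing them by hand. The sketch below formats a +FU{...} modifier in table order. `user_frequencies` is a hypothetical helper, and the sum-to-one check is a sanity check added here, not a documented RAxML-NG rule.

```python
import math

# State order copied from the table above.
STATE_ORDER = {
    "DNA": "A C G T".split(),
    "PROTEIN": "A R N D C Q E G H I L K M F P S T W Y V".split(),
    "GENOTYPE": "A C G T M R W S Y K".split(),
}


def user_frequencies(data_type, freqs):
    """Format a {state: frequency} mapping as a +FU{...} modifier in table order."""
    order = STATE_ORDER[data_type]
    if set(freqs) != set(order):
        raise ValueError(f"expected exactly these states: {' '.join(order)}")
    if not math.isclose(sum(freqs.values()), 1.0, abs_tol=1e-6):
        raise ValueError("frequencies should sum to 1")
    return "+FU{" + "/".join(f"{freqs[state]:g}" for state in order) + "}"


# The mapping may be given in any order; the output follows A C G T.
modifier = user_frequencies("DNA", {"T": 0.3, "G": 0.2, "C": 0.1, "A": 0.4})
print("GTR" + modifier)
```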