Input data

Alexey Kozlov edited this page May 9, 2023 · 57 revisions

Analysis type

RAxML-NG supports several types of analysis, which can be selected by specifying a corresponding command:

Command RAxML 8.x equivalent Meaning
--all -f a* Combined tree search and bootstrapping analysis; bootstrap support values will be plotted onto the best-scoring ML tree.
--ancestral -f A Marginal ancestral state reconstruction. Usage:
raxml-ng --ancestral --msa ali.phy --tree best.tre --model HKY --prefix ASR
--bootstrap -b Run non-parametric bootstrap analysis (equivalent to 'slow' bootstrapping in RAxML). Number of bootstrap replicates and other parameters can be changed with respective options.
--bsconverge -I A posteriori bootstrap convergence test. Usage:
raxml-ng --bsconverge --bs-trees bootstraps.tree --bs-cutoff 0.03
--bsmsa -f j Generate bootstrap replicate alignments. Usage:
raxml-ng --bsmsa --msa ali.fa --model partition.txt --bs-trees 100
NOTE: you can also use the --bs-write-msa flag in combination with the --all or --bootstrap command to write out replicates
--check -f c Check alignment file and remove any columns consisting entirely of gaps
--consense -J Build a consensus tree. Usage:
raxml-ng --consense MR --tree alltrees.nw --prefix CONS
--evaluate -f e Optimize model parameters and/or branch lengths on a fixed tree topology
--loglh N/A Compute log-likelihood of a given tree without any optimization.
--parse N/A Parse alignment, compress patterns and create binary MSA file
--rfdist -f r Compute Robinson-Foulds (RF) distance between trees. Usage:
raxml-ng --rfdist --tree alltrees.nw --prefix RF
--search -f d Run topology search to find the best-scoring ML tree (default)
--sitelh -f g Print per-site log-likelihood values (NEW v1.0). Usage:
raxml-ng --sitelh --msa ali.phy --tree best.tre --model HKY --prefix siteLH
--support -f b Compute bipartition support for a given reference tree (e.g., best ML tree) using an existing set of replicate trees (e.g., bootstrap trees obtained with --bootstrap option above). Usage:
raxml-ng --support --tree bestML.tree --bs-trees bootstraps.tree
--start -y Generate parsimony/random starting trees and exit
--terrace N/A Check whether a tree lies on a phylogenetic terrace. Usage:
raxml-ng --terrace --tree best.tre --msa ali.fa --model partition.txt

* Unlike in RAxML 8.x, this command performs the 'slow' bootstrapping procedure.

Multiple sequence alignment

Option: --msa FILE (mandatory)

RAxML-NG supports alignments in FASTA, PHYLIP and CATG formats.

By default, RAxML-NG tries to detect the alignment format automatically from the file contents. This usually works, but you can also specify the alignment format explicitly with the --msa-format option.

Evolutionary model

Option: --model STRING | FILE (mandatory)

The evolutionary model can be specified globally (i.e., for the whole alignment), or different models can be assigned to different subsets of alignment columns (so-called partitioned analysis).

Single model

A global, per-alignment evolutionary model can be given as a string on the command line. A model specification always starts with a substitution matrix name, e.g., GTR for DNA data or LG for protein data. Several optional modifiers can be added, separated by + and in arbitrary order. This notation is inspired by, and mostly compatible with, the model specification in the IQ-Tree program (Nguyen et al. 2015).

NOTE: all per-state values (e.g. base frequencies) must be given in the order listed in the "State encoding & order" section below.

All substitution matrices and modifiers are summarized in the following table:

Modifier Possible values
Substitution matrix
DNA data: JC, K80, F81, HKY, TN93ef, TN93, K81, K81uf, TPM2, TPM2uf, TPM3, TPM3uf, TIM1, TIM1uf, TIM2, TIM2uf, TIM3, TIM3uf, TVMef, TVM, SYM, GTR -> see details
Protein data*: Blosum62, cpREV, Dayhoff, DCMut, DEN, FLU, HIVb, HIVw, JTT, JTT-DCMut, LG, mtART, mtMAM, mtREV, mtZOA, PMB, Q.pfam, Q.bird, Q.insect, Q…mammal, Q.plant, Q.yeast (ref.), rtREV, stmtREV, VT, WAG, LG4M (implies +G4), LG4X (implies +R4), PROTGTR
Binary data (0/1): BIN
Morphological/multistate: MULTIx_MK, MULTIx_GTR (where x = number of states, e.g., MULTI8_MK for an 8-state model with equal rates); see "State encoding & order" below
Unphased diploid genotypes (10 states): GTJC, GTHKY4, GTGTR4, GTGTR
User-defined symmetries: e.g. DNA010010 (equivalent to HKY) or MULTI5_USERabcdeabcde
Fixed user-defined rates: e.g. HKY{1.0/2.5} or GTR{0.5/2.0/1.0/1.2/0.1/1.0} or PROTGTR{rates.txt}. The rates above define upper triangle of the substitution matrix, e.g. for GTR the order is A-C, A-G, A-T, C-G, C-T, G-T
PAML format: e.g. PROTGTR{paml.txt} (lower triangle of substitution matrix + equilibrium frequencies in a single text file, example)
Stationary frequencies
+F or +FC (empirical)
+FO (ML estimate)
+FE (equal)
+FU{f1/f2/../fn} (user-defined: f1 f2 ... fn)
+FU{freqs.txt} (user-defined from file)
Proportion of invariant sites
+I or +IO (ML estimate)
+IC (empirical)
+IU{p} (user-defined: p)
Among-site rate heterogeneity model
+G or +G4m (discrete GAMMA with 4 categories, mean category rates, ML estimate of alpha)
+GA (as above, but with median category rates)
+Gn (discrete GAMMA with n categories, ML estimate of alpha)
+Gn{a} (discrete GAMMA with n categories and user-defined alpha a)
+Rn (FreeRate with n categories, ML estimate of rates and weights)
+Rn{r1/r2/../rn}{w1/w2/../wn} (FreeRate with n categories, user-defined rates r1 r2 ... rn and weights w1 w2 ... wn)
Ascertainment bias correction
+ASC_LEWIS (Lewis' method)
+ASC_FELS{w} (Felsenstein's method with total number of invariable sites w)
+ASC_STAM{w1/w2/../wn} (Stamatakis' method with per-state invariable site numbers w1 w2 ... wn)
NOTE: When using +ASC models, you have to remove all invariant sites from the MSA!
Custom character-to-state mapping
+M{statechars}{gapchars} e.g. MULTI6_GTR+M{ABCDEF}{X-?}
+Mi{statechars}{gapchars} same as above, but statechars are case-insensitive
+M{charmap.txt} mapping defined in charmap.txt file (see below)

* see libpll wiki for details & references
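
To make the notation above concrete, here is a minimal Python sketch that splits a model string into its substitution matrix and its list of modifiers. It is an illustration only, not RAxML-NG's own parser: the function name split_model is hypothetical, and the sketch assumes that a + inside curly brackets belongs to a user-defined value and must not be treated as a separator.

```python
def split_model(spec):
    """Split a model string such as 'GTR+G4+FO' into (matrix, [modifiers]).

    Assumption: '+' separates modifiers only outside of {...} blocks.
    """
    parts, current, depth = [], "", 0
    for ch in spec:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "+" and depth == 0:
            # Top-level '+': close the current token and start a new one
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)
    return parts[0], parts[1:]


print(split_model("GTR+G4+FO"))
print(split_model("HKY{1.0/2.5}+FU{0.1/0.2/0.3/0.4}+IU{0.2}"))
print(split_model("MULTI6_GTR+M{ABCDEF}{X-?}"))
```

The first element is always the substitution matrix (possibly with fixed rates in curly brackets); the remaining elements can appear in any order, as stated above.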

Multiple models

Multiple models can be defined in a RAxML-style partition file. Example:

JC+G, p1 = 1-100, 252-400
HKY+F, p2 = 101-180, 251
GTR+I, p3 = 181-250

Here, each line defines a partition and consists of three elements:

  • model specification (see above)
  • partition name
  • range of alignment columns
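
The three elements of a partition line can be extracted with a few lines of Python. The sketch below is a hedged illustration, not the parser used by RAxML-NG: the helper names parse_partition_line and covered_columns are hypothetical, and it assumes that the model string contains no comma and that column ranges are either a single column or a simple start-end pair.

```python
def parse_partition_line(line):
    """Return (model, name, [(start, end), ...]) for one partition line."""
    # Assumption: the first comma ends the model specification
    model, rest = line.split(",", 1)
    name, columns = rest.split("=", 1)
    ranges = []
    for chunk in columns.split(","):
        chunk = chunk.strip()
        if "-" in chunk:
            start, end = chunk.split("-")
            ranges.append((int(start), int(end)))
        else:
            # A single column such as '251' is a range of length one
            ranges.append((int(chunk), int(chunk)))
    return model.strip(), name.strip(), ranges


def covered_columns(partitions):
    """List every alignment column (1-based) assigned to some partition."""
    return [col
            for _, _, ranges in partitions
            for start, end in ranges
            for col in range(start, end + 1)]


EXAMPLE = """JC+G, p1 = 1-100, 252-400
HKY+F, p2 = 101-180, 251
GTR+I, p3 = 181-250"""

partitions = [parse_partition_line(ln) for ln in EXAMPLE.splitlines()]
for model, name, ranges in partitions:
    print(name, model, ranges)
```

For the example file above, the three partitions together cover columns 1 to 400 exactly once, which is a useful sanity check before starting an analysis.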

NOTE: In RAxML, certain model modifiers were global (e.g., GAMMA model of rate heterogeneity), and thus they were specified on the command line and not in the partition file. In RAxML-NG, this limitation was lifted, i.e. it is now possible to combine partitions with and without GAMMA, proportion of invariant sites etc. (as in the example above). However, this means that RAxML partition files might need to be adjusted for RAxML-NG (e.g., by adding +G for the partitions where the GAMMA model of rate heterogeneity should be used).

Branch length linkage

For partitioned analyses, three branch length estimation modes are available:

Command Meaning
--brlen linked Branch lengths are identical for all partitions
--brlen scaled (default) Joint branch length estimation with individual per-partition scalers (i.e., branch lengths are proportional)
--brlen unlinked Branch lengths are estimated independently for each partition (cf. RAxML -M option)

Starting tree(s)

Option: --tree rand{N} | pars{N} | FILE

RAxML-NG supports three types of starting trees:

  • rand(om): start from a random topology
  • pars(imony): start from a tree generated by the parsimony-based randomized stepwise addition algorithm
  • user-defined: load a custom starting tree from a NEWICK file

For random and parsimony trees, you can specify the number of trees to generate in curly brackets (e.g., pars{10} or rand{20}). In this case, RAxML-NG will perform multiple tree searches (one per starting tree) and pick the best-scoring topology as the final ML tree. You can also combine parsimony and random starting trees in one run, e.g. --tree pars{10},rand{10}.
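
As a rough illustration of how such a --tree value decomposes, the following Python sketch turns it into a list of (type, value) pairs. The function name parse_start_trees is hypothetical. Two assumptions are made that the text above does not state: a bare rand or pars without curly brackets is taken to mean one tree, and anything else is treated as a file name.

```python
import re


def parse_start_trees(spec):
    """Parse a --tree value like 'pars{10},rand{10}' into (type, value) pairs."""
    result = []
    for item in spec.split(","):
        item = item.strip()
        match = re.fullmatch(r"(rand|pars)(?:\{(\d+)\})?", item)
        if match:
            # Assumption: no {N} suffix means a single starting tree
            count = int(match.group(2) or 1)
            result.append((match.group(1), count))
        else:
            # Assumption: any other value is a path to a NEWICK file
            result.append(("file", item))
    return result


print(parse_start_trees("pars{10},rand{10}"))
print(parse_start_trees("best.tre"))
```

With pars{10},rand{10}, a run would therefore perform 20 tree searches in total, one per starting tree.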

The default number of starting trees depends on the RAxML-NG version and command:

Command | v1.2.0 and later | v0.8.0 and later | v0.7.0 and before
--search | 10 random + 10 parsimony | 10 random + 10 parsimony | 1 random
--search1 | 1 parsimony | 1 random | N/A
--all | 10 random + 10 parsimony | 10 random + 10 parsimony | 10 random + 10 parsimony
--bootstrap | up to 1000 parsimony | up to 1000 random | 100 random

Topological constraint

Option: --tree-constraint FILE

You can specify a constraint tree to, e.g., enforce monophyly of certain groups (equivalent to the -g option in RAxML 8.x). If the constraint tree is comprehensive (i.e., it includes all taxa found in the MSA), then RAxML-NG will simply resolve polytomies in the way that maximizes the likelihood. Conversely, if some taxa are missing from the constraint, they will be placed freely in the resulting ML tree.

Outgroup rooting

Option: --outgroup o1,o2,..,oN

You can specify an outgroup which RAxML-NG will use to root the inferred ML tree. It can be a single taxon (--outgroup Human) or a list of taxa which form a monophyletic group (--outgroup Human,Chimp,Gorilla).

Please note that outgroup rooting is just a drawing option and will not affect the tree inference process in any way!

Alignment column weights

Option: --site-weights FILE (NEW v1.0)

You can specify a text file with (external) alignment column weights, which should be positive integer numbers. For instance, for a 5-column alignment, the weight file could be:

10 2 4 1 10

Before using this option, please make sure you understand how it works! In particular, it does not merely multiply per-site log-likelihoods by the corresponding weights. Instead, specifying the weight file above effectively results in analyzing an expanded alignment in which the 1st column is duplicated 10 times, the 2nd column 2 times, the 3rd column 4 times etc. So apart from per-site likelihoods, column weights will also affect bootstrap replicate generation, parsimony starting trees, equilibrium state frequencies etc. Please consider carefully whether this treatment makes sense for your data and intended use case.
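
The "expanded alignment" interpretation can be made explicit with a small Python sketch. It only illustrates the described semantics and is not how RAxML-NG handles weights internally. The helper names and the one-sequence toy alignment are hypothetical, and the sketch assumes that the weight file holds whitespace-separated integers.

```python
def parse_weights(text):
    """Read whitespace-separated positive integer weights."""
    weights = [int(tok) for tok in text.split()]
    if any(w <= 0 for w in weights):
        raise ValueError("site weights must be positive integers")
    return weights


def expand_alignment(seqs, weights):
    """Repeat alignment column i weights[i] times in every sequence."""
    expanded = {}
    for name, seq in seqs.items():
        if len(seq) != len(weights):
            raise ValueError("need exactly one weight per alignment column")
        expanded[name] = "".join(ch * w for ch, w in zip(seq, weights))
    return expanded


weights = parse_weights("10 2 4 1 10")
toy = {"t1": "ACGTA"}  # hypothetical 5-column alignment
big = expand_alignment(toy, weights)

print(big["t1"], len(big["t1"]))
# Empirical frequency of A: unweighted vs. weighted
print(toy["t1"].count("A") / len(toy["t1"]),
      big["t1"].count("A") / len(big["t1"]))
```

In this toy case the weighted alignment has 27 columns instead of 5, and the empirical frequency of A rises from 2/5 to 20/27, which shows why weights influence more than the per-site likelihoods.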

NOTE: This is equivalent to the -a option of RAxML 8.x, which also worked as described above!

NOTE: Please always try to normalize site weights to avoid large values (say, >1000), since they can cause numerical problems and reduce speed (e.g., in parsimony computation).

State encoding & order

Defaults

Data type Order
DNA A C G T
PROTEIN A R N D C Q E G H I L K M F P S T W Y V
MULTISTATE 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ! " # $ % & ' ( ) * + , / : ; < = > @ [ \ ] ^ _ { | } ~
GENOTYPE (diploid unphased) A C G T M R W S Y K
(Meaning: A/A C/C G/G T/T A/C A/G A/T C/G C/T G/T)

User-defined state encoding

RAxML-NG allows you to define a custom state encoding (that is, a mapping between observed alignment characters and internal model states) via a file in the format described below. This format supports ambiguities and synonyms. For instance, the classical encoding for DNA data can be specified as follows:

19    4
ACGTURYSWKMBDHVN-.?
Ade Cyt Gua Thy
A    1,0,0,0
C    0,1,0,0
G    0,0,1,0
T    0,0,0,1
U    0,0,0,1
R    1,0,1,0
Y    0,1,0,1
S    0,1,1,0
W    1,0,0,1
K    0,0,1,1
M    1,1,0,0
B    0,1,1,1
D    1,0,1,1
H    1,1,0,1
V    1,1,1,0
N    1,1,1,1
-    1,1,1,1
.    1,1,1,1
?    1,1,1,1

Here,

  • line #1: number of observed states (16 + 3 extra gaps) and number of model states (4: Adenine, Cytosine, Guanine and Thymine/Uracil)
  • line #2: characters used to encode the 19 observed states in the alignment (in arbitrary order)
  • line #3: space-separated codes for the 4 model states. Please note that those codes are used for output purposes only. However, the order of states is important, since it will be used to interpret the following lines in this file, as well as user-defined state frequencies (+FU), substitution rates etc.
  • lines #4 to #22: mapping of each observed state to a subset of model states (likelihood vector)

In order to define the custom state encoding, this character mapping file must be used with the +M model modifier (see above).
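
A mapping file in this format is easy to check before passing it to +M. The Python sketch below reads the DNA example from above and verifies that the counts in line #1 match the rest of the file. It is a hedged illustration only: parse_charmap is a hypothetical helper, and the validation rules are inferred from the description above rather than taken from RAxML-NG.

```python
DNA_CHARMAP = """\
19    4
ACGTURYSWKMBDHVN-.?
Ade Cyt Gua Thy
A    1,0,0,0
C    0,1,0,0
G    0,0,1,0
T    0,0,0,1
U    0,0,0,1
R    1,0,1,0
Y    0,1,0,1
S    0,1,1,0
W    1,0,0,1
K    0,0,1,1
M    1,1,0,0
B    0,1,1,1
D    1,0,1,1
H    1,1,0,1
V    1,1,1,0
N    1,1,1,1
-    1,1,1,1
.    1,1,1,1
?    1,1,1,1
"""


def parse_charmap(text):
    """Return (state_names, {char: likelihood_vector}) after basic checks."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_observed, n_states = map(int, lines[0].split())
    chars = lines[1].strip()
    names = lines[2].split()
    mapping = {}
    for ln in lines[3:]:
        char, vector = ln.split()
        mapping[char] = [float(x) for x in vector.split(",")]
    # Consistency checks derived from the format description
    if len(chars) != n_observed or len(names) != n_states:
        raise ValueError("header counts do not match lines #2 and #3")
    if set(mapping) != set(chars):
        raise ValueError("mapping lines do not match the characters in line #2")
    if any(len(v) != n_states for v in mapping.values()):
        raise ValueError("every likelihood vector needs one entry per model state")
    return names, mapping


names, mapping = parse_charmap(DNA_CHARMAP)
print(names)
print("R ->", [n for n, v in zip(names, mapping["R"]) if v > 0])
```

The last line resolves the ambiguity code R to the model states it allows (Ade and Gua), using the state order from line #3.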

CATG file format

The CATG format is a simple text-based format for representing sequence uncertainty, originally proposed by Deren Eaton, the author of the PyRAD software. The format is similar to a 'transposed' PHYLIP (alignment sites are given in rows instead of columns) and allows per-state likelihoods to be specified for each alignment position. A CATG file starts with a two-line header: the first line contains the number of taxa (n) and alignment sites (m), and the second line contains a tab-separated list of n taxon names. The following m lines contain the actual alignment data for sites 1 to m. In these lines, the first column contains the consensus states of all n taxa in the IUPAC encoding (one character per taxon), and columns 2 to n+1 contain comma-separated lists of per-state likelihoods for the respective taxa (in the same order as given in the 2nd header line). Although the consensus information is redundant, it improves the readability of the file. Likelihood values must be given in the C, A, T, G order for DNA; for other datatypes please see here.

Sample CATG file:

5 6
taxon1	taxon2	taxon3	taxon4	taxon5
GGGGG	0.1,0.1,0.3,0.5	0.1,0.3,0.2,0.4	0.3,0.3,0.0,0.4	0.0,0.2,0.1,0.7	0.3,0.3,0.0,0.4
GGGTT	0.0,0.0,0.3,0.7	0.2,0.2,0.1,0.5	0.1,0.3,0.1,0.5	0.3,0.0,0.5,0.2	0.1,0.1,0.4,0.4
GGGCG	0.1,0.3,0.1,0.5	0.3,0.3,0.1,0.3	0.2,0.0,0.3,0.5	0.5,0.0,0.1,0.4	0.2,0.2,0.2,0.4
GGGGG	0.1,0.0,0.2,0.7	0.0,0.1,0.3,0.6	0.3,0.2,0.1,0.4	0.1,0.3,0.2,0.4	0.0,0.3,0.2,0.5
GGGGT	0.2,0.2,0.1,0.5	0.2,0.2,0.2,0.4	0.1,0.2,0.3,0.4	0.3,0.0,0.2,0.5	0.0,0.1,0.7,0.2
GAAGT	0.3,0.0,0.3,0.4	0.0,0.6,0.0,0.4	0.3,0.4,0.1,0.2	0.3,0.1,0.0,0.6	0.0,0.3,0.5,0.2
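
The sample above can be read with a short Python sketch, shown here as an illustration of the file layout rather than as RAxML-NG's own reader. The helper names parse_catg and best_states are hypothetical. The sketch assumes tab-separated columns, as described above, and checks that each consensus character is among the most likely states when the likelihoods are read in C, A, T, G order.

```python
CATG_ORDER = "CATG"  # order of per-state likelihoods for DNA

SAMPLE = "\n".join([
    "5 6",
    "taxon1\ttaxon2\ttaxon3\ttaxon4\ttaxon5",
    "GGGGG\t0.1,0.1,0.3,0.5\t0.1,0.3,0.2,0.4\t0.3,0.3,0.0,0.4\t0.0,0.2,0.1,0.7\t0.3,0.3,0.0,0.4",
    "GGGTT\t0.0,0.0,0.3,0.7\t0.2,0.2,0.1,0.5\t0.1,0.3,0.1,0.5\t0.3,0.0,0.5,0.2\t0.1,0.1,0.4,0.4",
    "GGGCG\t0.1,0.3,0.1,0.5\t0.3,0.3,0.1,0.3\t0.2,0.0,0.3,0.5\t0.5,0.0,0.1,0.4\t0.2,0.2,0.2,0.4",
    "GGGGG\t0.1,0.0,0.2,0.7\t0.0,0.1,0.3,0.6\t0.3,0.2,0.1,0.4\t0.1,0.3,0.2,0.4\t0.0,0.3,0.2,0.5",
    "GGGGT\t0.2,0.2,0.1,0.5\t0.2,0.2,0.2,0.4\t0.1,0.2,0.3,0.4\t0.3,0.0,0.2,0.5\t0.0,0.1,0.7,0.2",
    "GAAGT\t0.3,0.0,0.3,0.4\t0.0,0.6,0.0,0.4\t0.3,0.4,0.1,0.2\t0.3,0.1,0.0,0.6\t0.0,0.3,0.5,0.2",
])


def parse_catg(text):
    """Return (taxa, sites); each site is (consensus, per-taxon likelihoods)."""
    lines = text.strip().splitlines()
    n_taxa, n_sites = map(int, lines[0].split())
    taxa = lines[1].split("\t")
    if len(taxa) != n_taxa:
        raise ValueError("taxon count does not match the header")
    sites = []
    for ln in lines[2:]:
        fields = ln.split("\t")
        consensus, columns = fields[0], fields[1:]
        if len(consensus) != n_taxa or len(columns) != n_taxa:
            raise ValueError("each site needs one consensus char and one vector per taxon")
        sites.append((consensus, [[float(x) for x in c.split(",")] for c in columns]))
    if len(sites) != n_sites:
        raise ValueError("site count does not match the header")
    return taxa, sites


def best_states(vector):
    """Set of states sharing the highest likelihood (ties are possible)."""
    top = max(vector)
    return {state for state, value in zip(CATG_ORDER, vector) if value == top}


taxa, sites = parse_catg(SAMPLE)
for consensus, vectors in sites:
    print(consensus, [sorted(best_states(v)) for v in vectors])
```

Each data row has n+1 columns (6 for the 5 taxa here). Ties such as 0.1,0.1,0.4,0.4 yield more than one best state, so the consensus character is only required to be one of them in this check.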