Parallelization
As of v0.7.0 beta, RAxML-NG only supports fine-grained parallelization across alignment sites. This is the same parallelization approach as used in RAxML-PTHREADS and ExaML, and it is conceptually different from coarse-grained parallelization across tree searches or tree moves as implemented in RAxML-MPI and IQTree-MPI, respectively. With fine-grained parallelization, the number of CPU cores that can be used efficiently is limited by the alignment "width" (= number of sites). For instance, using 20 cores on a single-gene protein alignment with 300 sites would be suboptimal, and using 100 cores would most likely result in a substantial slowdown. To prevent wasting CPU resources, RAxML-NG will warn you -- or, in extreme cases, even refuse to run -- if you try to assign too few alignment sites per core.
If you want to utilize many CPU cores for analyzing a small ("single-gene") alignment, please have a look at our ParGenes pipeline, which implements coarse-grained parallelization and dynamic load balancing. Alternatively, coarse-grained parallelization can easily be emulated by starting multiple RAxML-NG instances with distinct random seeds. For instance, one can infer 1000 bootstrap trees by calling RAxML-NG five times with --bs-trees 200 (e.g., on five distinct compute nodes), and then simply concatenating all resulting *.raxml.bootstraps files, as sketched below.
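A minimal sketch of this emulation, assuming the standard --bootstrap, --seed and --prefix options (file and prefix names here are illustrative; check raxml-ng --help for the exact options available in your version):

# one RAxML-NG instance per batch, each with a distinct random seed
# (in practice, each command would typically run on its own compute node)
raxml-ng --bootstrap --msa ali.fa --model GTR+G --bs-trees 200 --seed 1 --prefix b1
raxml-ng --bootstrap --msa ali.fa --model GTR+G --bs-trees 200 --seed 2 --prefix b2
raxml-ng --bootstrap --msa ali.fa --model GTR+G --bs-trees 200 --seed 3 --prefix b3
raxml-ng --bootstrap --msa ali.fa --model GTR+G --bs-trees 200 --seed 4 --prefix b4
raxml-ng --bootstrap --msa ali.fa --model GTR+G --bs-trees 200 --seed 5 --prefix b5
# merge the per-batch replicates into a single file of 1000 bootstrap trees
cat b*.raxml.bootstraps > all.raxml.bootstraps

Each batch writes its replicates to a separate *.raxml.bootstraps file, and the final cat command merges them into one file that can be used for downstream support computation.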
By default, RAxML-NG will start as many threads as there are CPU cores available in your system.
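If you want to use fewer threads, e.g., on a shared machine or to respect the sites-per-core recommendation above, you can typically set the thread count explicitly with the --threads option (a sketch, reusing the ali.fa example below):

raxml-ng --msa ali.fa --model GTR+G --threads 2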
RAxML-NG will automatically detect the best set of vector instructions available on your CPU, and use the respective computational kernels to achieve optimal performance. On modern Intel CPUs, autodetection seems to work pretty well, so most probably you don't need to worry about this. However, you can force RAxML-NG to use a specific set of vector instructions with the --simd option, e.g.
raxml-ng --msa ali.fa --model GTR+G --simd sse
to use SSE3 kernels, or
raxml-ng --msa ali.fa --model GTR+G --simd none
to use non-vectorized (scalar) kernels. This option might be useful for debugging, since distinct vectorized kernels might yield slightly different likelihood values for numerical reasons.