Subcommand: extract

Extract placements from clades of the tree and write per-clade jplace files.

Usage: gappa edit extract [options]

Options

Input
`--jplace-path`	Required. `TEXT:PATH(existing)=[] ...` List of jplace files or directories to process. For directories, only files with the extension `.jplace[.gz]` are processed.
`--clade-list-file`	Required. `TEXT:FILE` File containing a tab-separated list of taxon to clade mapping.
`--fasta-path`	`TEXT:PATH(existing)=[] ...` List of fasta files or directories to process. For directories, only files with the extension `.(fasta\|fas\|fsa\|fna\|ffn\|faa\|frn)[.gz]` are processed.
Settings
`--threshold`	`FLOAT:FLOAT in [0.5 - 1]=0.95` Threshold of how much placement mass needs to be in a clade for extracting a pquery.
`--exclude-clade-stems`	`FLAG` By default, the branch connecting a specified clade to the rest of the tree is considered part of the clade. With this option, these branches are excluded, and instead considered as basal branches.
`--basal-clade-name`	`TEXT=basal` The name of the clade used for queries that do not fall into one of the specified clades.
`--uncertain-clade-name`	`TEXT=uncertain` The name of the clade used for queries that do not fall into any clade with more than the threshold amount of their mass.
`--point-mass`	`FLAG` Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0.
Output
`--color-tree-file`	`TEXT:PATH(non-existing)` If a path is provided, an svg file with a tree colored by clades is written.
`--samples-out-dir`	`TEXT=samples` Directory to write output samples files to.
`--samples-file-prefix`	`TEXT` File prefix for samples files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--samples-file-suffix`	`TEXT` File suffix for samples files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--sequences-out-dir`	`TEXT=sequences` Directory to write output sequences files to.
`--sequences-file-prefix`	`TEXT` File prefix for sequences files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
`--sequences-file-suffix`	`TEXT` File suffix for sequences files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data.
Global Options
`--allow-file-overwriting`	`FLAG` Allow overwriting existing output files instead of aborting the command.
`--verbose`	`FLAG` Produce more verbose output.
`--threads`	`UINT` Number of threads to use for calculations.
`--log-file`	`TEXT` Write all output to a log file, in addition to standard output to the terminal.

Description

The command extracts the queries that are placed in specified clades of the reference tree and writes per-clade jplace files.

The command takes one or more jplace files as input, as well as a file describing clades of the reference tree used in the jplace files. It then finds all placements in those clades and writes per-clade placement files, each containing only those placements that have more of their mass (likelihood weight ratios) in that clade than specified by --threshold. In addition, two special clades are produced: basal, which collects all placements whose mass lies on branches that do not belong to any clade, and uncertain, which collects placements for which no clade (including the basal clade) holds more than the threshold amount of their mass (i.e., the placement has its mass distributed across multiple clades).
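The threshold rule described above can be sketched in a few lines of Python. This is only an illustration, not gappa's actual implementation: the function name `classify_pquery`, the representation of placements as `(branch_id, lwr)` pairs, and the use of `>=` for the comparison are assumptions made for this example. The `point_mass` parameter mimics the behavior documented for `--point-mass`.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def classify_pquery(
    placements: List[Tuple[int, float]],
    branch_to_clade: Dict[int, str],
    threshold: float = 0.95,
    point_mass: bool = False,
    basal_name: str = "basal",
    uncertain_name: str = "uncertain",
) -> str:
    """Return the clade name a single pquery would be sorted into.

    `placements` is a non-empty list of (branch_id, lwr) pairs (hypothetical
    layout). Branches missing from `branch_to_clade` count as basal.
    """
    if point_mass:
        # Keep only the most likely placement and give it the full mass.
        best_branch, _ = max(placements, key=lambda p: p[1])
        placements = [(best_branch, 1.0)]

    # Sum the likelihood weight ratios per clade.
    mass: Dict[str, float] = defaultdict(float)
    for branch, lwr in placements:
        mass[branch_to_clade.get(branch, basal_name)] += lwr

    # Assumption: a clade wins if its mass reaches the threshold.
    best_clade, best_mass = max(mass.items(), key=lambda kv: kv[1])
    return best_clade if best_mass >= threshold else uncertain_name


branch_to_clade = {0: "Alveolata", 1: "Alveolata", 2: "Rhizaria"}
print(classify_pquery([(0, 0.8), (1, 0.17), (2, 0.03)], branch_to_clade))
print(classify_pquery([(0, 0.6), (2, 0.4)], branch_to_clade))
print(classify_pquery([(0, 0.6), (2, 0.4)], branch_to_clade, point_mass=True))
```

Because the threshold is restricted to values of at least 0.5, at most one clade can usually hold enough mass, which is why picking the clade with the largest mass and checking it once is sufficient in this sketch.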

Furthermore, if fasta sequence files are provided, the command also creates per-clade fasta files containing the sequences that correspond to the placements in the jplace files. This requires that the sequences have the same names as the placements, which is the case if the placement files are the direct result of placing those sequences on a reference tree.
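As a rough illustration of how the inputs and outputs described above fit together, the following Python snippet assembles a command line using only options listed on this page. All file and directory names (`placements/`, `clades.tsv`, `queries.fasta`, `clades.svg`) are hypothetical placeholders, and the snippet only prints the command rather than running it.

```python
import shlex

# Hypothetical paths; replace them with your own data.
command = [
    "gappa", "edit", "extract",
    "--jplace-path", "placements/",       # directory with .jplace[.gz] files
    "--clade-list-file", "clades.tsv",    # tab-separated taxon/clade list
    "--fasta-path", "queries.fasta",      # optional: also write per-clade fasta
    "--threshold", "0.95",
    "--samples-out-dir", "samples",
    "--sequences-out-dir", "sequences",
    "--color-tree-file", "clades.svg",    # optional: svg for visual checking
]

# Print a shell-quoted version for inspection.
print(shlex.join(command))

# To actually run it (requires gappa to be installed), one could use:
# import subprocess
# subprocess.run(command, check=True)
```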

Details

The algorithm assigns a clade to each of the branches of the reference tree (either one of the specified ones, or the basal clade). For terminal branches (leaves), the assigned clade is simply as specified in the --clade-list-file. Inner branches are assigned to a clade if all leaves on one side of the split that is induced by the branch belong to the same clade. In other words, all branches of a subtree that contains only taxa from one clade are assigned to that clade. The option --exclude-clade-stems controls whether the branch that connects a clade to the rest of the tree is included in the set of branches for the clade.

See the figure below for an example.
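The branch assignment rule can also be sketched in code. The following is a simplified illustration under several assumptions, not the implementation used by gappa: the tree is given as a hypothetical `children` dictionary with named inner nodes, each branch is identified by the node at its lower end, and the effect of --exclude-clade-stems is not modeled.

```python
from typing import Dict, List, Optional, Set

BASAL = "basal"


def leaves_below(children: Dict[str, List[str]], node: str) -> Set[str]:
    """Collect the leaf names in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found: Set[str] = set()
    for kid in kids:
        found |= leaves_below(children, kid)
    return found


def single_named_clade(leaves: Set[str], taxon_to_clade: Dict[str, str]) -> Optional[str]:
    """Return the clade if all leaves share one non-basal clade, else None."""
    clades = {taxon_to_clade.get(leaf, BASAL) for leaf in leaves}
    if len(clades) == 1:
        clade = next(iter(clades))
        if clade != BASAL:
            return clade
    return None


def assign_branch_clades(
    children: Dict[str, List[str]], root: str, taxon_to_clade: Dict[str, str]
) -> Dict[str, str]:
    """Map each branch (named by its lower node) to a clade or to basal."""
    all_leaves = leaves_below(children, root)
    assignment: Dict[str, str] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            stack.append(child)
            if not children.get(child):
                # Terminal branch: use the clade list directly.
                assignment[child] = taxon_to_clade.get(child, BASAL)
                continue
            # Inner branch: check both sides of the induced split.
            below = leaves_below(children, child)
            clade = single_named_clade(below, taxon_to_clade) or single_named_clade(
                all_leaves - below, taxon_to_clade
            )
            assignment[child] = clade or BASAL
    return assignment


# Toy tree: clade A forms one subtree (n1) plus a single separate leaf (A3).
children = {
    "root": ["n1", "n2", "D"],
    "n1": ["A1", "A2"],
    "n2": ["B1", "n3"],
    "n3": ["A3", "B2"],
}
taxon_to_clade = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B"}
for branch, clade in sorted(assign_branch_clades(children, "root", taxon_to_clade).items()):
    print(branch, clade)
```

In the toy tree, the unlisted leaf `D` ends up basal, and the inner branches `n2` and `n3` are basal as well because neither side of their splits contains taxa from only one clade.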

`--clade-list-file`

This file describes which taxa of the reference tree are considered to belong to which clade. Each line of the file needs to contain a taxon name of the tree, and the name of the clade it belongs to, separated by a tab:

AF401522_Carchesium_polypinum	Alveolata
X56165_Tetrahymena_thermophila	Alveolata
X03772_Paramecium_tetraurelia	Alveolata
...

Not all taxa of the reference tree have to be part of the file; all missing ones are simply considered to be part of the special basal clade.
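To make the file format concrete, here is a minimal Python sketch that reads such a list into a dictionary. The function name `read_clade_list` is made up for this example, and skipping blank lines and rejecting lines without exactly two fields are assumptions of the sketch; gappa's own parser may behave differently.

```python
import io
from typing import Dict, TextIO


def read_clade_list(handle: TextIO) -> Dict[str, str]:
    """Parse 'taxon<TAB>clade' lines into a taxon -> clade dictionary."""
    mapping: Dict[str, str] = {}
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        parts = line.split("\t")
        if len(parts) != 2 or not parts[0] or not parts[1]:
            raise ValueError(f"line {line_no}: expected 'taxon<TAB>clade'")
        mapping[parts[0]] = parts[1]
    return mapping


example = io.StringIO(
    "AF401522_Carchesium_polypinum\tAlveolata\n"
    "X56165_Tetrahymena_thermophila\tAlveolata\n"
    "X03772_Paramecium_tetraurelia\tAlveolata\n"
)
clades = read_clade_list(example)

# Taxa that are not listed fall back to the special basal clade.
print(clades.get("X03772_Paramecium_tetraurelia", "basal"))
print(clades.get("some_unlisted_taxon", "basal"))
```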

`--color-tree-file`

If provided with an output file name, an svg file is written that shows which branches of the tree were assigned to which clade:

[Figure: Tree with clades marked in color.]

This is useful to verify the process and to make sure that the correct branches were selected. In the figure, the basal branches are gray, while three exemplary clades are marked in color.

The behavior of selecting branches so that their subtrees are monophyletic with respect to a clade is visible here as well: For example, the green clade is split into two subtrees and a few single branches.

Multilevel placement

The command is used in the multilevel placement approach as explained here.

In this workflow, phylogenetic placement is conducted first on a broad tree representing the expected diversity of the sample. After extracting queries (and their reads) from specific clades of interest with this command, a second placement phase is then conducted on trees that contain more representatives of the clades of interest, in order to achieve finer taxonomic resolution.

[Figure: Multilevel placement extraction workflow.]

The command can of course also be used for other purposes where one is interested in just working with placements in a specific clade.

Citation

When using this method, please do not forget to cite:

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070

Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767