Subcommand: extract
Extract placements from clades of the tree and write per-clade jplace files.
Usage: gappa edit extract [options]
Input | |
---|---|
--jplace-path | Required. TEXT:PATH(existing)=[] ... List of jplace files or directories to process. For directories, only files with the extension .jplace[.gz] are processed. |
--clade-list-file | Required. TEXT:FILE File containing a tab-separated list of taxon to clade mapping. |
--fasta-path | TEXT:PATH(existing)=[] ... List of fasta files or directories to process. For directories, only files with the extension .(fasta\|fas\|fsa\|fna\|ffn\|faa\|frn)[.gz] are processed. |
Settings | |
--threshold | FLOAT:FLOAT in [0.5 - 1]=0.95 Threshold of how much placement mass needs to be in a clade for extracting a pquery. |
--exclude-clade-stems | FLAG By default, the branch connecting a specified clade to the rest of the tree is considered part of the clade. With this option, these branches are excluded, and instead considered as basal branches. |
--basal-clade-name | TEXT=basal The name of the clade used for queries that do not fall into one of the specified clades. |
--uncertain-clade-name | TEXT=uncertain The name of the clade used for queries that do not fall into any clade with more than the threshold amount of their mass. |
--point-mass | FLAG Treat every pquery as a point mass concentrated on the highest-weight placement. In other words, ignore all but the most likely placement location (the one with the highest LWR), and set its LWR to 1.0. |
Output | |
--color-tree-file | TEXT:PATH(non-existing) If a path is provided, an svg file with a tree colored by clades is written. |
--samples-out-dir | TEXT=samples Directory to write output samples files to. |
--samples-file-prefix | TEXT File prefix for samples files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--samples-file-suffix | TEXT File suffix for samples files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--sequences-out-dir | TEXT=sequences Directory to write output sequences files to. |
--sequences-file-prefix | TEXT File prefix for sequences files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
--sequences-file-suffix | TEXT File suffix for sequences files. Most gappa commands use the command name as the base name for file output. This option amends the base name, to distinguish runs with different data. |
Global Options | |
--allow-file-overwriting | FLAG Allow to overwrite existing output files instead of aborting the command. |
--verbose | FLAG Produce more verbose output. |
--threads | UINT Number of threads to use for calculations. |
--log-file | TEXT Write all output to a log file, in addition to standard output to the terminal. |
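As an illustration of how the options listed above combine into one call, the following Python sketch assembles a command line from them. The helper name `build_extract_command` and the file names in the usage note are hypothetical; only the subcommand and the flags come from the option list.

```python
import shlex

def build_extract_command(jplace_paths, clade_list_file, threshold=0.95,
                          fasta_paths=(), point_mass=False):
    """Assemble an argument list for `gappa edit extract` (hypothetical helper)."""
    cmd = ["gappa", "edit", "extract",
           "--jplace-path", *jplace_paths,
           "--clade-list-file", clade_list_file,
           "--threshold", str(threshold)]
    if fasta_paths:
        # Optional: also write per-clade fasta files.
        cmd += ["--fasta-path", *fasta_paths]
    if point_mass:
        # Optional flag, takes no value.
        cmd.append("--point-mass")
    return cmd

def as_shell_line(cmd):
    """Render the argument list as a single shell-quoted line."""
    return shlex.join(cmd)
```

For example, `build_extract_command(["placements/"], "clades.tsv", fasta_paths=["reads.fasta"])` yields a list that could be passed to `subprocess.run`; the paths here are placeholders.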
The command extracts the queries that are placed in specified clades of the reference tree and writes per-clade jplace files.
The command takes one or more jplace files as input, as well as a file describing clades of the reference tree used in the jplace files.
It then finds all placements in those clades and writes per-clade placement files, each containing only those placements that have more of their mass (likelihood weight ratios) in that clade than specified by --threshold.
Furthermore, two special clades are produced: basal, which collects all placements that have their mass on branches that do not belong to any clade, and uncertain, which collects placements for which no clade (including the basal clade) holds more than the threshold amount of their mass (i.e., the placement's mass is distributed across multiple clades).
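The threshold rule can be sketched in a few lines of Python. This is a minimal illustration, not gappa's implementation: the function name, the `(edge_id, lwr)` tuple layout, and the `branch_clade` dictionary are assumptions made for the example. The sketch uses the LWRs as given (no normalization) and compares with `>=` so that a threshold of 1.0 remains satisfiable; whether gappa's comparison is strict is not stated here.

```python
from collections import defaultdict

def assign_pquery(placements, branch_clade, threshold=0.95,
                  basal="basal", uncertain="uncertain", point_mass=False):
    """Return the clade name for one pquery.

    placements:   list of (edge_id, lwr) tuples (hypothetical layout)
    branch_clade: dict mapping edge_id -> clade name; missing edges are basal
    """
    if point_mass:
        # Mimic --point-mass: keep only the most likely placement, with LWR 1.0.
        best_edge, _ = max(placements, key=lambda p: p[1])
        placements = [(best_edge, 1.0)]

    # Sum the likelihood weight ratios per clade (basal counts as a clade).
    mass = defaultdict(float)
    for edge_id, lwr in placements:
        mass[branch_clade.get(edge_id, basal)] += lwr

    # Assumption: ">=" is used; the text above says "more than".
    clade, clade_mass = max(mass.items(), key=lambda kv: kv[1])
    return clade if clade_mass >= threshold else uncertain
```

A pquery with its mass split 0.6 / 0.4 between two clades ends up in the uncertain clade at the default threshold, but is assigned to the first clade when `point_mass=True`.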
Furthermore, if a set of fasta sequence files is provided, the command also creates per-clade fasta files containing the sequences that correspond to the placements in the jplace files. This requires that the sequences have the same names as the placements, which is the case if the placement files are simply the result of placing those sequences on a reference tree.
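The name matching between placements and sequences can be pictured as a simple dictionary lookup. The following is a sketch under stated assumptions: both function names are hypothetical, the inputs are plain dictionaries rather than parsed files, and sequences without a matching placement name are skipped here, which may differ from what gappa does.

```python
def group_sequences_by_clade(pquery_clades, sequences):
    """Group sequences by the clade their equally named pquery was assigned to.

    pquery_clades: dict mapping pquery name -> clade name
    sequences:     dict mapping sequence name -> sequence string
    """
    per_clade = {}
    for name, seq in sequences.items():
        clade = pquery_clades.get(name)
        if clade is None:
            # Assumption of this sketch: unmatched sequences are skipped.
            continue
        per_clade.setdefault(clade, []).append((name, seq))
    return per_clade

def to_fasta(records):
    """Format (name, sequence) pairs as fasta text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```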
The algorithm assigns a clade to each of the branches of the reference tree (either one of the specified ones, or the basal clade).
For terminal branches (leaves), the assigned clade is the one specified in the --clade-list-file.
Inner branches are assigned to a clade if all leaves on one side of the split that is induced by the branch belong to the same clade. In other words, all branches of a subtree that contains only taxa from one clade are assigned to that clade. The option --exclude-clade-stems controls whether the branch that connects a clade to the rest of the tree is included in the set of branches for the clade.
See the figure below for an example.
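The branch assignment described above can be sketched as follows. This is an illustrative simplification, not gappa's code: the tree is given as a hypothetical `children` dictionary with an arbitrary root, each branch is identified by the node at its lower end, and the --exclude-clade-stems option is not modeled. If both sides of a split happen to be pure for different clades, the sketch picks the side below the branch, which is an arbitrary choice.

```python
def assign_branch_clades(children, root, leaf_clade, basal="basal"):
    """Assign a clade to every branch of a rooted tree representation.

    children:   dict mapping node -> list of child nodes (leaves have no entry)
    leaf_clade: dict mapping leaf name -> clade; unlisted leaves are basal
    Returns a dict mapping child node -> clade of the branch leading to it.
    """
    def leaves_below(node):
        kids = children.get(node, [])
        if not kids:
            return frozenset([node])
        return frozenset().union(*(leaves_below(k) for k in kids))

    all_leaves = leaves_below(root)

    def single_clade(leaves):
        # The one clade shared by all given leaves, or None if they are mixed.
        clades = {leaf_clade.get(leaf, basal) for leaf in leaves}
        return clades.pop() if len(clades) == 1 else None

    result = {}
    stack = [root]
    while stack:
        node = stack.pop()
        for kid in children.get(node, []):
            below = leaves_below(kid)
            # A branch belongs to a clade if either side of its split is pure.
            clade = single_clade(below) or single_clade(all_leaves - below)
            result[kid] = clade or basal
            stack.append(kid)
    return result
```

With a tree where taxa `a1` and `a2` form a subtree and a third taxon `a3` of the same clade sits inside a mixed subtree, the subtree branches and the single terminal branch of `a3` are assigned to the clade, while the mixed inner branches stay basal. This mirrors the split of one clade into a subtree plus single branches.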
The file given via --clade-list-file describes which taxa of the reference tree are considered to belong to which clade. Each line of the file needs to contain a taxon name of the tree and the name of the clade it belongs to, separated by a tab:
AF401522_Carchesium_polypinum Alveolata
X56165_Tetrahymena_thermophila Alveolata
X03772_Paramecium_tetraurelia Alveolata
...
Not all taxa of the reference tree have to be part of the file; all missing ones are simply considered to be part of the special basal clade.
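To make the file format concrete, here is a minimal Python sketch that reads such a mapping. The function name is hypothetical, and skipping blank lines and rejecting lines without exactly one tab are choices of this sketch, not documented gappa behavior.

```python
def parse_clade_list(text):
    """Parse tab-separated 'taxon<TAB>clade' lines into a dict."""
    taxon_to_clade = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected 'taxon<TAB>clade'")
        taxon, clade = (p.strip() for p in parts)
        taxon_to_clade[taxon] = clade
    return taxon_to_clade

def clade_of(taxon, taxon_to_clade, basal="basal"):
    """Taxa missing from the file fall into the basal clade."""
    return taxon_to_clade.get(taxon, basal)
```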
If --color-tree-file is given an output file name, an svg file is written that shows which branches of the tree were assigned to which clade.
This is useful to verify the process and to make sure that the correct branches were selected. In the figure, the basal branches are gray, while three exemplary clades are marked in color.
The behavior of selecting branches so that their subtrees are monophyletic with respect to a clade is visible here as well: For example, the green clade is split into two subtrees and a few single branches.
The command is used in the multilevel placement approach described in Czech et al. (2018), cited below.
In this workflow, phylogenetic placement is conducted first on a broad tree representing the expected diversity of the sample. After extracting queries (and their reads) from specific clades of interest with this command, a second placement phase is then conducted on trees that contain more representatives of the clades of interest, in order to achieve finer taxonomic resolution.
The command can of course also be used for other purposes where one is interested in just working with placements in a specific clade.
When using this method, please do not forget to cite:
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Genesis and Gappa: Processing, Analyzing and Visualizing Phylogenetic (Placement) Data. Bioinformatics, 2020. doi:10.1093/bioinformatics/btaa070
Lucas Czech, Pierre Barbera, Alexandros Stamatakis. Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement. Bioinformatics, 2018. doi:10.1093/bioinformatics/bty767