Subcommand: phat
Generate consensus sequences from a sequence database according to the PhAT method.
Usage: gappa prepare phat [options]
Input | |
---|---|
--taxonomy-file | Required. TEXT File that lists the taxa of the database. |
--sequence-file | Required. TEXT Fasta file containing the sequences of the database. |

Taxonomy Expansion | |
---|---|
--target-size | Required. UINT Target size of how many taxa to select for building consensus sequences. |
--sub-taxonomy | TEXT If a taxopath from the taxonomy is provided, only the respective sub-taxonomy is used. |
--min-subclade-size | UINT Minimal size of sub-clades. Everything below is expanded. |
--max-subclade-size | UINT Maximal size of a non-expanded sub-clade. Everything bigger is first expanded. |
--min-tax-level | UINT Minimal taxonomic level. Taxa below this level are always expanded. |
--allow-approximation | Allow expanding taxa that help to get closer to the --target-size, even if they are not the ones with the highest entropy. |
--no-taxa-selection | If set, no taxa selection using entropy is performed. Instead, all taxa on all levels/ranks are used and consensus sequences for all of them are calculated. This is useful for testing and for trying out new ideas. |

Consensus Method | |
---|---|
--consensus-method | TEXT in {cavener,majorities,threshold}=majorities Consensus method to use for combining sequences. |
--consensus-threshold | FLOAT=0.5 Needs: --consensus-method Threshold value to use with --consensus-method threshold. Has to be in [ 0.0, 1.0 ]. |

Output | |
---|---|
--out-dir | TEXT=. Directory to write files to. |
--write-info-files | If set, two additional info files are written, containing the new pruned taxonomy, as well as the entropy of all clades of the original taxonomy. |

Given a set of sequences and a fitting taxonomy, the command produces consensus sequences representing taxonomic clades, according to our PhAT method as described here.
The main inputs are --sequence-file and --taxonomy-file, which provide the input data, as well as the --target-size of how many consensus sequences to build.
After running the command, the resulting set of sequences can be used to infer a reference tree using any tree inference program.
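A minimal sketch of how such a call could be assembled from a script, assuming a `gappa` binary is available on the PATH. The helper name `build_phat_command` and the file names in the usage note are hypothetical; only options listed in the tables above are used. The check that `--consensus-threshold` is combined with the `threshold` method is a stricter assumption of this sketch, not necessarily what gappa itself enforces.

```python
def build_phat_command(taxonomy_file, sequence_file, target_size,
                       out_dir=".", consensus_method="majorities",
                       consensus_threshold=None):
    """Assemble the argument list for `gappa prepare phat` (hypothetical helper)."""
    if consensus_method not in ("cavener", "majorities", "threshold"):
        raise ValueError("unknown consensus method: " + consensus_method)

    # The three required options, plus method and output directory.
    cmd = [
        "gappa", "prepare", "phat",
        "--taxonomy-file", str(taxonomy_file),
        "--sequence-file", str(sequence_file),
        "--target-size", str(int(target_size)),
        "--consensus-method", consensus_method,
        "--out-dir", str(out_dir),
    ]

    if consensus_threshold is not None:
        # Assumption of this sketch: only pass a threshold together with
        # the "threshold" consensus method.
        if consensus_method != "threshold":
            raise ValueError("--consensus-threshold requires the threshold method")
        # The documented valid range is [ 0.0, 1.0 ].
        if not 0.0 <= consensus_threshold <= 1.0:
            raise ValueError("--consensus-threshold has to be in [ 0.0, 1.0 ]")
        cmd += ["--consensus-threshold", str(consensus_threshold)]

    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, for example with `build_phat_command("taxonomy.txt", "sequences.fasta", 500)`.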
The taxonomy file needs to contain a list of the taxa used for the taxonomic expansion algorithm. Each line of the file lists a semicolon-separated taxonomic clade. Everything after the first tab is ignored.
Example:
Eukaryota; 4 domain
Eukaryota;Amoebozoa; 4052 kingdom 119
Eukaryota;Amoebozoa;Myxogastria; 4094 phylum 119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete; 4095 genus 119
Eukaryota;Amoebozoa;Myxogastria;Badhamia; 4096 genus 119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia; 4097 genus 119
Eukaryota;Amoebozoa;Myxogastria;Comatricha; 4098 genus 119
...
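The line format described above can be read with a few lines of Python. This is only an illustrative sketch, not gappa's own parser; the function name `parse_taxonomy_line` is hypothetical, and the input is assumed to use a tab between the taxonomic path and the remaining columns.

```python
def parse_taxonomy_line(line):
    """Return the list of taxonomic levels of one taxonomy-file line."""
    # Everything after the first tab is ignored.
    taxopath = line.rstrip("\n").split("\t", 1)[0]
    # Levels are separated by semicolons. A trailing semicolon yields an
    # empty entry, which is dropped together with surrounding whitespace.
    return [level.strip() for level in taxopath.split(";") if level.strip()]
```

Called on a line such as `"Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t119"`, the function keeps only the three levels of the path and discards the extra columns.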
The sequence file needs to be in fasta format, and contain sequences that are labelled with the taxonomic path that they belong to. This taxonomic path can either be the whole label, or everything after the first whitespace (space or tab). This allows sequences to carry a unique identifier as the first part of the label.
For example, sequences in the Silva database are labelled like this:
>AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata
>JQ031957.1.4380 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia maxima
In the example, each label starts with a unique identifier, followed by a space and the taxonomic path that the sequence belongs to. The path contains an additional taxonomic level which is not present in the taxonomy of the database. If this occurs, the last level is assumed to be the species level, and is removed from the path. The resulting taxonomic path is part of the taxonomy, and hence the sequence can be used.
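The label rules above can be sketched as follows. This is an illustration under stated assumptions, not gappa's implementation: the names `split_levels` and `taxopath_from_label` are hypothetical, the taxonomy is represented as a set of tuples of level names, and the order in which the two label interpretations are tried (whole label first) is a guess.

```python
def split_levels(taxopath):
    """Split a semicolon-separated taxonomic path into a tuple of levels."""
    return tuple(l.strip() for l in taxopath.split(";") if l.strip())


def taxopath_from_label(label, taxonomy):
    """Map a fasta label to a path contained in `taxonomy`, or None.

    `taxonomy` is a set of tuples of level names (assumed representation).
    """
    label = label.lstrip(">").strip()
    # The path is either the whole label, or everything after the
    # first whitespace (space or tab).
    candidates = [label]
    parts = label.split(None, 1)
    if len(parts) == 2:
        candidates.append(parts[1])

    for candidate in candidates:
        levels = split_levels(candidate)
        if levels in taxonomy:
            return levels
        # An extra last level is assumed to be the species and removed.
        if len(levels) > 1 and levels[:-1] in taxonomy:
            return levels[:-1]
    return None
```

With a taxonomy that ends at genus level, the first Silva label from the example resolves to the path ending in `Amaurochaete`, because the trailing species level `Amaurochaete comata` is dropped.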
If --sub-taxonomy is provided with a semicolon-separated taxonomic path (e.g., Eukaryota;Amoebozoa;), only this subclade is used for the algorithm. That is, the algorithm behaves as if the --taxonomy-file and --sequence-file only contained the taxa and sequences of the provided clade.
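The effect of such a restriction can be pictured as a prefix filter on taxonomic paths. The sketch below is only an illustration: the function name `restrict_to_subclade` is hypothetical, paths are assumed to be tuples of level names, and whether gappa keeps the root of the subclade itself is an assumption here.

```python
def restrict_to_subclade(taxopaths, sub_taxonomy):
    """Keep only the paths that lie within the given subclade."""
    # Turn e.g. "Eukaryota;Amoebozoa;" into ("Eukaryota", "Amoebozoa").
    prefix = tuple(l.strip() for l in sub_taxonomy.split(";") if l.strip())
    # A path belongs to the subclade if it starts with the prefix.
    # Paths shorter than the prefix never match.
    return [p for p in taxopaths if tuple(p[:len(prefix)]) == prefix]
```

The same filter can be applied to the paths derived from the sequence labels, so that taxa and sequences outside the provided clade are both left out.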