Subcommand: phat

Lucas Czech edited this page Aug 24, 2018 · 9 revisions

Generate consensus sequences from a sequence database according to the PhAT method.

Usage: gappa prepare phat [options]

Options

Input
--taxonomy-file Required. TEXT
File that lists the taxa of the database.
--sequence-file Required. TEXT
Fasta file containing the sequences of the database.
Taxonomy Expansion
--target-size Required. UINT
Target size of how many taxa to select for building consensus sequences.
--sub-taxonomy TEXT
If a taxopath from the taxonomy is provided, only the respective sub-taxonomy is used.
--min-subclade-size UINT
Minimal size of sub-clades. Everything below is expanded.
--max-subclade-size UINT
Maximal size of a non-expanded sub-clade. Everything bigger is expanded first.
--min-tax-level UINT
Minimal taxonomic level. Taxa below this level are always expanded.
--allow-approximation
Allow expanding taxa that help to get closer to the --target-size, even if they are not the ones with the highest entropy.
--no-taxa-selection
If set, no taxa selection using entropy is performed. Instead, all taxa on all levels/ranks are used, and consensus sequences are calculated for all of them. This is useful for testing and for trying out new ideas.
Consensus Method
--consensus-method TEXT in {cavener,majorities,threshold}=majorities
Consensus method to use for combining sequences.
--consensus-threshold FLOAT=0.5 Needs: --consensus-method
Threshold value to use with --consensus-method threshold. Has to be in [ 0.0, 1.0 ].
Output
--out-dir TEXT=.
Directory to write files to.
--write-info-files
If set, two additional info files are written, containing the new pruned taxonomy, as well as the entropy of all clades of the original taxonomy.

Description

Given a set of sequences and a fitting taxonomy, the command produces consensus sequences representing taxonomic clades, according to our PhAT method as described here. The main inputs are --sequence-file and --taxonomy-file, which provide the input data, as well as the --target-size of how many consensus sequences to build.

PhAT workflow.

After running the command, the resulting set of sequences can be used to infer a reference tree using any tree inference program.
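
The three required options can be combined into a call as sketched below. This is only an illustration: the helper `build_phat_command`, the file names, and the target size of 512 are hypothetical placeholders, and the snippet merely assembles the command line from the options listed above without running it.

```python
def build_phat_command(taxonomy_file, sequence_file, target_size, out_dir="."):
    """Assemble the argument list for `gappa prepare phat` (hypothetical helper)."""
    return [
        "gappa", "prepare", "phat",
        "--taxonomy-file", taxonomy_file,
        "--sequence-file", sequence_file,
        "--target-size", str(target_size),
        "--out-dir", out_dir,
    ]

# Placeholder file names and target size; replace them with your own data.
cmd = build_phat_command("taxonomy.tsv", "sequences.fasta", 512, "phat_out")
print(" ".join(cmd))
# gappa prepare phat --taxonomy-file taxonomy.tsv --sequence-file sequences.fasta --target-size 512 --out-dir phat_out
```

Once the input files exist and gappa is installed, the list can be passed to `subprocess.run(cmd, check=True)` to execute the command.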

Details

--taxonomy-file

The taxonomy file needs to contain a list of the taxa used for the taxonomic expansion algorithm. Each line of the file lists a semicolon-separated taxonomic clade. Everything after the first tab is ignored.

Example:

Eukaryota;	4	domain
Eukaryota;Amoebozoa;	4052	kingdom		119
Eukaryota;Amoebozoa;Myxogastria;	4094	phylum		119
Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;	4095	genus		119
Eukaryota;Amoebozoa;Myxogastria;Badhamia;	4096	genus		119
Eukaryota;Amoebozoa;Myxogastria;Brefeldia;	4097	genus		119
Eukaryota;Amoebozoa;Myxogastria;Comatricha;	4098	genus		119
...
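
To make the format concrete, the following sketch shows one way such a line could be split into its taxonomic levels. It is not gappa's own parser; the function name `parse_taxonomy_line` is made up for this illustration, and stripping whitespace around names is an assumption.

```python
def parse_taxonomy_line(line):
    """Return the taxon names of one taxonomy line, from highest to lowest rank."""
    # Everything after the first tab is ignored.
    taxopath = line.rstrip("\n").split("\t", 1)[0]
    # Split at semicolons; the trailing semicolon yields an empty element, which is dropped.
    return [name.strip() for name in taxopath.split(";") if name.strip()]

example = "Eukaryota;Amoebozoa;Myxogastria;\t4094\tphylum\t\t119"
print(parse_taxonomy_line(example))
# ['Eukaryota', 'Amoebozoa', 'Myxogastria']
```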

--sequence-file

The sequence file needs to be in fasta format, and contain sequences that are labelled with the taxonomic path that they belong to. This taxonomic path can either be the whole label, or everything after the first whitespace (space or tab). This allows sequences to have a unique identifier as the first part of their label.

For example, sequences in the Silva database are labelled like this:

>AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata
>JQ031957.1.4380 Eukaryota;Amoebozoa;Myxogastria;Brefeldia;Brefeldia maxima

In the example, each label starts with a unique identifier, followed by a space and the taxonomic path that the sequence belongs to. The path contains an additional taxonomic level which is not present in the database. If this occurs, the last level is assumed to be species level, and is removed from the path. The resulting taxonomic path is part of the taxonomy, and hence the sequence can be used.
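
The label handling described above can be sketched as follows. This is a simplified illustration and not gappa's actual implementation: the function `taxopath_from_label`, the representation of the taxonomy as a set of tuples, and the order in which the two label interpretations are tried are all assumptions.

```python
def taxopath_from_label(label, taxonomy):
    """Map a fasta label to a taxonomic path contained in `taxonomy`, or return None."""
    label = label.lstrip(">").strip()
    parts = label.split(None, 1)  # split at the first run of whitespace

    # The path is either everything after the first whitespace, or the whole label.
    candidates = [parts[1]] if len(parts) == 2 else []
    candidates.append(label)

    for candidate in candidates:
        path = tuple(n.strip() for n in candidate.split(";") if n.strip())
        if path in taxonomy:
            return path
        # Otherwise, assume the last level is the species and remove it.
        if len(path) > 1 and path[:-1] in taxonomy:
            return path[:-1]
    return None

# A tiny taxonomy, taken from the --taxonomy-file example above.
taxonomy = {
    ("Eukaryota",),
    ("Eukaryota", "Amoebozoa"),
    ("Eukaryota", "Amoebozoa", "Myxogastria"),
    ("Eukaryota", "Amoebozoa", "Myxogastria", "Amaurochaete"),
}
label = ">AY842031.1.1855 Eukaryota;Amoebozoa;Myxogastria;Amaurochaete;Amaurochaete comata"
print(taxopath_from_label(label, taxonomy))
# ('Eukaryota', 'Amoebozoa', 'Myxogastria', 'Amaurochaete')
```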

--sub-taxonomy

If provided with a semicolon-separated taxonomic path (e.g., Eukaryota;Amoebozoa;), only this subclade is used for the algorithm. That is, the algorithm behaves as if the --taxonomy-file and --sequence-file only contained the taxa and sequences of the provided clade.
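
Conceptually, this corresponds to a prefix filter on taxonomic paths, as in the sketch below. The helper names `parse_path` and `restrict_to_subtaxonomy` are hypothetical, and whether the named clade itself is kept in the result is an assumption of this illustration, not a statement about gappa's internals.

```python
def parse_path(text):
    """Split a semicolon-separated taxopath into a tuple of taxon names."""
    return tuple(n.strip() for n in text.split(";") if n.strip())

def restrict_to_subtaxonomy(paths, sub_taxonomy):
    """Keep only the paths that lie within the given sub-taxonomy."""
    prefix = parse_path(sub_taxonomy)
    return [p for p in paths if p[:len(prefix)] == prefix]

paths = [
    parse_path("Eukaryota;"),
    parse_path("Eukaryota;Amoebozoa;"),
    parse_path("Eukaryota;Amoebozoa;Myxogastria;"),
    parse_path("Eukaryota;Archaeplastida;"),
]
for path in restrict_to_subtaxonomy(paths, "Eukaryota;Amoebozoa;"):
    print(";".join(path) + ";")
# Eukaryota;Amoebozoa;
# Eukaryota;Amoebozoa;Myxogastria;
```

The same filter would be applied to the taxonomic paths of the sequences, so that only sequences of the selected clade take part in the algorithm.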
