Subcommand: chunkify

Chunkify a set of fasta files and create abundance maps.

Usage: gappa prepare chunkify [options]

Options

Sequence Input
`--fasta-path`	Required. `TEXT ...` List of fasta files or directories to process. For directories, only files with the extension .(fasta|fas|fsa|fna|ffn|faa|frn) are processed.
Settings
`--chunk-size`	`UINT=50000` Number of sequences per chunk file.
`--min-abundance`	`UINT=1` Minimum abundance of a single sequence. Sequences with an abundance below this value are filtered out.
`--hash-function`	`TEXT in {MD5,SHA1,SHA256}=SHA1` Hash function for re-naming and identifying sequences.
Output
`--chunks-out-dir`	`TEXT=.` Directory to write chunk files to.
`--chunk-file-prefix`	`TEXT=chunk_` File prefix for chunk files.
`--abundances-out-dir`	`TEXT=.` Directory to write abundance files to.
`--abundance-file-prefix`	`TEXT=abundances_` File prefix for abundance files.

Description

The command takes one or more fasta files as input, e.g., each representing an environmental sample. It then writes out numbered chunk files of equal size, containing the unique sequences of the input. For each input file, it also writes an abundance map file, which stores the per-sequence abundances in that input file. To identify unique sequences, it uses a hash value of the sequence data, which is also assigned as the new name of each sequence in the chunks.
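The idea described above can be illustrated with a minimal Python sketch. This is not gappa's implementation and does not reproduce its file formats: the function name `chunkify`, the in-memory `samples` dictionary, and the shape of the abundance maps (hash to list of original names) are assumptions made purely for illustration. Filtering by `--min-abundance` is left out.

```python
import hashlib


def chunkify(samples, chunk_size=50000):
    """Deduplicate sequences across samples and split them into chunks.

    samples: dict mapping a sample name to a list of (name, sequence) pairs.
    Returns (chunks, abundance_maps). Both structures are illustrative
    assumptions, not gappa's actual output formats.
    """
    unique = {}          # hash -> sequence, in first-seen order
    abundance_maps = {}  # sample -> {hash: [original sequence names]}

    for sample, records in samples.items():
        per_sample = {}
        for name, seq in records:
            # The hash of the sequence data serves as its new name.
            digest = hashlib.sha1(seq.encode("ascii")).hexdigest()
            unique.setdefault(digest, seq)
            per_sample.setdefault(digest, []).append(name)
        abundance_maps[sample] = per_sample

    # Split the unique sequences into chunks of at most chunk_size entries;
    # only the last chunk can be smaller.
    items = list(unique.items())
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    return chunks, abundance_maps


samples = {
    "sample_a": [("a1", "ACGT"), ("a2", "ACGT"), ("a3", "GGCC")],
    "sample_b": [("b1", "GGCC"), ("b2", "TTAA")],
}
chunks, abundance_maps = chunkify(samples, chunk_size=2)
```

In this toy input, five sequences reduce to three unique ones, so only three sequences would need to be placed. The abundance of a sequence in a sample is the length of its name list in that sample's map.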

The produced chunk files are intended as input for phylogenetic placement (after first aligning them to the reference, if needed). Using chunks of equal size ensures relatively stable run times for each chunk, so that large datasets can be processed efficiently on a computer cluster. Furthermore, as the chunks only contain unique sequences, compute time is further reduced.

After phylogenetic placement has finished, the unchunkify command takes the per-chunk placement files as well as the abundance map files produced here, and creates placement files for each of the original input files, with all abundances and original sequence names restored. Thus, the combination of these two commands achieves the same effect as placing each input file separately, but lowers computational cost and maximizes load balancing.
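The restoration step can be sketched conceptually in Python as follows. This is a hedged illustration only, not the unchunkify implementation: the function name `restore_samples`, the placeholder placement values such as `"edge_3"`, the shortened hash keys, and the map layout (hash to list of original names) are all hypothetical, and real placement files are not read or written here.

```python
def restore_samples(chunk_placements, abundance_maps):
    """Map per-chunk placements back onto the original samples.

    chunk_placements: list of dicts, one per chunk, mapping a sequence hash
        to its placement (treated here as an opaque value).
    abundance_maps: dict mapping a sample name to {hash: [original names]}.
    Both layouts are assumptions for illustration.
    """
    # Merge all chunks into a single lookup from hash to placement.
    by_hash = {}
    for chunk in chunk_placements:
        by_hash.update(chunk)

    # For each sample, give every original sequence name the placement
    # that was computed once for its hash.
    per_sample = {}
    for sample, mapping in abundance_maps.items():
        per_sample[sample] = {
            name: by_hash[digest]
            for digest, names in mapping.items()
            for name in names
            if digest in by_hash
        }
    return per_sample


# Hypothetical toy data: hashes are shortened placeholders.
chunk_placements = [{"h1": "edge_3", "h2": "edge_7"}, {"h3": "edge_1"}]
abundance_maps = {
    "sample_a": {"h1": ["a1", "a2"], "h2": ["a3"]},
    "sample_b": {"h2": ["b1"], "h3": ["b2"]},
}
restored = restore_samples(chunk_placements, abundance_maps)
```

Each unique sequence is placed once, yet every sample ends up with one placement entry per original sequence, which is why the result matches placing each input file separately.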
