Workflow
The first stage of the workflow creates two initial input channels: `EditFasta` and `OldFasta`. The `EditFasta` channel is created from `params.addFasta` and should contain the new fasta files to be added to the database. The channel uses the mapping `.map{ file -> tuple(file.getParent().getName(), file) }` to pair each fasta file with the name of the directory it sits in (the fastas should be sorted into one directory per taxon, where the directory name is the taxon name with words separated by underscores). The `OldFasta` channel is created from `params.previousDatabase` and should contain fasta files from a previous database build, which have already been mapped to their tax ID. If there is no previous database build, `params.previousDatabase` should be set to `null`, which creates an empty channel for `OldFasta`.
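As a rough illustration of what that mapping produces, the following Python sketch mirrors the `file.getParent().getName()` logic outside of Nextflow. The function name and the example path are hypothetical and only serve to show the shape of the resulting tuple.

```python
from pathlib import Path


def key_by_taxon_dir(fasta_path):
    """Pair a fasta path with the name of the directory that contains it."""
    path = Path(fasta_path)
    return (path.parent.name, path)


# Hypothetical layout: one directory per taxon, words joined by underscores.
taxon, fasta = key_by_taxon_dir("newFasta/Mycobacterium_bovis/sample1.fasta")
print(taxon)       # Mycobacterium_bovis
print(fasta.name)  # sample1.fasta
```

The directory name is the only source of taxon information at this point, so a fasta placed in the wrong directory would be paired with the wrong taxon.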
The main workflow then calls three component workflows: `prepareNewFasta`, `selectFasta` and `krakenBuild`.
The `prepareNewFasta` workflow component takes the new fasta files to be added to the database (i.e. the `EditFasta` channel), looks up the tax ID for each, and adds it to the sequence headers and the filename. Fasta files from previous database builds (i.e. the `OldFasta` channel) skip this stage.
`prepareNewFasta` contains one process, `autoDatabase_addTaxon`. This process takes an individual fasta and its taxon/directory name as input and uses the `taxadd` script to add the tax ID to the sequence headers and filename. The `taxadd` script uses the NCBI taxonomy `names.dmp` file to look up the tax ID. The output of `autoDatabase_addTaxon` is the fasta tagged with its tax ID.
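The lookup step can be pictured with a minimal Python sketch. This is not the `taxadd` script itself: the function names are hypothetical, matching on the "scientific name" class only is an assumption, and the two sample rows are illustrative rather than copied from a real `names.dmp`. The only things taken as given are the `names.dmp` field layout (fields separated by tab, pipe, tab) and the fact that the tax ID ends up at the start of the filename, followed by an underscore.

```python
import io


def load_scientific_names(names_dmp):
    """Map scientific name -> tax ID from an open names.dmp handle."""
    name_to_taxid = {}
    for line in names_dmp:
        # Each row looks like: tax_id \t|\t name \t|\t unique name \t|\t class \t|
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) < 4:
            continue
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            name_to_taxid[name] = taxid
    return name_to_taxid


def taxid_for_directory(dir_name, name_to_taxid):
    """Turn an underscore-separated directory name into a tax ID, or None."""
    return name_to_taxid.get(dir_name.replace("_", " "))


# Illustrative rows in names.dmp layout.
sample = io.StringIO(
    "1773\t|\tMycobacterium tuberculosis\t|\t\t|\tscientific name\t|\n"
    "1773\t|\tBacterium tuberculosis\t|\t\t|\tsynonym\t|\n"
)
names = load_scientific_names(sample)
taxid = taxid_for_directory("Mycobacterium_tuberculosis", names)
print(f"{taxid}_sample1.fasta")  # 1773_sample1.fasta
```

A directory name with no match returns `None` here, so a caller would need to decide whether to skip the file or stop.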
Once the `prepareNewFasta` workflow component has finished executing, the `AllFasta` channel is created, containing the output fasta files from `autoDatabase_addTaxon` and the fasta files from `OldFasta`. Each fasta is paired with its tax ID, which is scraped from the filename using `.map{ file -> tuple(file.getName().split("_")[0], file) }`. The files are then grouped by tax ID using `.groupTuple(sort: true)`.
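The effect of the filename split followed by grouping can be sketched in plain Python. The function name and filenames below are hypothetical. The sketch assumes, as the mapping above implies, that every filename starts with the tax ID followed by an underscore.

```python
from collections import defaultdict
from pathlib import Path


def group_by_taxid(fasta_paths):
    """Group fasta filenames by the tax ID prefix before the first underscore."""
    groups = defaultdict(list)
    for fasta_path in fasta_paths:
        name = Path(fasta_path).name
        taxid = name.split("_")[0]
        groups[taxid].append(name)
    # Sort within each group, in the spirit of groupTuple(sort: true).
    return {taxid: sorted(names) for taxid, names in groups.items()}


grouped = group_by_taxid([
    "new/1773_sampleB.fasta",
    "old/1765_sampleA.fasta",
    "old/1773_sampleA.fasta",
])
print(grouped["1773"])  # ['1773_sampleA.fasta', '1773_sampleB.fasta']
```

A file whose name lacks the tax ID prefix would be grouped under whatever text precedes its first underscore, which is why new fastas must pass through `prepareNewFasta` first.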
The `selectFasta` workflow component takes the `AllFasta` channel and selects high quality assemblies using Mash. Parallelisation is by taxon.
The first process, `autoDatabase_mash`, takes the `AllFasta` channel and the tax ID as input. It calculates the pairwise Mash distances for each taxon and writes them to a text file named `${taxid}_mashdist.txt`.
The next process, `autoDatabase_qc`, is the quality control stage of the workflow, which selects the high quality assemblies that go on to form the database. This process takes the output text file from `autoDatabase_mash` for each taxon and uses the `fastaselect` script to output a text file listing the high quality assemblies for that taxon. The `fastaselect` script builds a Mash distance matrix, finds the average distance for each assembly, and finds the mode of these averages to 2 s.f. The filenames of assemblies whose average distance is within 10% of the mode are then recorded in a text file. These are the high quality assemblies that go on to build the database. If there are fewer than three samples for a taxon, the assembly or assemblies for that taxon are added to the database with no quality control.
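The selection rule described here can be sketched in Python as follows. This is an interpretation, not the `fastaselect` script: the function names are hypothetical, the input is assumed to be tab-separated `mash dist` output (reference, query, distance, p-value, shared hashes), self-comparisons are left out of each average, and "within 10% of the mode" is read as an absolute difference of at most 10% of the mode.

```python
import math
from collections import defaultdict
from statistics import mean, mode


def round_sig(value, sig=2):
    """Round a number to the given count of significant figures."""
    if value == 0:
        return 0.0
    return round(value, sig - 1 - math.floor(math.log10(abs(value))))


def select_assemblies(mash_lines, tolerance=0.10, min_samples=3):
    """Return the assemblies whose average distance lies near the mode."""
    distances = defaultdict(dict)
    for line in mash_lines:
        if not line.strip():
            continue
        ref, query, dist = line.rstrip("\n").split("\t")[:3]
        # Store both directions so the matrix is symmetric.
        distances[ref][query] = float(dist)
        distances[query][ref] = float(dist)
    names = sorted(distances)
    if len(names) < min_samples:
        return names  # too few samples for QC, keep everything
    averages = {}
    for name in names:
        others = [d for other, d in distances[name].items() if other != name]
        averages[name] = mean(others) if others else 0.0
    typical = mode([round_sig(avg) for avg in averages.values()])
    return [n for n in names if abs(averages[n] - typical) <= tolerance * typical]


# Made-up distances: a, b and c are close to each other, d is an outlier.
lines = [
    "a.fasta\tb.fasta\t0.01\t0\t900/1000",
    "a.fasta\tc.fasta\t0.01\t0\t900/1000",
    "b.fasta\tc.fasta\t0.01\t0\t900/1000",
    "a.fasta\td.fasta\t0.2\t0\t10/1000",
    "b.fasta\td.fasta\t0.2\t0\t10/1000",
    "c.fasta\td.fasta\t0.2\t0\t10/1000",
]
print(select_assemblies(lines))  # ['a.fasta', 'b.fasta', 'c.fasta']
```

Rounding the averages before taking the mode is what allows several assemblies to share a modal value; unrounded averages would rarely coincide exactly.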
The final process in `selectFasta` is `autoDatabase_cleanFasta`. It is a serial process that takes the lists of high quality assemblies output by `autoDatabase_qc` together with the `AllFasta` channel, and moves the fasta files listed in the text files to the `assemblies` directory. This creates a channel containing all the high quality assemblies, which can be passed to the Kraken2 database building stage.
The `krakenBuild` workflow component takes the high quality assemblies and builds a database using Kraken2.
The `autoDatabase_kraken2Build` process takes the high quality fasta files from `selectFasta` as input, collecting them using `.collect()`, and outputs the `.k2d` Kraken2 database files. The script for `autoDatabase_kraken2Build` first downloads the May 2020 taxonomy from the NCBI and amends the taxon information for Mycobacterium tomidae in `names.dmp` and `nodes.dmp` (as tomidae is currently absent from the NCBI taxonomy, it is assigned a tax ID one greater than the largest existing tax ID). It then moves `names.dmp` and `nodes.dmp` to the `taxonomy` directory. The fasta files are added to the Kraken2 library using `kraken2-build --add-to-library`. Once all the files have been added to the library, the database is built using `kraken2-build --build`, with the number of CPUs set to 24 using the `cpus 24` directive.
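Two parts of this step lend themselves to a short Python sketch: choosing the next free tax ID and assembling the `kraken2-build` calls. The function names, the database name and the file names are hypothetical. Reading the largest tax ID from `names.dmp` and passing `--threads` to the build command are both assumptions, since the text above only states that the Nextflow `cpus 24` directive is used. The commands are built as argument lists and not executed.

```python
import io


def next_free_taxid(names_dmp):
    """Return the largest tax ID found in a names.dmp handle, plus one."""
    largest = 0
    for line in names_dmp:
        first_field = line.split("|")[0].strip()
        if first_field.isdigit():
            largest = max(largest, int(first_field))
    return largest + 1


def kraken2_build_commands(fasta_files, db_name, threads):
    """Build one --add-to-library call per fasta, then a single --build call."""
    commands = [
        ["kraken2-build", "--add-to-library", fasta, "--db", db_name]
        for fasta in fasta_files
    ]
    commands.append(
        ["kraken2-build", "--build", "--db", db_name, "--threads", str(threads)]
    )
    return commands


# Illustrative rows in names.dmp layout.
sample = io.StringIO(
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "1773\t|\tMycobacterium tuberculosis\t|\t\t|\tscientific name\t|\n"
)
print(next_free_taxid(sample))  # 1774

commands = kraken2_build_commands(
    ["1773_sampleA.fasta", "1765_sampleA.fasta"], "exampleDb", 24
)
for command in commands:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` on a machine with Kraken2 installed. Keeping the library additions separate from the final build reflects the order described above: every file is added before the build starts.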