The goal of this project is writing an own implementation of a taxonomic assignment task. First, a real NCBI taxonomic reference (nodes.dmp) is transformed into a tree. Afterwords, a file is read containing reads of genomes (a set of taxes compatible to a new genomic sequence). An algorithm is then implemented to find the best taxonomic assignement for the given genomic sequence based on maximum F-measure.
- Extract folders and navigate to this folder in CMD.
- Download nodes.dmp (Taxonomy dataset) from ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
- Make sure 'nodes.dmp' (taxonomy dataset) and 'sample.inp' (genome reads) are in the same directory!
- Execute: 'python full_pipeline.py'
--> To change input (location or name), run 'python full_pipeline.py -h' for instructions.