This repository contains code for testing mykrobe species and lineage calls, and results of the testing. It is intended for mykrobe developers, for testing mykrobe species/lineage calls and tracking the results.
There are two directories:
Python/
: this contains the code, and a Singularity definition file that makes a container with the code plus the dependencies.Analysis/
: contains results of testing mykrobe species and lineage calls.
For results, please see the readme in the Analysis/
directory.
This repository has a main script called mlt
(acronym for "mykrobe lineage
test", yes we are testing species calls as well but
"mykrobe lineage species test" seemed like a bad name!).
The easiest way is to build a singularity container.
cd Python
sudo singularity build mlt Singularity.def
Run like that, singularity will make a container file called mlt
.
You can just treat it as an normal executable, no need to run
singularity exec mlt
unless you want to.
If you want to run locally, then you will need these in your $PATH
:
enaDataGet
, which is from enaBrowserTools (have a look inSingularity.def
)mykrobe
.- (also
fastaq
andncbi-genome-download
are required, but are installed when you installmlt
because they are in the requirements file.)
Then run pip install .
from the Python/
directory. The required python
packages will be installed (they are in requirements.txt
).
Alternatively, you could not do pip install, and instead do this:
PYTHONPATH=/path_to/mykrobe-lineage-test/Python /path_to/mykrobe-lineage-test/Python/mlt/__main__.py
That command is equivalent to running the script mlt
.
In short, the process is:
- Put your sample info in a TSV file.
- Download reads using
mlt download_data
- Run mykrobe on all samples using
mlt run_mykrobe
- Make a summary of the results using
mlt summary
.
All the commands need a TSV of sample information. The format is like this (this is made up data!):
accession source species lineage
SRR12345678 ena Mycobacterium_tuberculosis lineage1.2.3
GCF_1234567 genbank Mycobacterium_tuberculosis lineage2.3.4
XY123456 refseq Mycobacterium_tuberculosis lineage3.4
You must have columns accession
, source
, species
and lineage
. They
can be in any order (and any extra columns are ignored). The lineage can
be "NA" if there is no lineage call and you just want to test the species
call.
The source must be ena
, genbank
, or refseq
, and the accession
column
should have the corresponding ENA run ID, or genbank/refseq genome accession.
Since reads are needed for mykrobe, reads are simulated from genomes using
fastaq to_perfect_reads
, making perfect reads (ie no snp/indel errors)
of length 75bp, fragment size 200, and depth 20X.
With a TSV file of samples samples.tsv
in the above format, run:
mlt download_data --cpus 3 samples.tsv Reads
That example downloads 3 samples in parallel. It makes a directory called
Reads
containing the downloaded data. It will (well, should) not crash
on failed downloads, but carry on and get all the samples it can. Check
stderr to see what happened.
You can rerun on an existing directory and it will only try to get data that is missing and skip the samples that are already downloaded. This also means you can do hacks like different sample TSV files run against the same directory of a superset of reads, if you're feeling fancy.
Assuming you have a directory of downloaded reads from mlt download_data
called Reads/
:
mlt run_mykrobe --cpus 10 samples.tsv Reads Results
That will run 10 samples in parallel. It makes a new directory (if it
doesn't exit already) called Results
. As for download_data
, you can
rerun against the same directory and it will only run samples that do not
already have a mykrobe json file of results. It will ignore samples in the TSV
with no reads in Reads/
. It's up to you to use the right TSV file/Reads
directory/results directory - there is no sanity checking. This does allow
for more hacking and testing of samples.
IMPORTANT: the first time a sample is run in Results/
, there is no
skeletons file. If you ask for more than one CPU, the first sample will be
run on its own, making the skeletons file. Then the remaining samples are
run using multiprocessing, since they can then all use the skeletons file,
instead of all trying to make one at the same time and crashing.
There is an option --panels_dir
, which will use that option with mykrobe,
so that you can override the default panels directory and use your own.
You probably want this, since the point here is to test species/lineage calls.
It is not recommended to change the panel and then use an existing results
directory because the skeletons file that is already might be used!
Assuming you have a samples TSV file samples.tsv
, a directory of reads
called Reads/
, and a directory of mykrobe runs called Results/
:
mlt summary samples.tsv Reads Results summary.tsv
That makes a new TSV file called summary.tsv
. It is the same as samples.tsv
,
but with added columns:
called_species
andcalled_lineage
. These are the calls made by mykrobe.correct
: this istrue|false
, showing if the both the called species and lineage were correct. If the expected lineage is "NA", then the true/false call only depends on the species.
Now be good and record the results in the Analysis/
directory and push
to github.