v0.4.0
Version 0.4.0
A move from 0.3.x to 0.4.0 is not done lightly. Version 0.4.0 marks a major milestone in the development of lorikeet, and with it come many feature updates that either polish the mechanics of previous releases or are brand new features that I hope users will find useful in understanding what lorikeet is doing.
Major changes:
SNP calling: ✨
- Lorikeet now has an inbuilt SNP calling algorithm that is paired with freebayes to help extract SNPs for each input sample and help with the guided variant calling.
SPEED: 🏃 💨
One of the guiding principles I had in mind when developing lorikeet was speed. Speed is a partial inspiration behind the name "Lorikeet". Lorikeets are strikingly fast birds that tend to fly in groups, in much the same way that Lorikeet "flies" in parallel threads. This update reaches what I think is the optimal balance between speed and memory restrictions.
- You can now specify how many genomes to run in parallel.
- Contigs for each genome now run in parallel.
- Multiple iterators have been optimized to better utilize the capabilities of rayon.
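The two levels of parallelism described above (a user-set number of genomes at once, with contigs inside each genome also running in parallel) can be sketched roughly as follows. This is only an illustration in Python using thread pools; Lorikeet itself relies on rayon, and the function names `process_contig`, `process_genome` and `process_all` as well as the parameters `parallel_genomes` and `contig_threads` are hypothetical, not Lorikeet options.

```python
from concurrent.futures import ThreadPoolExecutor


def process_contig(genome, contig):
    # Placeholder for the real per-contig work (hypothetical).
    return f"{genome}:{contig}"


def process_genome(genome, contigs, contig_threads):
    # Inner level: the contigs of a single genome are handled in parallel.
    with ThreadPoolExecutor(max_workers=contig_threads) as pool:
        # pool.map keeps results in the same order as the input contigs.
        return list(pool.map(lambda c: process_contig(genome, c), contigs))


def process_all(genomes, parallel_genomes, contig_threads):
    # Outer level: at most `parallel_genomes` genomes are in flight at once.
    with ThreadPoolExecutor(max_workers=parallel_genomes) as pool:
        futures = {
            name: pool.submit(process_genome, name, contigs, contig_threads)
            for name, contigs in genomes.items()
        }
        return {name: future.result() for name, future in futures.items()}


# Hypothetical input: genome name -> list of contig names.
genomes = {
    "genome_a": ["contig_1", "contig_2"],
    "genome_b": ["contig_1"],
}
results = process_all(genomes, parallel_genomes=2, contig_threads=4)
```

Capping the outer pool is what lets a user trade speed against the amount of data held in memory at any one time, assuming each genome in flight keeps its own working data.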
Progress: 🔢 👀
No longer will you be bombarded by a ridiculous amount of info messages that won't make much sense to anyone but me. Thanks to indicatif, Lorikeet now has a bunch of fancy progress bars with associated ETA timers which - albeit sometimes inaccurately - provide the user with a better understanding of what is happening under the hood for each sample and each reference in their current run.
Additionally, if a run crashes before completion for whatever reason, Lorikeet will now pick up from specific checkpoints and avoid rerunning entire analyses for specific genomes. This can be overridden with the --force option.
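The checkpoint behaviour can be pictured with a small sketch. This is not Lorikeet's implementation: the JSON checkpoint file, the function `run_with_checkpoints` and its `force` parameter are assumptions made purely for illustration of resuming a run and of overriding the resume.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoints(genomes, analyse, checkpoint_path, force=False):
    """Run `analyse` on each genome, skipping genomes already checkpointed.

    Returns the list of genomes that were actually processed in this call.
    """
    path = Path(checkpoint_path)
    done = set()
    # Reuse earlier progress unless the caller forces a full rerun.
    if path.exists() and not force:
        done = set(json.loads(path.read_text()))

    processed = []
    for genome in genomes:
        if genome in done:
            continue  # finished in a previous run
        analyse(genome)
        processed.append(genome)
        done.add(genome)
        # Record progress after every genome so a crash loses little work.
        path.write_text(json.dumps(sorted(done)))
    return processed


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    analyse = lambda genome: None  # stand-in for the real analysis

    first = run_with_checkpoints(["genome_a", "genome_b"], analyse, checkpoint)
    resumed = run_with_checkpoints(
        ["genome_a", "genome_b", "genome_c"], analyse, checkpoint
    )
    forced = run_with_checkpoints(
        ["genome_a", "genome_b"], analyse, checkpoint, force=True
    )
```

In this sketch the second call only processes `genome_c`, while the forced call ignores the checkpoint and processes everything again.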
Outputs: 👽
All major modes now output an additional file that helps show how much a specific reference differs between samples. This adjacency matrix tells the user how many variants are shared between samples for a specific reference. It provides output similar to the trees that can be generated by taking the consensus genomes produced by polish and passing them to a tool like parsnp.
Speaking of polish, a bug has been fixed which prevented the VCF file from being output for any mode other than genotype.
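To make the adjacency matrix mentioned above more concrete, here is a minimal sketch of counting shared variants between every pair of samples for one reference. The variant representation (contig, position, alternate allele), the function `shared_variant_matrix` and the example data are all assumptions; the layout of the file Lorikeet writes, and exactly how it decides two variants are shared, may differ.

```python
from itertools import combinations


def shared_variant_matrix(variants_by_sample):
    """Return (sample_names, matrix); matrix[i][j] counts variants shared
    by sample i and sample j. The diagonal holds each sample's own count."""
    names = sorted(variants_by_sample)
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]

    for name in names:
        matrix[index[name]][index[name]] = len(variants_by_sample[name])

    for a, b in combinations(names, 2):
        shared = len(variants_by_sample[a] & variants_by_sample[b])
        matrix[index[a]][index[b]] = shared
        matrix[index[b]][index[a]] = shared  # the matrix is symmetric
    return names, matrix


# Hypothetical variant calls for one reference genome.
example = {
    "sample_1": {("contig_1", 101, "A"), ("contig_1", 250, "T"), ("contig_2", 17, "G")},
    "sample_2": {("contig_1", 101, "A"), ("contig_2", 17, "G")},
    "sample_3": {("contig_1", 250, "T")},
}
names, matrix = shared_variant_matrix(example)
for name, row in zip(names, matrix):
    print(name, row)
```

Samples that share many variants sit close together, which is why such a matrix gives a picture comparable to a tree built from consensus genomes.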
Genotyping: 🐀 🐁 🐩 🐕
The genotyping algorithm has seen a bunch of changes. Not all of them will be listed here as there are quite a lot.
- DBSCAN now updates its parameters for each reference genome based on whether or not the supplied parameters generate clusters that make sense, i.e. not every variant can be a cluster by itself, and not all variants can be in the same cluster (usually).
- The read phasing linkage algorithm now runs after DBSCAN, so DBSCAN now seeds the linkage algorithm. This provides much the same results as before but at much faster speeds.
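The idea of adjusting DBSCAN parameters until the clusters "make sense" can be sketched as below. This is a toy version under stated assumptions, not Lorikeet's code: the pure-Python `dbscan`, the `adaptive_dbscan` retry loop, the multiplicative `step`, and the example points (one made-up value per variant) are all hypothetical. The two rejection rules mirror the ones in the list above: reject a result where every variant is left on its own, and reject one where all variants land in a single cluster.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, NOISE for unclustered."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # core point: keep growing the cluster
        cluster += 1
    return labels


def adaptive_dbscan(points, eps, min_pts, step=0.5, max_rounds=10):
    """Retry DBSCAN, widening or shrinking eps until the result is plausible."""
    labels = []
    for _ in range(max_rounds):
        labels = dbscan(points, eps, min_pts)
        clusters = {label for label in labels if label != NOISE}
        if not clusters:
            eps *= 1 + step  # every variant on its own: loosen
        elif len(clusters) == 1 and NOISE not in labels:
            eps *= 1 - step  # everything in one cluster: tighten
        else:
            break
    return eps, labels


# Hypothetical per-variant values forming two obvious groups.
points = [(0.10,), (0.12,), (0.11,), (0.50,), (0.52,), (0.51,)]
eps, labels = adaptive_dbscan(points, eps=1.0, min_pts=2)
```

Starting from a deliberately loose `eps` of 1.0, the loop tightens the radius twice before the two groups separate.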
In addition, there have been a BUNCH of bug fixes.