Releases: rhysnewell/Lorikeet
v0.8.2
What's Changed
- Doco fixes by @wwood in #54
- Lower mem usage by @rhysnewell in #55
- Dev by @rhysnewell in #56
Full Changelog: v0.8.1...v0.8.2
v0.8.1
What's Changed
- Merge 'Dev' in to "master" by @rhysnewell in #49
- Compilation fix by @wwood in #50
- Fix VCF annotations by @rhysnewell in #51
- Dev to main by @rhysnewell in #53
Full Changelog: v0.8.0...v0.8.1
v0.8.0
What's Changed
- Catching dev up to master branch by @rhysnewell in #46
- cli: Allow --profile very-fast. by @wwood in #47
Full Changelog: v0.7.3...v0.8.0
v0.7.3
fix: release workflow copying old minimap2 header binary
v0.7.2
fix: fst calculations are now ploidy agnostic
v0.7.2rc1
fix: new releases are tagged correctly
Development build: master
pre-release_master Update pre-release-lorikeet.yml
v0.6.0rc2
Version 0.6.0 - release candidate 2
This release candidate reintroduces consensus genome calling and strain genome discovery.
It also updates the linkage algorithm from previous versions, which now uses a more sophisticated graph-based approach for linking clusters.
v0.6.0rc1
v0.6.0 Release Candidate 1
This release introduces the completely overhauled variant calling setup for Lorikeet. Lorikeet no longer relies on threshold-based variant calling, and instead takes a more sophisticated approach utilising local re-assembly of active regions. This release includes a reimplementation of the GATK HaplotypeCaller algorithm in Rust, so hopefully it is faster. At the very least, it will be easier to pass multiple genomes and samples into the algorithm at once to generate called variants.
Currently, the strain resolving part of lorikeet is hidden and will be re-enabled ASAP.
The HaplotypeCaller algorithm involves breaking up genomes into potential active regions and then performing local re-assembly with the reads that mapped to those locations. The local assembly is then searched for potential haplotypes using a number of techniques, and candidate haplotypes are assigned likelihoods using a pair HMM that re-assigns reads to the haplotypes. Ultimately, the HaplotypeCaller algorithm produces sets of high-confidence variants with depths across samples.
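As a rough illustration of the first step only (finding candidate active regions), here is a minimal sketch. It assumes each reference position already has an "activity" score between 0 and 1, and simply collects runs of positions at or above a threshold, pads them, and merges runs that touch. The names `Region`, `active_regions`, `threshold` and `padding` are made up for this example and are not Lorikeet's or GATK's actual API; the real algorithm smooths the activity profile and applies size limits that are left out here.

```rust
/// Half-open interval [start, end) on a single contig.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Region {
    pub start: usize,
    pub end: usize,
}

/// Collect maximal runs of positions whose activity score is at or above
/// `threshold`, extend each run by `padding` bases on both sides, and merge
/// runs that overlap or touch after padding.
pub fn active_regions(activity: &[f64], threshold: f64, padding: usize) -> Vec<Region> {
    let mut regions: Vec<Region> = Vec::new();
    let mut run_start: Option<usize> = None;

    for (i, &score) in activity.iter().enumerate() {
        match (score >= threshold, run_start) {
            // A new active run begins here.
            (true, None) => run_start = Some(i),
            // The current run ended at the previous position.
            (false, Some(start)) => {
                push_padded(&mut regions, start, i, padding, activity.len());
                run_start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the contig.
    if let Some(start) = run_start {
        push_padded(&mut regions, start, activity.len(), padding, activity.len());
    }
    regions
}

/// Pad a run, clamp it to the contig, and merge it into the previous
/// region if the two overlap or touch.
fn push_padded(regions: &mut Vec<Region>, start: usize, end: usize, padding: usize, len: usize) {
    let start = start.saturating_sub(padding);
    let end = (end + padding).min(len);
    if let Some(last) = regions.last_mut() {
        if start <= last.end {
            last.end = last.end.max(end);
            return;
        }
    }
    regions.push(Region { start, end });
}
```

Each returned region would then be handed to the local re-assembly and haplotype likelihood steps described above, which this sketch does not attempt to cover.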
The HaplotypeCaller code was re-implemented in Rust in order to potentially speed up the variant calling process, make it easier to pass multiple genomes and samples into the algorithm, and hopefully make use of some of the code base in future projects and in the strain resolving pipeline.
The code requires benchmarking, but early indications from tests and small datasets put the Lorikeet variant calling speed on par with the Java implementation. I believe the real speed-up will appear when multiple genomes are supplied to Lorikeet, as they will be run in parallel seamlessly.
Additionally, a number of code clean-ups should be implemented as soon as possible, primarily around the `BirdToolRead`, `SequencesForKmers`, and `Kmers` data structures. Currently, accessing the bytes within a read requires cloning the data, with no option to create a reference pointing to the data (without the added complexity of decoding every encoded base). This means `SequencesForKmers` and `Kmers` each hold a clone of the read bases, which is very costly. I believe that by adding a `bases` field to `BirdToolRead` that is updated when the underlying `Read` is changed, we can change those clones to be references and wrangle with the lifetimes to significantly speed up the graph building stage of the algorithm.
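To make the idea concrete, here is a minimal sketch of the proposed change. The struct layouts, the base encoding, and the `Kmer`, `kmers`, `set_read` and `as_bytes` names are all assumptions made for illustration; only `BirdToolRead`, `Read` and the `bases` field come from the notes above, and the real Lorikeet types are more involved.

```rust
/// Hypothetical stand-in for the underlying encoded read record.
pub struct Read {
    /// One code per base: 0=A, 1=C, 2=G, 3=T, 4=anything else (N).
    encoded: Vec<u8>,
}

impl Read {
    pub fn new(seq: &[u8]) -> Self {
        let encoded = seq
            .iter()
            .map(|b| match b {
                b'A' => 0,
                b'C' => 1,
                b'G' => 2,
                b'T' => 3,
                _ => 4,
            })
            .collect();
        Read { encoded }
    }

    /// Decoding allocates a fresh buffer every time it is called.
    fn decode(&self) -> Vec<u8> {
        self.encoded.iter().map(|&c| b"ACGTN"[c as usize]).collect()
    }
}

/// Wrapper that caches the decoded bases next to the encoded read.
pub struct BirdToolRead {
    read: Read,
    bases: Vec<u8>,
}

impl BirdToolRead {
    pub fn new(read: Read) -> Self {
        let bases = read.decode();
        BirdToolRead { read, bases }
    }

    /// Swap the underlying read and refresh the cache so the two stay in sync.
    pub fn set_read(&mut self, read: Read) {
        self.bases = read.decode();
        self.read = read;
    }

    /// Borrow the decoded bases without cloning them.
    pub fn bases(&self) -> &[u8] {
        &self.bases
    }

    pub fn len(&self) -> usize {
        self.read.encoded.len()
    }
}

/// A k-mer that borrows its bases from the read instead of owning a copy.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub struct Kmer<'a> {
    bases: &'a [u8],
}

impl<'a> Kmer<'a> {
    pub fn as_bytes(&self) -> &'a [u8] {
        self.bases
    }
}

/// All k-mers of a read as zero-copy views into its cached bases.
pub fn kmers<'a>(read: &'a BirdToolRead, k: usize) -> Vec<Kmer<'a>> {
    if k == 0 {
        return Vec::new(); // `windows(0)` would panic
    }
    read.bases().windows(k).map(|w| Kmer { bases: w }).collect()
}
```

The trade-off is the one mentioned above: every `Kmer<'a>` now carries a lifetime tied to its read, so the read must outlive the graph-building structures and cannot be mutated while any k-mer borrows from it.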
TODO:
- Reimplement strain calling + abundance estimation
- Reimplement consensus calling
- Update README
- Update Workflow image
- Various code improvements
Revised genotyping
So, in keeping with tradition this release brings a bunch of changes to Lorikeet that make it pretty distant from where it was a month ago. I know only a few people are trying to keep track of all changes that keep being made here, and I'm sorry things are so stochastic. I think the words of my supervisor put it best when I told him about one of the changes I had made... "Ah, so freebayes is out this week, huh?"
Yeah, freebayes is out. Cancelled. For generating `illegal instructions` and `segmentation fault` on GPU nodes. I ain't fixing that, I'll just make my own variant caller.
Lorikeet's new best friends are UMAP and HDBSCAN. The curse of dimensionality hexed me pretty good during benchmarking, so UMAP is being used for dimensionality reduction. I chose it over PCA since it seems to discriminate groupings of variants way better. Also, since we now have to use a Python library for UMAP, might as well upgrade fuzzy DBSCAN to its better version: HDBSCAN.
Changes:
- Freebayes. OUT.
- Fuzzy DBSCAN. OUT.
- UMAP. IN.
- HDBSCAN. IN.
- Evolve now reports per sample dNdS and coverage values for each ORF