bit_kmers function will panic at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. #58

Open
yiolino opened this issue Mar 29, 2022 · 7 comments

Comments

@yiolino

yiolino commented Mar 29, 2022

Thank you for the great software.

I want to use the bit_kmers function to count k-mers in a FASTQ file.
However, if the k-mer length is longer than 32 bases, it panics.

I think this is because the bit k-mer sequence is represented as a u64 (type BitKmerSeq = u64). Is there a way to count k-mers longer than 32 bp?

I am using a HashMap as the database for k-mer counts, but with FASTQ files as input the HashMap becomes too large.
So I would like to use bit_kmers to shrink it as much as possible. Is there an alternative worth considering, such as using a u128 type?

Regards,
tetsuro90
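For reference, the 32-base limit follows from the arithmetic of 2-bit packing: each base takes 2 bits, so a u64 has room for 64 / 2 = 32 bases and a u128 for 128 / 2 = 64. The sketch below is not needletail's code; `max_k` and `pack_u64` are hypothetical helpers that only illustrate where the capacity runs out.

```rust
/// Hypothetical helper: how many 2-bit bases fit in an integer of `bits` bits.
const fn max_k(bits: u32) -> u32 {
    bits / 2
}

/// Hypothetical helper: pack a slice of 2-bit codes (0..=3) into a u64.
/// Returns None when the slice needs more than 64 bits.
fn pack_u64(codes: &[u8]) -> Option<u64> {
    if codes.len() > max_k(u64::BITS) as usize {
        return None;
    }
    let mut packed = 0u64;
    for &c in codes {
        // Shift the previous bases up and append the new one in the low 2 bits.
        packed = (packed << 2) | u64::from(c & 0b11);
    }
    Some(packed)
}

fn main() {
    println!("u64 holds up to {} bases", max_k(u64::BITS));
    println!("u128 holds up to {} bases", max_k(u128::BITS));
    println!("32 bases fit: {}", pack_u64(&[3u8; 32]).is_some());
    println!("33 bases fit: {}", pack_u64(&[3u8; 33]).is_some());
}
```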

@yiolino yiolino changed the title bit_kmers function will panick at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. bit_kmers function will panic at 'attempt to multiply with overflow', if k-mer length is longer than 32 bp. Mar 29, 2022
@Keats
Contributor

Keats commented Mar 29, 2022

Hi,

We are probably going to revamp the bit kmers we have, as we just built a pretty large project using needletail and ended up re-implementing the bit encoding to be more flexible.
You're right that it would need u128 instead of u64 for storing kmers > 31bp. For now I would recommend writing your own 2-bit encoding function that returns a u128, and not using the built-in bit kmers iterator.
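A minimal sketch of that workaround, under stated assumptions: `encode_base` and `count_kmers` are hypothetical names, the A=0, C=1, G=2, T=3 assignment is arbitrary, and nothing here is part of needletail's API. It rolls a 2-bit packed u128 window over a raw sequence and counts each k-mer in a HashMap, so k can go up to 64.

```rust
use std::collections::HashMap;

/// Map a nucleotide to a 2-bit code; returns None for N or any other byte.
/// The A=0, C=1, G=2, T=3 assignment is an arbitrary choice for this sketch.
fn encode_base(b: u8) -> Option<u128> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Count every k-mer (1 <= k <= 64) in `seq`, keyed by its 2-bit packed u128.
/// Windows that contain a non-ACGT byte are skipped.
fn count_kmers(seq: &[u8], k: usize, counts: &mut HashMap<u128, u32>) {
    assert!((1..=64).contains(&k), "a u128 holds at most 64 bases");
    // Mask that keeps only the lowest 2*k bits; k == 64 needs all 128 bits.
    let mask = if k == 64 { u128::MAX } else { (1u128 << (2 * k)) - 1 };
    let mut kmer = 0u128;
    let mut valid = 0usize; // consecutive valid bases seen since the last reset
    for &b in seq {
        match encode_base(b) {
            Some(code) => {
                kmer = ((kmer << 2) | code) & mask;
                valid += 1;
                if valid >= k {
                    *counts.entry(kmer).or_insert(0) += 1;
                }
            }
            None => {
                // Restart the window after an ambiguous base.
                kmer = 0;
                valid = 0;
            }
        }
    }
}

fn main() {
    let mut counts = HashMap::new();
    count_kmers(b"ACGTACGTNACGT", 4, &mut counts);
    for (kmer, n) in &counts {
        println!("{kmer:#x}: {n}");
    }
}
```

The sketch counts forward-strand k-mers only; canonical k-mers (the smaller of a k-mer and its reverse complement) would need an extra step. The sequence bytes can come from any FASTQ reader.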

@yiolino
Author

yiolino commented Mar 29, 2022

@Keats
Thank you for your reply.

I understand, and I'm looking forward to the re-implemented bit encoding!

Regards,

@natir

natir commented Mar 29, 2022

You can check out the bit encoding in kmers: https://github.com/COMBINE-lab/kmers

@Keats
Contributor

Keats commented Mar 29, 2022

I will have to take a look at that, @natir!
In our program we have some stuff that wouldn't make sense in a public library (e.g. we also do some 3-bit encoding to encode ATCG$). I'm leaning toward adding some basic encoding utilities to needletail and letting people do whatever they need on the raw sequence, rather than having a built-in opinionated iterator.
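To illustrate the 3-bit idea in general terms: five symbols (A, C, G, T, $) need 3 bits each, so a u128 has room for 42 of them (126 bits). The sketch below is purely hypothetical; the code assignment and the names `encode_symbol` and `pack_3bit` are assumptions, not the encoding used in the program mentioned above.

```rust
/// Hypothetical 3-bit codes for the five-symbol alphabet $, A, C, G, T.
/// The numbering is an assumption made for this sketch only.
fn encode_symbol(b: u8) -> Option<u128> {
    match b {
        b'$' => Some(0),
        b'A' => Some(1),
        b'C' => Some(2),
        b'G' => Some(3),
        b'T' => Some(4),
        _ => None,
    }
}

/// Pack up to 42 symbols (3 bits each, 126 bits total) into a u128.
/// Returns None for longer inputs or for bytes outside the alphabet.
fn pack_3bit(seq: &[u8]) -> Option<u128> {
    if seq.len() > 42 {
        return None;
    }
    let mut packed = 0u128;
    for &b in seq {
        packed = (packed << 3) | encode_symbol(b)?;
    }
    Some(packed)
}

fn main() {
    println!("{:?}", pack_3bit(b"ACGT$"));
}
```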

@yiolino
Author

yiolino commented Mar 29, 2022

@natir
I'll use that. Thank you!

@natir

natir commented Mar 29, 2022

Sure @Keats, my message was more for @tetsuro90 than for you; maybe kmers matches @tetsuro90's requirements.

@Keats
Contributor

Keats commented Mar 29, 2022

Yeah, I understood. I just meant that I need to look at kmers before making changes to needletail, to see if we can consolidate somehow.
