Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

kmc_tools simple intersect & union, not considering reverse complement? #227

Open
zhaotao1987 opened this issue Jan 13, 2024 · 3 comments
Open

Comments

@zhaotao1987
Copy link

Hello, in my project I would like to just calculate the kmer counts for one side of the reads, read_1.fq.gz, for example.
And then afterwards, I would like to compare kmer.dump across samples, I would like to calculate shared kmers relative to total unique kmers, to this end, considering reverse complement is important, but I found kmc_tools intersect does not consider reverse complemented kmers between samples? Or, I have to do kmer calculation for both reads for each sample, then compare, that is the only way?
Thanks very much.

@marekkokot
Copy link
Contributor

Hi,

I am not fully understanding.
In general, kmc_tools is unaware of strands (at least as far as I remember).
It just takes k-mers as they are in the kmc database.
Now, when you run kmc, you may count either:

  • k-mers, regardless of their reverse complement or
  • canonical k-mers, i.e., each k-mer is transformed to reverse complement, and the lower of these two (original and rev. comply) is chosen.
    By default, canonical k-mers are used—if you want the first option, use the -b switch for kmc run.
    If you want all the k-mers in the reverse form, it's not currently supported. We may consider adding this. For now, you could reverse all input reads and use this as a kmc input and run kmc with -b.
    Let me know if that is helpful, and maybe try to be more specific about what you want to achieve (maybe with some trivial example, a couple of samples, input files, sequences, and k-mers you want).

@zhaotao1987
Copy link
Author

Hi Marek,

Thank you for your prompt response; much appreciated. Great to learn that by default, KMC already considers reverse complement (RC).

I'm currently working on improvise something and would like to share my approach with you, seeking your input on its viability. I'm utilizing resequencing data from 300 samples, encompassing various species/subspecies within Malus. My goal is to explore the phylogenetic relationships among these samples based on kmer similarities, first to generate a kmer similarity matrix for all samples. The steps involve kmer counting for each sample and kmer comparison for every pair of samples ( kmc -> kmc_dump -> kmer_comparison.py ). To optimize efficiency in terms of time and space, I'm contemplating using only the R1 read for each sample.

So learned from what you said, if I used the -b mode, RC kmers from different samples might be treated as distinct kmers, thus introducing bias into the results. To clarify, for each comparison, I'm calculating shared (unique) kmers relative to total (unique) kmers. I'm pleased to learn that the software inherently considers both original and RC simultaneously by default. (oh, maybe still problematic... since the lower of these two was chosen , it may varied across samples.. ), I seems to me that kmc_tools simple intersection function does not consider RC?

Moving on to another aspect of my project, I have a question about the appropriate values for -ci and -cx when comparing shared kmers between samples. I think -ci should be set at 2 to keep all shared ones and discard error-prone kmers occurring only once. As for -cx, I've used 1000, but I'd appreciate your thoughts on this choice. The kmer size is 21.

Thank you once again for your assistance.
BTW, I am not using kmc_tools intersection and union for calculating the ratio, which generates unnecessary big files. The python script is what I used for this purpose.

@zhaotao1987
Copy link
Author

kmer_comparison.1.zip

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants