improving parallelism when mapping very long sequences #491
Comments
Is
|
@tseemann It's not, but the indexing wasn't the problem. That takes ten minutes or so at most.

It looks like there is an integer truncation in the option parsing that limits the batch size to 2^31. The option parser returns a 64-bit integer, but the result is cast to a plain int. After looking at this again, I'm not sure how much utility there is in fixing this. I was unable to allocate enough memory at 2000M (which doesn't overflow and so can be specified via

I was able to improve throughput slightly by sorting my sequences from largest to smallest. This results in most of the sequences in a given batch having approximately the same runtime, so on average we at least achieve parallelism close to the number of sequences in the batch. This could be in the 1s and 10s for really long sequences, but it's better than 1, which tends to be the result when the order is randomized. When the job starts to run into the smaller sequences the throughput goes up significantly, and so things complete pretty efficiently.

I would think that much of the alignment of long sequences is basically serial. To me this suggests that it wouldn't be hard to make alignment of long sequences parallel, by completing the time-consuming steps like traceback/CIGAR generation in parallel. This would probably require restructuring the alignment algorithm. With whole-genome alignment becoming more important, it's worth thinking about how to do this. Perhaps a different algorithm is needed for such applications. |
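The longest-first ordering described in this comment can be sketched in a few lines of Python. This is only an illustration of the workaround, not the script the commenter used: the helper names `read_fasta`, `sort_longest_first` and `write_fasta` are hypothetical, and the sketch assumes plain FASTA input that fits in memory.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sort_longest_first(records: Iterable[Record]) -> List[Record]:
    """Order records from longest to shortest sequence."""
    return sorted(records, key=lambda rec: len(rec[1]), reverse=True)


def write_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialize records as FASTA text, wrapping sequences at `width`."""
    out: List[str] = []
    for header, seq in records:
        out.append(">" + header)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)
```

Feeding the sorted file to the aligner means each batch holds sequences of similar length, so the threads in a batch tend to finish at about the same time.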
Ah yes, this that it got from here: The struct only uses an int. should be
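To make the truncation concrete, here is a small Python sketch of what happens when a 64-bit value is stored in a 32-bit signed int. It is a model of the arithmetic only, not minimap2's code: `parse_size` and `as_int32` are hypothetical helpers, the suffix multipliers are assumed to be decimal (M = 10^6), and the wrap-around mimics the usual two's-complement behaviour of a C cast.

```python
# Assumed decimal multipliers for size suffixes (an assumption of this sketch).
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Parse strings such as '8000M' into a plain (unbounded) integer."""
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        return int(float(text[:-1]) * SUFFIXES[text[-1].upper()])
    return int(float(text))


def as_int32(value: int) -> int:
    """Keep the low 32 bits and reinterpret them as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


if __name__ == "__main__":
    for arg in ("2000M", "8000M"):
        parsed = parse_size(arg)
        print(arg, parsed, as_int32(parsed))
```

Under these assumptions 2000M (2,000,000,000) is below 2^31 and survives unchanged, while 8000M (8,000,000,000) wraps to a negative value once it is narrowed to 32 bits.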
|
I guess it should be. The memory requirements for extremely long reads are prohibitive. This could be mitigated by using temporary files or memory mapping to hold parts of the alignment that aren't currently being worked on. @lh3 do we really need to store the entire alignment structure in memory when doing operations like traceback? |
@tseemann I am aware of the integer overflow issue. However, changing the integer in

@ekg minimap2 keeps all intermediate data (including the full traceback matrix) in memory and doesn't produce temporary files unless |
I'm mapping some long sequences (pseudo-unitigs from an assembly graph) to the human reference genome. It seems that some sequences are taking a very long time. Almost all of my runtime (>80%) is spent running single-threaded while waiting for minimap2 to finish mapping the longest contigs in each batch.
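The effect described above can be illustrated with a toy model. This is a simplification under stated assumptions, not a description of minimap2's scheduler: each sequence is assumed to be processed by a single thread, a batch is assumed to finish only when its slowest sequence finishes, and batches are formed by sequence count rather than by bases. The function `simulate` and its parameters are hypothetical.

```python
import heapq
from typing import Sequence, Tuple


def simulate(costs: Sequence[float], threads: int, batch_size: int) -> Tuple[float, float]:
    """Return (wall_time, thread_utilization) for a toy batch scheduler.

    costs      -- per-sequence run time, in arbitrary units
    threads    -- number of worker threads
    batch_size -- number of sequences per batch (a simplification)
    """
    wall = 0.0
    for start in range(0, len(costs), batch_size):
        batch = costs[start:start + batch_size]
        # Each worker is represented by the time at which it becomes free.
        workers = [0.0] * threads
        heapq.heapify(workers)
        for cost in batch:
            free_at = heapq.heappop(workers)  # earliest available worker
            heapq.heappush(workers, free_at + cost)
        # The batch ends when its last worker finishes.
        wall += max(workers)
    busy = float(sum(costs))
    utilization = busy / (wall * threads) if wall else 0.0
    return wall, utilization
```

With 4 threads, batches of 4, and costs `[100, 1, 1, 1, 100, 1, 1, 1]`, every batch waits on one long sequence, so three threads sit idle most of the time. Reordering the same costs as `[100, 100, 1, 1, 1, 1, 1, 1]` puts both long sequences in one batch and roughly halves the wall time in this model.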
I've attempted to increase the batch size to 8G by setting `-K 8000M`, but this seems to be making minimap2 "hang" from my perspective. After setting up its minimizer index, it has sat for 15 minutes without increasing its memory or using more than a single thread.

For reference, here's my command line:
Can minimap2 be convinced to pick up a larger set of reads in the minibatch (am I doing this with `-K`?), or just pick up each read and align it one at a time?