"Too Many Open Files" Error When Using Sawfish for Joint Calling #9

Closed
Zoeyoungxy opened this issue Dec 13, 2024 · 3 comments

@Zoeyoungxy

Hello!
I encountered an issue while using sawfish v0.12.7 to perform joint calling on 140 HiFi samples with ~30X sequencing depth. Below is the error message I received:
[2024-12-12][10:49:27][sawfish][INFO] Merging SV haplotypes across samples
[2024-12-12][13:33:18][sawfish][INFO] Finished merging SV haplotypes across samples
[E::hts_idx_load3] Could not load local index file '/home/align/sample54/sample54.pbmm2.sorted.bam.bai': Too many open files
thread 'main' panicked at src/worker_thread_data.rs:17:69:
called `Result::unwrap()` on an `Err` value: BamInvalidIndex { target: "/home/align/sample54/sample54.pbmm2.sorted.bam" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
I have double-checked that the BAM and BAI files exist and seem to be in good condition.
From the user guide, I noticed that sawfish has been tested successfully on merging data for 47 HPRC samples, but it mentions challenges with larger datasets. Could this issue be due to the large number of samples I am processing? If so, do you have any recommendations or strategies to address this?
Any suggestions or workarounds to resolve this would be greatly appreciated!

Best wishes

@ctsa
Member

ctsa commented Dec 13, 2024

Thanks for reporting this.

In general, we haven't written sawfish to scale very well beyond pedigree-like sample counts at this point. The joint-call step's runtime scales non-linearly with sample count, so 140 samples may be challenging from the runtime perspective alone.

Given this approach to scalability, the current scheme does not scale particularly well with file handles either: it will open n_threads * n_samples BAM file handles (plus miscellaneous others).
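
To put that in concrete terms, here is a rough back-of-envelope sketch (not sawfish code; the thread count below is just an assumed placeholder) of why 140 samples can exceed a typical default soft limit of 1024 open files:

```rust
// Hypothetical estimate only: n_threads is a placeholder for however many
// worker threads the run uses; 140 is the sample count from this issue.
fn main() {
    let n_threads: u64 = 32;
    let n_samples: u64 = 140;
    let bam_handles = n_threads * n_samples;
    // ~4480 BAM handles here, plus miscellaneous others (index files,
    // output, logs), which is far above a common default soft limit of 1024.
    println!("approx. open BAM file handles: {bam_handles}");
}
```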

There are many ways we could improve this scalability. It isn't yet clear where this will land among our upcoming feature priorities, but it is not being worked on immediately. As a quick workaround, if you'd still like to try this larger joint-sample analysis, I'd suggest raising the open-file limit before running sawfish with something like this:

ulimit -n 100000

I'll check and see if I can apply a similar change programmatically at sawfish startup to help temporarily work around the high file-handle usage.
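
For illustration only, here is a minimal sketch of what such a startup change could look like, assuming the `libc` crate as a dependency. This is not sawfish's actual implementation, and note that an unprivileged process can only raise its soft limit up to the existing hard limit:

```rust
// Sketch only, not sawfish code: raise the process's soft open-file limit
// to its hard limit at startup. Assumes the `libc` crate is available.
fn raise_open_file_limit() -> std::io::Result<()> {
    unsafe {
        let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
        if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
        // Without elevated privileges, the soft limit can only be raised
        // as far as the current hard limit.
        lim.rlim_cur = lim.rlim_max;
        if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}
```

A call like this would need to run once early in startup, before any per-sample BAM readers are opened.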

@ctsa
Member

ctsa commented Dec 13, 2024

I went ahead and included an open-file-limit modification in the latest minor update here:

https://github.com/PacificBiosciences/sawfish/releases/tag/v0.12.8

...this will only help if your system's hard limit already allows the limit to go higher, but in that case the setting applied directly in sawfish means you won't need to run a separate ulimit command.

@Zoeyoungxy
Author

Thanks for your patience. The guidance really resolved my problem.

-Zoey

@ctsa closed this as completed Dec 18, 2024