Make FastaDir thread safe #112
I have a work in progress that appears to be working well to eliminate these issues. The pysam and sqlite3 C libraries really do not like having the same file open multiple times. There's some race condition between when the file handle object gets destructed and when the file is next opened, so this only manifests when a large quantity of requests access the same file via pysam/htslib and sqlite3. I've added a fabgz file lock in seqrepo, plus explicit destructors that close and destruct open handles.
I have a different view of what's happening here, and getting to the bottom of that is essential for solving the issue.
By file handles, I assume you mean file descriptors. In what sense do you think that the fds are not thread safe? Threads that have separate fds open to the same file are thread safe (thanks to modern Unix/Linux and C libraries). There should be no contention at that level.
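A minimal illustration of this point, using plain Python file objects rather than seqrepo code (a sketch, not project code): when each thread opens its own descriptor to the same file, seeks and reads are fully independent and need no locking.

```python
import os
import tempfile
import threading

# Build a file of 1000 fixed-width 4-byte records: b"0000", b"0001", ...
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(b"".join(b"%04d" % i for i in range(1000)))

results = {}

def reader(tid, offsets):
    # Each thread opens its OWN descriptor; no sharing, no locks needed.
    with open(path, "rb") as f:
        out = []
        for off in offsets:
            f.seek(off * 4)
            out.append(f.read(4))
        results[tid] = out

threads = [threading.Thread(target=reader, args=(t, range(t, 1000, 4)))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread saw exactly the records it asked for.
for tid, out in results.items():
    assert out == [b"%04d" % i for i in range(tid, 1000, 4)]
```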
I now have a script that reliably demonstrates that the problem is with lru_cache in a threaded context, essentially as @theferrit32 conjectured. I'll post the tests later today, but wanted to at least share the results now. I assembled a list of 239980 unique NM accessions and wrote a script that lets me easily control the number of threads and the number of accessions. For some tests, I commented out lru_cache at fastadir.py:212 to disable fd caching in order to assess its benefit and any threading issues. Configurations tested:

- one thread, lru_cache enabled (current state)
- one thread, lru_cache disabled
- two threads, lru_cache enabled
- two threads, lru_cache disabled
- five threads, lru_cache disabled

The upshot is that caching is worth a lot (7418 seq/sec with caching vs. 232 seq/sec without). With two threads and lru_cache enabled, we get two errors because both threads run out of fds at essentially the same time.

My conclusion from all of this is that we must directly address thread-safe caching. To be clear, this isn't a problem with lru_cache per se, or with any inability to use the same fds in multiple threads, but rather that the lru_cache is per-thread, and that this leads us to exhaust fds. This issue is purely a problem with resource allocation. Options I see:
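One direction, sketched here with hypothetical names (this is not the fix that actually landed), is to replace the per-function cache with a single lock-protected LRU of open handles shared by all threads, so the fd budget is bounded process-wide:

```python
import threading
from collections import OrderedDict

class FdCache:
    """Process-wide bounded cache of open file objects.

    One lock protects the OrderedDict, so all threads share a single
    fd budget instead of each growing its own cache until fds run out.
    """

    def __init__(self, maxsize=128):
        self._maxsize = maxsize
        self._lock = threading.Lock()
        self._cache = OrderedDict()  # path -> open file object

    def get(self, path):
        with self._lock:
            if path in self._cache:
                self._cache.move_to_end(path)  # mark most recently used
                return self._cache[path]
            fh = open(path, "rb")  # stand-in for pysam.FastaFile(path)
            self._cache[path] = fh
            if len(self._cache) > self._maxsize:
                _, old = self._cache.popitem(last=False)  # evict LRU entry
                old.close()  # close eagerly instead of waiting for GC
            return fh
```

Note the sketch evicts eagerly; a production version would also need per-handle locking or refcounting so an evicted handle is not closed while another thread is mid-read.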
@reece thanks for adding those scripts; they make it easy to evaluate different conditions. I've made a couple of modifications and added another README (README2) here: https://github.com/theferrit32/biocommons.seqrepo/blob/kf/112-make-fastadir-thread-safe/misc/threading-tests/README2.md
On this point: there are two distinct file descriptor issues that have arisen under load testing. The performance penalty from disabling the lru_cache is also covered in that README.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
Closing as complete in https://github.com/biocommons/biocommons.seqrepo/releases/tag/0.6.6
Hi there! I saw that this was closed, but I have experienced a similar issue on 0.6.9. Anecdotally, it happens when using a multiprocessing Pool; here's the associated code. Do y'all know if this is a related issue? Let me know if I can provide more info.
@quinnwai I believe you are hitting the same issue as issue 1 listed above. Each task being executed on your multiprocessing Pool is constructing a new SeqRepo object, which opens new file descriptors without explicitly closing them. They only get closed implicitly when the objects are finally reclaimed by Python's runtime, which can be much later, so you eventually hit an OS error when the open-file limit is reached. You can use the `initializer` argument of `multiprocessing.Pool` to construct these objects once per worker process. Something like the modification below works for me; I just moved the SeqRepo/translator construction into a per-worker initializer.

```python
from biocommons.seqrepo import SeqRepo
from ga4gh.vrs.dataproxy import SeqRepoDataProxy
from ga4gh.vrs.extras.translator import AlleleTranslator
from datetime import datetime
import multiprocessing
import subprocess

# get vrs ids
def translate(gnomad_expr):
    # data_proxy/translator are created once per worker in worker_initializer,
    # not once per task, so each process holds a single set of fds:
    # data_proxy = SeqRepoDataProxy(SeqRepo(seqrepo_path))
    # translator = AlleleTranslator(data_proxy)
    allele = translator._from_gnomad(gnomad_expr, require_validation=False)
    if allele is not None:
        return (gnomad_expr, dict(allele))

def calculate_gnomad_expressions(input_vcf, alt=True):
    if alt:
        command = f"bcftools query -f '%CHROM-%POS-%REF-%ALT\n' {input_vcf}"
    else:
        command = f"bcftools query -f '%CHROM-%POS-%REF-%REF\n' {input_vcf}"
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE, universal_newlines=True)
    # Iterate over the output of bcftools and yield each gnomAD expression
    for line in process.stdout:
        yield line.strip()

def worker_initializer(seqrepo_path):
    global data_proxy
    global translator
    data_proxy = SeqRepoDataProxy(SeqRepo(seqrepo_path))
    translator = AlleleTranslator(data_proxy)

if __name__ == "__main__":
    input_vcf = "/Users/kferrite/dev/data/clinvar-20240305.vcf.gz"
    seqrepo_path = "/Users/kferrite/dev/biocommons.seqrepo/seqrepo/2024-02-20"
    gnomad_generator = calculate_gnomad_expressions(input_vcf)
    worker_count = 1  # os.cpu_count()
    progress_interval = 10
    manager = multiprocessing.Manager()
    allele_dict = manager.dict()
    with multiprocessing.Pool(
        worker_count,
        initializer=worker_initializer,
        initargs=(seqrepo_path,),
    ) as pool:
        # call the function for each item in parallel
        c = 0
        print(datetime.now().isoformat(), c)
        for result in pool.imap(translate, gnomad_generator):
            c += 1
            if result:
                allele_dict[result[0]] = result[1]
            if c % progress_interval == 0:
                print(datetime.now().isoformat(), c)
```
Awesome, thanks for the help! This is working well; I'm excited to use it to process some large VCFs.
The `_open_for_reading` function puts opened FastaFile objects in a memoize cache. These FastaFile objects have open file handles that are not thread safe. This causes issues in applications that use SeqRepo and may attempt to fetch a sequence from the same bgz file from two threads in the same process, via the same in-process `SeqRepo` object. This synchronization bug, and the exception it can lead to, are particularly difficult to debug in Python web servers. [deleted rant about python threading]

biocommons.seqrepo/src/biocommons/seqrepo/fastadir/fastadir.py, lines 201 to 204 in 9ce861b

So FastaFile objects are inherently not thread safe, because they logically represent (and contain) a C `open()`ed descriptor that is seeked on directly by the htslib functions called by FastaFile and elsewhere in pysam. The file handle struct in htslib that each FastaFile has an instantiation of is here:
https://github.com/samtools/htslib/blob/6143086502567c5c4bb5cacb2951f664ba28ed6e/hfile.c#L524-L528
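To make the hazard concrete (a sketch using a plain Python file object as a stand-in for a FastaFile): seek-then-read on one shared handle is a two-step operation, so without a mutex another thread can move the offset between the two steps. Holding a lock across the pair restores correctness.

```python
import os
import tempfile
import threading

# File of 1000 fixed-width 4-byte records: b"0000", b"0001", ...
path = os.path.join(tempfile.mkdtemp(), "seq.txt")
with open(path, "wb") as f:
    f.write(b"".join(b"%04d" % i for i in range(1000)))

shared = open(path, "rb")   # ONE handle shared by all threads
lock = threading.Lock()
errors = []

def fetch(record):
    # seek + read must be atomic on a shared handle; the lock makes it so.
    with lock:
        shared.seek(record * 4)
        data = shared.read(4)
    if data != b"%04d" % record:
        errors.append((record, data))

threads = [threading.Thread(target=lambda t=t: [fetch(r) for r in range(t, 1000, 4)])
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
shared.close()

assert not errors  # with the lock held, every read returns the right record
```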
We do not want to eliminate the open-FastaFile cache in seqrepo because of the overhead involved in opening and closing files. We could instead make these FastaFile objects lockable with a `contextmanager` and require FastaDir to acquire the lock before using the `.fetch` method. Like here:

biocommons.seqrepo/src/biocommons/seqrepo/fastadir/fastadir.py, lines 121 to 122 in 9ce861b

The downside is that this introduces overhead for acquiring and releasing the mutex even in single-threaded programs. The question then is whether this overhead is lower than the overhead of opening/closing files (it almost certainly is). If using a contextmanager does introduce problematic overhead, we could add something like an environment variable to disable thread safety, and applications would then need to ensure they use one `SeqRepo` object per thread.

It also means we have to remember to always acquire the mutex on the object returned from `_open_for_reading` if we want to do something stateful with it. As far as I can tell, though, this function is only used internally in `fastadir.py`, so that's easy. If we go this route we should explicitly document that this function should not be used externally, and that if it is, thread safety needs to be considered.
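A sketch of the lockable-handle idea with hypothetical names (the real fetch path lives in fastadir.py): pair each cached handle with a mutex and expose it only through a context manager, so callers must hold the lock while doing anything stateful with it.

```python
import threading
from contextlib import contextmanager

class LockedHandle:
    """Pairs a stateful handle (e.g. a pysam.FastaFile) with a mutex."""

    def __init__(self, handle):
        self._handle = handle
        self._lock = threading.Lock()

    @contextmanager
    def acquired(self):
        # Callers only see the raw handle while holding the lock, so all
        # seek/read state changes are serialized across threads.
        with self._lock:
            yield self._handle

# Hypothetical use inside a fetch method:
# with self._open_for_reading(path).acquired() as fa:
#     return fa.fetch(seq_id, start, end)
```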