`os.listdir` (and presumably other `os` functions) fails in the face of signals #955
Comments
Hey @emeryberger, thanks for reporting this issue and sorry for the delay in the response. We've reproduced it and are able to confirm that the problem exists. Here is a debug log:
From the log we can see that when an application gets interrupted in a

A separate problem is the possibility of errors caused by interrupts during other system calls, for which we don't have a reproduction yet. I've created a separate issue for that.
## Description of change

When a user application gets interrupted in a `readdir` syscall, the underlying chain of `readdir` FUSE requests gets reset to an offset which Mountpoint considers stale. In that case Mountpoint still completes the interrupted `readdir` request, but the kernel partially discards the response. We already cache the last response, so we can use it to serve the request which follows the interrupt.

Relevant issues: #955

## Does this change impact existing behavior?

This is not a breaking change. Previously an error was returned; now the request is handled properly.

---

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and I agree to the terms of the [Developer Certificate of Origin (DCO)](https://developercertificate.org/).

---------

Signed-off-by: Vladislav Volodkin <vlaad@amazon.co.uk>
Signed-off-by: Vlad Volodkin <vlaad@amazon.com>
Co-authored-by: Vladislav Volodkin <vlaad@amazon.co.uk>
Co-authored-by: Vlad Volodkin <vlaad@amazon.com>
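The caching idea in the description above can be illustrated with a small model. This is only a sketch in Python, not Mountpoint's implementation (which is written in Rust): the `ReaddirHandle` class, its fields, and the `OutOfOrderReaddir` exception are hypothetical names invented for illustration. It assumes a handle that serves sequential offsets, remembers the last response, and replays it when the request repeats the previous offset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class OutOfOrderReaddir(Exception):
    """Raised when a request offset is neither the next nor the previous one."""


@dataclass
class ReaddirHandle:
    # Hypothetical model of a directory handle; not Mountpoint's real type.
    entries: List[str]
    next_offset: int = 0
    last_offset: Optional[int] = None
    last_response: List[str] = field(default_factory=list)

    def readdir(self, offset: int, max_entries: int) -> List[str]:
        if offset == self.next_offset:
            # Normal sequential request: serve the next slice and cache it.
            response = self.entries[offset:offset + max_entries]
            self.last_offset = offset
            self.last_response = response
            self.next_offset = offset + len(response)
            return response
        if offset == self.last_offset:
            # Request repeated after an interrupt: replay the cached response.
            return self.last_response[:max_entries]
        raise OutOfOrderReaddir(
            f"out-of-order readdir, expected={self.next_offset}, actual={offset}"
        )
```

In this model, a repeated request at the previous offset returns the same entries instead of raising, while any other unexpected offset still fails.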
Mountpoint for Amazon S3 version
mount-s3 1.7.2
AWS Region
us-east-1
Describe the running environment
Running on EC2, accessing an S3 bucket through my account, using this AMI: Deep Learning Base Proprietary Nvidia Driver GPU AMI (Ubuntu 20.04) 20240314. Same setup as here: plasma-umass/scalene#841
Mountpoint options
What happened?
This error (failure when running in a mounted S3 system) was brought to my attention with this issue with the Scalene profiler: plasma-umass/scalene#841
The root cause turns out to be the CPU timer signal; if the `os.listdir` function is interrupted by a `signal.SIGALRM`, the call fails with an `OSError`. I set the frequency below to a level that triggers the failure roughly half the time; setting it to 1 second makes it never happen. Since the default CPU sampling frequency used by Scalene is 0.01 seconds, it fails consistently. Note that the profiler `py-spy` also causes this failure.
MRE here:
Example failure:
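The original MRE and failure output are not reproduced on this page. As a rough illustration of the scenario described above, the following sketch arms a repeating `SIGALRM` timer and calls `os.listdir` in a loop, counting `OSError`s. The function name `count_listdir_failures`, the 1 ms interval, and the iteration count are assumptions chosen for illustration, not the reporter's values.

```python
import os
import signal


def count_listdir_failures(path, interval=0.001, iterations=1000):
    """Call os.listdir(path) repeatedly while SIGALRM fires every `interval` seconds.

    Returns the number of calls that raised OSError. Unix-only.
    """
    def on_alarm(signum, frame):
        # The handler does nothing; its only purpose is to interrupt syscalls.
        pass

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    # ITIMER_REAL delivers SIGALRM when it expires, then re-arms itself.
    signal.setitimer(signal.ITIMER_REAL, interval, interval)
    failures = 0
    try:
        for _ in range(iterations):
            try:
                os.listdir(path)
            except OSError:
                failures += 1
    finally:
        # Disarm the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        if old_handler is not None:
            signal.signal(signal.SIGALRM, old_handler)
    return failures
```

Pointing `path` at a directory on the mounted bucket would presumably show a non-zero count on an affected version; on a local filesystem the count is expected to stay at zero.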
I have implemented a workaround for this situation (wrapping all `os` functions so that they block the `SIGALRM` signal) in plasma-umass/scalene#842, but it seems like it is exposing a race condition in mount-s3, or at minimum, undesirable behavior.
Relevant log output
2024-07-29T00:47:59.230133Z WARN mountpoint_s3::cli: failed to detect network throughput. Using 10 gbps as throughput. Use --maximum-throughput-gbps CLI flag to configure a target throughput appropriate for the instance. Detection failed due to: failed to get network throughput
2024-07-29T00:48:15.393878Z WARN readdirplus{req=10 ino=1 fh=1 offset=1}: mountpoint_s3::fuse: readdirplus failed: out-of-order readdir, expected=3, actual=1
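For reference, the workaround mentioned earlier (blocking `SIGALRM` around `os` calls) could look roughly like the sketch below. This is not the code from plasma-umass/scalene#842; the decorator name `block_signal` and the wrapper `listdir_no_alarm` are hypothetical. It relies on `signal.pthread_sigmask`, which is available on Unix only.

```python
import functools
import os
import signal


def block_signal(signum):
    """Return a decorator that blocks `signum` while the wrapped function runs."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Block the signal in the calling thread and remember the old mask.
            old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
            try:
                return func(*args, **kwargs)
            finally:
                # Restore the previous mask; a pending signal is delivered now.
                signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        return wrapper
    return decorator


# Hypothetical wrapped variant of os.listdir that cannot be interrupted by SIGALRM.
listdir_no_alarm = block_signal(signal.SIGALRM)(os.listdir)
```

A signal that arrives while blocked is held pending rather than lost, so a sampling profiler would still see the tick, only slightly delayed.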