Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Detect Rayon deadlocks #55

Open
michaelsproul opened this issue Nov 26, 2023 · 0 comments
Open

Detect Rayon deadlocks #55

michaelsproul opened this issue Nov 26, 2023 · 0 comments

Comments

@michaelsproul
Copy link

Rayon has this footgun for lock usage where it can deadlock if both of the following conditions are true:

  1. Lock L is held while calling into Rayon (e.g. calling rayon::join while holding a mutex).
  2. Lock L is accessed from within Rayon tasks (e.g. calling mutex.lock() within par_iter).

Deadlocks can occur in multiple ways: if the first thread that is holding the lock calls into Rayon and steals some of the jobs of type (2) which require the lock, then those jobs will not complete because of the circular dependency (they are waiting for the thread to release the lock, but the thread is waiting for them to complete because it is waiting for the Rayon call to return). There's a similar variant if the thread holding the lock is itself a Rayon thread, in which case the deadlock occurs because the thread tries to re-lock the lock it already holds.

How could lockbud help? It would be prone to false positives, but it could detect any calls to Rayon functions that occur while holding a Mutex/RwLock. I think this is similar to detecting double-locking.

Unfortunately it seems that a deadlock-avoiding mutex for Rayon is not likely to be implemented any time soon. The naive approach of yielding when locking fails does not work because there is no "task" that can be yielded to which will unblock execution – we need the current task to finish to return control to the caller (see rayon-rs/rayon#592 (comment)). Unlike in async code, there is no way to interrupt a Rayon task while it is running, because it is just an ordinary function. This suggests it also wouldn't be feasible to port tokio::sync::Mutex which uses wakers and other futures-specific features to work around this problem in the async context (see https://github.com/tokio-rs/tokio/blob/2400769b54fd31186b07bfbec277e846e8d64c16/tokio/src/sync/batch_semaphore.rs#L2-L17).

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant