I've spent the last few days of downtime after Devconnect mulling over how to deal with Rayon and deadlocks (rayon-rs/rayon#592).
We are faced with an acute instance of this problem in #4507 (comment).
Background
Rayon can deadlock if both of the following conditions are true:
1. Lock L is held while calling into Rayon (e.g. calling rayon::join while holding a mutex).
2. Lock L is accessed from within Rayon tasks (e.g. calling mutex.lock() within par_iter).
Deadlocks can occur in multiple ways: if the first thread that is holding the lock calls into Rayon and steals some of the jobs of type (2) which require the lock, then those jobs will not complete because of the circular dependency (they are waiting for the thread to release the lock, but the thread is waiting for them to complete because it is waiting for the Rayon call to return). There's a similar variant if the thread holding the lock is itself a Rayon thread, in which case the deadlock occurs because the thread tries to re-lock the lock it already holds.
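A hypothetical minimal sketch of the pattern (the names and the std::sync::Mutex are illustrative, not code from our tree) looks roughly like this:

```rust
use std::sync::Mutex;

use rayon::prelude::*;

// Minimal reproduction of the two conditions above. Not something to ship:
// it can (and is meant to) deadlock.
static SHARED: Mutex<Vec<u64>> = Mutex::new(Vec::new());

fn risky() {
    // Condition (1): lock L is held while calling into Rayon.
    let guard = SHARED.lock().unwrap();

    (0..1024u64).into_par_iter().for_each(|i| {
        // Condition (2): lock L is taken from within a Rayon task. The thread
        // holding `guard` participates in running these tasks while it waits
        // for the par_iter to finish, so it can end up blocked on a lock it
        // already holds, or on jobs that are themselves waiting for that lock.
        SHARED.lock().unwrap().push(i);
    });

    drop(guard);
}
```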
Observations
Weakening either of the conditions is sufficient to avoid the deadlock. We can either avoid holding locks while calling into Rayon, or avoid obtaining locks from within Rayon tasks.
There is spooky action at a distance: the deadlock is a property of lock usage across the entire codebase, which makes it difficult to check for manually.
We currently use Rayon in only two places: in the op pool for attestation packing, and while tree hashing beacon states.
Strategies
Avoid Rayon
We could use Tokio (or another futures executor?) instead, as tokio::sync::Mutex is explicitly safe to hold across an .await. I had a go at benchmarking tree hashing with Tokio and it was around 3.5x slower (sigp/milhouse#31). We probably don't want to switch tree-hashing out, but could switch out some of the cases where we use Rayon and aren't as driven by performance.
One complication is that Tokio is not designed to run compute-heavy tasks on the main executor: long-running CPU-bound work ties up worker threads and starves other tasks, so we may need a separate async executor for compute-heavy work. Using Tokio's blocking threads is not desirable, as then we can't use tokio::sync::Mutex (blocking tasks are synchronous closures and can't hold a guard across an .await).
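As a rough sketch of what that could look like, assuming a second multi-threaded Tokio runtime dedicated to compute work (all names below are illustrative):

```rust
use std::sync::Arc;

use tokio::sync::Mutex;

// Hypothetical dedicated runtime for compute-heavy work, kept separate from
// the main I/O runtime so CPU-bound tasks don't starve networking etc.
fn compute_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("compute")
        .build()
        .expect("failed to build compute runtime")
}

async fn tree_hash_chunks(cache: Arc<Mutex<Vec<u64>>>) {
    // Unlike holding a std/parking_lot mutex inside Rayon, holding a
    // tokio::sync::Mutex guard across an .await is explicitly supported.
    let mut cache = cache.lock().await;
    for chunk in 0..4u64 {
        // ... hash one chunk, possibly awaiting sub-tasks ...
        cache.push(chunk);
        tokio::task::yield_now().await;
    }
}

fn main() {
    let rt = compute_runtime();
    rt.block_on(tree_hash_chunks(Arc::new(Mutex::new(Vec::new()))));
}
```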
Static Analysis
I've opened an issue on the lockbud tool to detect Rayon-related deadlocks (BurtonQin/lockbud#55). The author of lockbud previously used it to find double-locking bugs in Lighthouse, so it would be cool to support the tool, contribute to its development, and run it on CI. Even if we do detect deadlocks using the tool, we still need one of the other strategies to mitigate them.
Avoid Locks
Rayon is only dangerous when locks are involved, so we could refactor code to avoid using locks inside Rayon tasks. This is potentially difficult to do comprehensively without a static analyser, as it can be unclear whether some function is running under Rayon, or may call into Rayon. For tree-hashing it is easy to guarantee that we don't obtain any non-local locks, so it is more obviously safe. In the op pool, we are currently faced with a mess of deeply nested function calls and interactions with the fork choice lock, which are more difficult to eyeball. If we had a persistent and cheap-to-copy data structure for fork choice, then the op pool could use a copy rather than locking.
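A sketch of how the op pool might use such a copy, with a made-up snapshot type standing in for the real fork choice structures: the copy is taken under the lock, the guard is released, and nothing inside the Rayon tasks ever locks.

```rust
use std::sync::{Arc, RwLock};

use rayon::prelude::*;

// Hypothetical shape of the "copy rather than lock" idea: a persistent,
// structurally shared snapshot that is cheap to clone.
#[derive(Clone)]
struct ForkChoiceSnapshot {
    viable_heads: Arc<Vec<[u8; 32]>>,
}

fn pack_attestations(fork_choice: &RwLock<ForkChoiceSnapshot>, attestations: &[u64]) -> Vec<u64> {
    // Copy under the lock; the read guard is dropped at the end of this
    // statement, before Rayon is involved.
    let snapshot: ForkChoiceSnapshot = fork_choice.read().unwrap().clone();

    attestations
        .par_iter()
        .filter(|_att| !snapshot.viable_heads.is_empty()) // read-only use of the copy
        .cloned()
        .collect()
}
```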
Rayon is great for initial parallelization, but it's hard when you hit its limitations. We are moving away from Rayon at Grandine, but we still use it in some places. A separate Tokio thread pool for CPU-intensive tasks is a great option.