I've spent the last few days of downtime after Devconnect mulling over how to deal with Rayon and deadlocks (rayon-rs/rayon#592).
We are faced with an acute instance of this problem in #4507 (comment).
Background
Rayon can deadlock if both of the following conditions are true:
1. Lock L is held while calling into Rayon (e.g. calling rayon::join while holding a mutex).
2. Lock L is accessed from within Rayon tasks (e.g. calling mutex.lock() within par_iter).
Deadlocks can occur in multiple ways: if the first thread that is holding the lock calls into Rayon and steals some of the jobs of type (2) which require the lock, then those jobs will not complete because of the circular dependency (they are waiting for the thread to release the lock, but the thread is waiting for them to complete because it is waiting for the Rayon call to return). There's a similar variant if the thread holding the lock is itself a Rayon thread, in which case the deadlock occurs because the thread tries to re-lock the lock it already holds.
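A hypothetical minimal sketch of the pattern (the names and the std::sync::Mutex are illustrative, not code from our tree) looks roughly like this:

```rust
use std::sync::Mutex;

use rayon::prelude::*;

// Minimal reproduction of the two conditions above. Not something to ship:
// it can (and is meant to) deadlock.
static SHARED: Mutex<Vec<u64>> = Mutex::new(Vec::new());

fn risky() {
    // Condition (1): lock L is held while calling into Rayon.
    let guard = SHARED.lock().unwrap();

    (0..1024u64).into_par_iter().for_each(|i| {
        // Condition (2): lock L is taken from within a Rayon task. The thread
        // holding `guard` participates in running these tasks while it waits
        // for the par_iter to finish, so it can end up blocked on a lock it
        // already holds, or on jobs that are themselves waiting for that lock.
        SHARED.lock().unwrap().push(i);
    });

    drop(guard);
}
```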
Observations
Weakening either of the conditions is sufficient to avoid the deadlock. We can either avoid holding locks while calling into Rayon, or avoid obtaining locks from within Rayon tasks.
There is spooky action at a distance: the deadlock is a property of lock usage across the entire codebase, which makes it difficult to check for manually.
We currently use Rayon in only two places: in the op pool for attestation packing, and while tree hashing beacon states.
Strategies
Avoid Rayon
We could use Tokio (or another futures executor?) instead, as tokio::sync::Mutex is explicitly safe to hold across an .await. I had a go at benchmarking tree hashing with Tokio and it was around 3.5x slower (sigp/milhouse#31). We probably don't want to switch tree-hashing out, but could switch out some of the cases where we use Rayon and aren't as driven by performance.
One complication is that Tokio is not designed to run compute-heavy tasks on the main executor: long-running CPU-bound work ties up worker threads and starves other tasks, so we may need a separate async executor for compute-heavy work. Using Tokio's blocking threads is not desirable, as then we can't use tokio::sync::Mutex (blocking tasks are synchronous closures and can't hold a guard across an .await).
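As a rough sketch of what that could look like, assuming a second multi-threaded Tokio runtime dedicated to compute work (all names below are illustrative):

```rust
use std::sync::Arc;

use tokio::sync::Mutex;

// Hypothetical dedicated runtime for compute-heavy work, kept separate from
// the main I/O runtime so CPU-bound tasks don't starve networking etc.
fn compute_runtime() -> tokio::runtime::Runtime {
    tokio::runtime::Builder::new_multi_thread()
        .thread_name("compute")
        .build()
        .expect("failed to build compute runtime")
}

async fn tree_hash_chunks(cache: Arc<Mutex<Vec<u64>>>) {
    // Unlike holding a std/parking_lot mutex inside Rayon, holding a
    // tokio::sync::Mutex guard across an .await is explicitly supported.
    let mut cache = cache.lock().await;
    for chunk in 0..4u64 {
        // ... hash one chunk, possibly awaiting sub-tasks ...
        cache.push(chunk);
        tokio::task::yield_now().await;
    }
}

fn main() {
    let rt = compute_runtime();
    rt.block_on(tree_hash_chunks(Arc::new(Mutex::new(Vec::new()))));
}
```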
Static Analysis
I've opened an issue on the lockbud tool to detect Rayon-related deadlocks (BurtonQin/lockbud#55). The author of lockbud previously used it to find double-locking bugs in Lighthouse, so it would be cool to support the tool, contribute to its development, and run it on CI. Even if we do detect deadlocks using the tool, we still need one of the other strategies to mitigate them.
Avoid Locks
Rayon is only dangerous when locks are involved, so we could refactor code to avoid using locks inside Rayon tasks. This is potentially difficult to do comprehensively without a static analyser, as it can be unclear whether some function is running under Rayon, or may call into Rayon. For tree-hashing it is easy to guarantee that we don't obtain any non-local locks, so it is more obviously safe. In the op pool, we are currently faced with a mess of deeply nested function calls and interactions with the fork choice lock, which are more difficult to eyeball. If we had a persistent and cheap-to-copy data structure for fork choice, then the op pool could use a copy rather than locking.
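A sketch of how the op pool might use such a copy, with a made-up snapshot type standing in for the real fork choice structures: the copy is taken under the lock, the guard is released, and nothing inside the Rayon tasks ever locks.

```rust
use std::sync::{Arc, RwLock};

use rayon::prelude::*;

// Hypothetical shape of the "copy rather than lock" idea: a persistent,
// structurally shared snapshot that is cheap to clone.
#[derive(Clone)]
struct ForkChoiceSnapshot {
    viable_heads: Arc<Vec<[u8; 32]>>,
}

fn pack_attestations(fork_choice: &RwLock<ForkChoiceSnapshot>, attestations: &[u64]) -> Vec<u64> {
    // Copy under the lock; the read guard is dropped at the end of this
    // statement, before Rayon is involved.
    let snapshot: ForkChoiceSnapshot = fork_choice.read().unwrap().clone();

    attestations
        .par_iter()
        .filter(|_att| !snapshot.viable_heads.is_empty()) // read-only use of the copy
        .cloned()
        .collect()
}
```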
Rayon is great for initial parallelization, but it's hard when you hit its limitations. We are moving away from Rayon at Grandine, but we still use it in some places. A separate Tokio thread pool for CPU-intensive tasks is a great option.