
[SPARK-40235][CORE] Use interruptible lock instead of synchronized in Executor.updateDependencies() #37681


Closed

Conversation

JoshRosen (Contributor)

What changes were proposed in this pull request?

This patch modifies the synchronization in Executor.updateDependencies() in order to allow tasks to be interrupted while they are blocked and waiting on other tasks to finish downloading dependencies.

This synchronization was added years ago in mesos/spark@7b9e96c in order to prevent concurrently-launching tasks from performing concurrent dependency updates. If one task is downloading dependencies, all other newly-launched tasks will block until the original dependency download is complete.

Let's say that a Spark task launches, becomes blocked on an updateDependencies() call, and is then canceled while it is blocked. Although Spark will send a Thread.interrupt() to the canceled task, the task will continue waiting because a thread that is blocked waiting to enter a synchronized block does not throw an InterruptedException in response to the interrupt. As a result, the blocked thread will continue to wait until the other thread exits the synchronized block.
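
As an illustration of that JVM behavior, here is a standalone sketch (not Spark code; names are illustrative): a thread blocked trying to enter a monitor held by another thread stays BLOCKED even after it is interrupted.

```scala
// Standalone sketch (not Spark code): a thread blocked on entering a
// `synchronized` block ignores Thread.interrupt(); only the interrupt
// flag is set, and the thread keeps waiting for the monitor.
object SynchronizedIsUninterruptible {
  private val monitor = new Object

  def main(args: Array[String]): Unit = {
    // Simulates the task that holds the lock during a slow dependency download.
    val holder = new Thread(() => monitor.synchronized { Thread.sleep(10000) })
    holder.start()
    Thread.sleep(100) // let the holder acquire the monitor first

    // Simulates a newly launched task that blocks on the same monitor.
    val waiter = new Thread(() => monitor.synchronized { println("entered") })
    waiter.start()
    Thread.sleep(100)

    waiter.interrupt()        // does not unblock the waiter
    Thread.sleep(100)
    println(waiter.getState)  // prints BLOCKED, not TERMINATED
  }
}
```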

This PR fixes the problem by replacing the synchronized block with a ReentrantLock, whose lockInterruptibly() method allows a waiting thread to respond to interruption.
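
A minimal sketch of that pattern (illustrative only, not the exact code in Executor.scala):

```scala
import java.util.concurrent.locks.ReentrantLock

// Illustrative sketch of the fix described above: acquire the lock with
// lockInterruptibly() so that a canceled task's Thread.interrupt() can
// break it out of the wait instead of leaving it blocked on a monitor.
class DependencyUpdater {
  private val updateLock = new ReentrantLock()

  @throws[InterruptedException] // propagated if interrupted while waiting for the lock
  def updateDependencies(): Unit = {
    updateLock.lockInterruptibly()
    try {
      // ... fetch missing files, JARs, and archives here ...
    } finally {
      updateLock.unlock()
    }
  }
}
```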

Why are the changes needed?

In a real-world scenario, we hit a case where a task was canceled right after launch while it was blocked waiting for another task's slow library download to complete. The download took so long that the TaskReaper killed the executor because the canceled task could not exit in a timely fashion. This patch prevents that issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test case.

@JoshRosen added the CORE label on Aug 26, 2022
@mridulm (Contributor) left a comment


Nice fix!

Thoughts on doing something similar for TransportClientFactory.createClient?
Any other place where this pattern might be relevant?

@JoshRosen (Contributor, Author)

Thoughts on doing something similar for TransportClientFactory.createClient ? Any other place where this pattern might be relevant ?

Good catch: I agree that we should fix this for TransportClientFactory.createClient, too. In fact, it turns out that we used to have a different interruption / cancellation bug in that code at #16866, so this is definitely worth fixing.

I'd like to do that in a separate followup PR, though, since I'm still thinking through some details of how/if I want to test that other change. Filed https://issues.apache.org/jira/browse/SPARK-40263 for that followup.

I skimmed through other uses of synchronized and didn't spot any other obvious problems. Beyond that, I think that LoadingCache could be another source of uninterruptible tasks: calls that are blocked in .get() on a LoadingCache will be uninterruptible (see google/guava#1122 and https://github.com/google/guava/wiki/CachesExplained). Of the places where we use LoadingCache, I think the only potentially slow one might be CodeGenerator. I think it's unlikely that code generation would take 1+ minutes, though. Therefore I think that this PR and the followup that I'll open for TransportClientFactory.createClient should largely resolve this class of issue where an uninterruptible task triggers the TaskReaper.
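
For reference, a standalone sketch of the LoadingCache behavior mentioned above (assumes Guava on the classpath; not Spark code): a thread waiting on another thread's in-flight load of the same key waits uninterruptibly.

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Standalone sketch (assumes Guava on the classpath): while one thread is
// loading a key, a second thread calling get() on the same key waits for the
// in-flight load and does not respond to Thread.interrupt().
object LoadingCacheIsUninterruptible {
  private val cache: LoadingCache[String, String] =
    CacheBuilder.newBuilder().build(new CacheLoader[String, String] {
      override def load(key: String): String = { Thread.sleep(10000); key } // slow load
    })

  def main(args: Array[String]): Unit = {
    val loader = new Thread(() => { cache.get("k"); () }) // starts the slow load
    loader.start()
    Thread.sleep(100)

    val waiter = new Thread(() => { cache.get("k"); () }) // waits on the in-flight load
    waiter.start()
    Thread.sleep(100)

    waiter.interrupt()        // the wait continues; only the interrupt flag is set
    Thread.sleep(100)
    println(waiter.getState)  // still WAITING until the load finishes
  }
}
```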

I'm going to merge this to master and will aim to get a followup PR open soon.

@JoshRosen closed this in 295dd57 on Aug 29, 2022