
[SPARK-40235][CORE] Use interruptible lock instead of synchronized in Executor.updateDependencies() #37681


Closed

Conversation

JoshRosen (Contributor)

What changes were proposed in this pull request?

This patch modifies the synchronization in Executor.updateDependencies() in order to allow tasks to be interrupted while they are blocked and waiting on other tasks to finish downloading dependencies.

This synchronization was added years ago in mesos/spark@7b9e96c in order to prevent concurrently-launching tasks from performing concurrent dependency updates. If one task is downloading dependencies, all other newly-launched tasks will block until the original dependency download is complete.

Let's say that a Spark task launches, becomes blocked on an updateDependencies() call, and is then canceled while it is blocked. Although Spark will send a Thread.interrupt() to the canceled task, the task will continue waiting because a thread that is blocked waiting to enter a synchronized block does not throw an InterruptedException in response to the interrupt. As a result, the blocked thread will continue to wait until the other thread exits the synchronized block.
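
As an illustration of that JVM behavior, here is a standalone sketch (not Spark code; names are illustrative): a thread blocked trying to enter a monitor held by another thread stays BLOCKED even after it is interrupted.

```scala
// Standalone sketch (not Spark code): a thread blocked on entering a
// `synchronized` block ignores Thread.interrupt(); only the interrupt
// flag is set, and the thread keeps waiting for the monitor.
object SynchronizedIsUninterruptible {
  private val monitor = new Object

  def main(args: Array[String]): Unit = {
    // Simulates the task that holds the lock during a slow dependency download.
    val holder = new Thread(() => monitor.synchronized { Thread.sleep(10000) })
    holder.start()
    Thread.sleep(100) // let the holder acquire the monitor first

    // Simulates a newly launched task that blocks on the same monitor.
    val waiter = new Thread(() => monitor.synchronized { println("entered") })
    waiter.start()
    Thread.sleep(100)

    waiter.interrupt()        // does not unblock the waiter
    Thread.sleep(100)
    println(waiter.getState)  // prints BLOCKED, not TERMINATED
  }
}
```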

This PR fixes the problem by replacing the synchronized block with a ReentrantLock, whose lockInterruptibly() method allows a waiting thread to respond to interruption.
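
A minimal sketch of that pattern (illustrative only, not the exact code in Executor.scala):

```scala
import java.util.concurrent.locks.ReentrantLock

// Illustrative sketch of the fix described above: acquire the lock with
// lockInterruptibly() so that a canceled task's Thread.interrupt() can
// break it out of the wait instead of leaving it blocked on a monitor.
class DependencyUpdater {
  private val updateLock = new ReentrantLock()

  @throws[InterruptedException] // propagated if interrupted while waiting for the lock
  def updateDependencies(): Unit = {
    updateLock.lockInterruptibly()
    try {
      // ... fetch missing files, JARs, and archives here ...
    } finally {
      updateLock.unlock()
    }
  }
}
```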

Why are the changes needed?

In a real-world scenario, we hit a case where a task was canceled right after launch while it was blocked waiting for another task's slow library download to complete. The download took so long that the TaskReaper killed the executor because the canceled task could not exit in a timely fashion. This patch prevents that issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New unit test case.

@JoshRosen added the CORE label on Aug 26, 2022
@mridulm (Contributor) left a comment


Nice fix!

Thoughts on doing something similar for TransportClientFactory.createClient?
Any other place where this pattern might be relevant?

@JoshRosen (Contributor, Author)

Thoughts on doing something similar for TransportClientFactory.createClient ? Any other place where this pattern might be relevant ?

Good catch: I agree that we should fix this for TransportClientFactory.createClient, too. In fact, it turns out that we used to have a different interruption / cancellation bug in that code at #16866, so this is definitely worth fixing.

I'd like to do that in a separate followup PR, though, since I'm still thinking through some details of how/if I want to test that other change. Filed https://issues.apache.org/jira/browse/SPARK-40263 for that followup.

I skimmed through other uses of synchronized and didn't spot any other obvious problems. Beyond that, I think that LoadingCache could be another source of uninterruptible tasks: calls that are blocked in .get() on a LoadingCache will be uninterruptible (see google/guava#1122 and https://github.com/google/guava/wiki/CachesExplained). Of the places where we use LoadingCache, I think the only potentially slow one might be CodeGenerator. I think it's unlikely that code generation would take 1+ minutes, though. Therefore I think that this PR and the followup that I'll open for TransportClientFactory.createClient should largely resolve this class of issue where an uninterruptible task triggers the TaskReaper.
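
For reference, a standalone sketch of the LoadingCache behavior mentioned above (assumes Guava on the classpath; not Spark code): a thread waiting on another thread's in-flight load of the same key waits uninterruptibly.

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// Standalone sketch (assumes Guava on the classpath): while one thread is
// loading a key, a second thread calling get() on the same key waits for the
// in-flight load and does not respond to Thread.interrupt().
object LoadingCacheIsUninterruptible {
  private val cache: LoadingCache[String, String] =
    CacheBuilder.newBuilder().build(new CacheLoader[String, String] {
      override def load(key: String): String = { Thread.sleep(10000); key } // slow load
    })

  def main(args: Array[String]): Unit = {
    val loader = new Thread(() => { cache.get("k"); () }) // starts the slow load
    loader.start()
    Thread.sleep(100)

    val waiter = new Thread(() => { cache.get("k"); () }) // waits on the in-flight load
    waiter.start()
    Thread.sleep(100)

    waiter.interrupt()        // the wait continues; only the interrupt flag is set
    Thread.sleep(100)
    println(waiter.getState)  // still WAITING until the load finishes
  }
}
```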

I'm going to merge this to master and will aim to get a followup PR open soon.

@JoshRosen closed this in 295dd57 on Aug 29, 2022