Skip to content

Reimplemented WorkflowRunLockManager to fix design flaw with an unsafe unlock #1147

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 2 commits into from
Apr 19, 2022

Conversation

Spikhalskiy
Copy link
Contributor

@Spikhalskiy Spikhalskiy commented Apr 17, 2022

What was changed

WorkflowRunLockManager was reimplemented in a safe way. Now unlock for a runId can be called only from a thread that acquired the lock.

Why?

See #1146. The current API and implementation of WorkflowRunLockManager may lead to a lock for a runId being released by a thread that is not eligible for that.

Closes #1146

@Spikhalskiy Spikhalskiy force-pushed the issue-1146 branch 6 times, most recently from 4337a0b to 0afe14c Compare April 17, 2022 22:36
// (like an extreme network latency).
locked = runLocks.tryLock(runId, 1, TimeUnit.SECONDS);

if (!locked) {
Copy link
Contributor Author

@Spikhalskiy Spikhalskiy Apr 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change behind this line in this class is refactoring which makes the logic easier to follow and it may be ignored during the review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be changed to the workflow task timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for confirming that this 1 sec looks off! Let's discuss more and change it. While important to do, unrelated to the scope of this PR, we can do it separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we change it to the workflow task timeout, the purpose is just to fail the workflow task on worker side? (I mean, from user perspective, this task is timed out by server anyway)

Copy link
Contributor Author

@Spikhalskiy Spikhalskiy Apr 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more to actually give a chance for the workflow task in flight to finish. Workflow task timeout is about 10s, deadlock detector is 1 second by default, seconds typically. There is more room to wait than 1 second typically and an opportunity to not fail a query request.

if (lockData.count == 0) {
perRunLock.remove(runId);
// it's important to signal all threads,
// otherwise n-1 of them will stuck waiting on a condition that is not in the map already
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// otherwise n-1 of them will stuck waiting on a condition that is not in the map already
// otherwise n-1 of them will be stuck waiting on a condition that is not in the map already

Copy link
Member

@bergundy bergundy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard to review this without more knowledge of Java SDK internals.

If 1 second refers to net workflow code processing time it sounds like enough but if that includes data converters and local activities, either of which could do I/O then that is definitely insufficient.

@Spikhalskiy Spikhalskiy merged commit 48bdd4d into temporalio:master Apr 19, 2022
@Spikhalskiy Spikhalskiy deleted the issue-1146 branch April 19, 2022 15:03
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Code using WorkflowRunLockManager may release locks that are owned by other threads
5 participants