-
Notifications
You must be signed in to change notification settings - Fork 158
Reimplemented WorkflowRunLockManager to fix design flaw with an unsafe unlock #1147
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Conversation
4337a0b
to
0afe14c
Compare
// (like an extreme network latency). | ||
locked = runLocks.tryLock(runId, 1, TimeUnit.SECONDS); | ||
|
||
if (!locked) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change behind this line in this class is refactoring which makes the logic easier to follow and it may be ignored during the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be changed to the workflow task timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for confirming that this 1 sec looks off! Let's discuss more and change it. While important to do, unrelated to the scope of this PR, we can do it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we change it to the workflow task timeout, the purpose is just to fail the workflow task on worker side? (I mean, from user perspective, this task is timed out by server anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more to actually give a chance for the workflow task in flight to finish. Workflow task timeout is about 10s, deadlock detector is 1 second by default, seconds typically. There is more room to wait than 1 second typically and an opportunity to not fail a query request.
if (lockData.count == 0) { | ||
perRunLock.remove(runId); | ||
// it's important to signal all threads, | ||
// otherwise n-1 of them will stuck waiting on a condition that is not in the map already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// otherwise n-1 of them will stuck waiting on a condition that is not in the map already | |
// otherwise n-1 of them will be stuck waiting on a condition that is not in the map already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to review this without more knowledge of Java SDK internals.
If 1 second refers to net workflow code processing time it sounds like enough but if that includes data converters and local activities, either of which could do I/O then that is definitely insufficient.
What was changed
WorkflowRunLockManager was reimplemented in a safe way. Now unlock for a runId can be called only from a thread that acquired the lock.
Why?
See #1146. The current API and implementation of WorkflowRunLockManager may lead to a lock for a runId being released by a thread that is not eligible for that.
Closes #1146