Reimplemented WorkflowRunLockManager to fix design flaw with an unsafe unlock #1147

Spikhalskiy · 2022-04-17T01:08:29Z

What was changed

WorkflowRunLockManager was reimplemented so that unlocking is safe: unlock for a runId can now be called only from the thread that acquired the lock.

Why?

See #1146. The current API and implementation of WorkflowRunLockManager allow the lock for a runId to be released by a thread that does not hold it.

Closes #1146
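To illustrate the ownership rule described above, here is a minimal sketch of a per-runId lock that records the acquiring thread and rejects unlock calls from any other thread. It is not the SDK's WorkflowRunLockManager: the class name `OwnerCheckedRunLocks`, the single-monitor design, and the choice of `IllegalMonitorStateException` are all assumptions made for illustration, and the sketch is deliberately non-reentrant.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the SDK's WorkflowRunLockManager.
final class OwnerCheckedRunLocks {
  // runId -> thread that currently holds the lock for that run
  private final Map<String, Thread> owners = new HashMap<>();

  // Waits up to the given timeout for the runId to become free.
  synchronized boolean tryLock(String runId, long timeout, TimeUnit unit)
      throws InterruptedException {
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (owners.containsKey(runId)) {
      long remaining = deadline - System.nanoTime();
      if (remaining <= 0) {
        return false; // timed out while another thread held the lock
      }
      TimeUnit.NANOSECONDS.timedWait(this, remaining);
    }
    owners.put(runId, Thread.currentThread());
    return true;
  }

  // Only the thread that acquired the lock may release it.
  synchronized void unlock(String runId) {
    if (owners.get(runId) != Thread.currentThread()) {
      throw new IllegalMonitorStateException(
          "runId " + runId + " is not locked by the current thread");
    }
    owners.remove(runId);
    notifyAll(); // wake every waiter so each re-checks its own runId
  }
}
```

With this shape, a stray unlock from a thread that never acquired the lock fails loudly instead of silently releasing a lock that another thread still relies on.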

Spikhalskiy · 2022-04-17T22:38:46Z

temporal-sdk/src/main/java/io/temporal/internal/worker/WorkflowWorker.java

+        //   (like an extreme network latency).
+        locked = runLocks.tryLock(runId, 1, TimeUnit.SECONDS);
+
+        if (!locked) {


The change behind this line is a refactoring that makes the logic in this class easier to follow; it can be ignored during review.

I think it should be changed to the workflow task timeout.

Thanks for confirming that this 1 second looks off! Let's discuss it more and change it. It's important to do, but it's outside the scope of this PR, so we can do it separately.

If we change it to the workflow task timeout, is the purpose just to fail the workflow task on the worker side? (I mean, from the user's perspective, this task is timed out by the server anyway.)

It's more to give the workflow task in flight a chance to finish. The workflow task timeout is about 10s, and the deadlock detector is 1 second by default. There is typically more room to wait than 1 second, which gives us an opportunity to not fail a query request.
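The idea in this thread, waiting for the run lock for a configurable duration instead of a hard-coded 1 second, can be sketched as follows. This is only an illustration under assumptions: `QueryLockWaitSketch` and `runWithLock` are hypothetical names, a plain `ReentrantLock` stands in for the per-run lock, and the caller is assumed to pass something like the workflow task timeout as `maxWait`.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical sketch; a ReentrantLock stands in for the per-run lock.
final class QueryLockWaitSketch {
  // Waits up to maxWait for the lock, then runs the handler while holding it.
  static <T> T runWithLock(ReentrantLock runLock, Duration maxWait, Supplier<T> handler)
      throws InterruptedException {
    boolean locked = runLock.tryLock(maxWait.toMillis(), TimeUnit.MILLISECONDS);
    if (!locked) {
      // The in-flight task did not finish in time, so the request is failed.
      throw new IllegalStateException("run is busy; gave up after " + maxWait);
    }
    try {
      return handler.get();
    } finally {
      // Reached only when this thread acquired the lock above.
      runLock.unlock();
    }
  }
}
```

A longer `maxWait` trades a slower failure for a better chance that the in-flight workflow task completes and the query can be served.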

bergundy · 2022-04-18T15:03:48Z

temporal-sdk/src/main/java/io/temporal/internal/worker/WorkflowRunLockManager.java

+        if (lockData.count == 0) {
+          perRunLock.remove(runId);
+          // it's important to signal all threads,
+          // otherwise n-1 of them will stuck waiting on a condition that is not in the map already


Suggested change

-          // otherwise n-1 of them will stuck waiting on a condition that is not in the map already
+          // otherwise n-1 of them will be stuck waiting on a condition that is not in the map already
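The reasoning in the code comment above can be sketched with a map of per-run entries, each carrying its own `Condition`. This is an assumption-laden illustration, not the code under review: `ConditionRunLocks` is a hypothetical class, it borrows the names `perRunLock` and `LockData` from the diff, and it drops the `count` field in favor of removing the entry on every unlock. Because the entry and its condition leave the map on release, the next acquirer creates a fresh condition, so any waiter left parked on the old one would never be signaled again.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch; names borrowed from the diff, structure assumed.
final class ConditionRunLocks {
  private static final class LockData {
    final Condition released;
    final Thread owner;

    LockData(Condition released, Thread owner) {
      this.released = released;
      this.owner = owner;
    }
  }

  private final ReentrantLock mapLock = new ReentrantLock();
  private final Map<String, LockData> perRunLock = new HashMap<>();

  boolean tryLock(String runId, long timeout, TimeUnit unit) throws InterruptedException {
    long nanos = unit.toNanos(timeout);
    mapLock.lock();
    try {
      while (true) {
        LockData data = perRunLock.get(runId);
        if (data == null) {
          perRunLock.put(runId, new LockData(mapLock.newCondition(), Thread.currentThread()));
          return true;
        }
        if (nanos <= 0) {
          return false;
        }
        // Parks on the condition of the entry that is in the map right now.
        nanos = data.released.awaitNanos(nanos);
      }
    } finally {
      mapLock.unlock();
    }
  }

  void unlock(String runId) {
    mapLock.lock();
    try {
      LockData data = perRunLock.get(runId);
      if (data == null || data.owner != Thread.currentThread()) {
        throw new IllegalMonitorStateException("runId is not locked by the current thread");
      }
      perRunLock.remove(runId);
      // signalAll, not signal: this condition is no longer reachable from the
      // map, so every waiter must wake up and re-check the map.
      data.released.signalAll();
    } finally {
      mapLock.unlock();
    }
  }
}
```

After `signalAll`, one woken waiter installs a new entry and the others park on that new entry's condition, which the next unlock will signal.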

bergundy

Hard to review this without more knowledge of Java SDK internals.

If 1 second refers to net workflow code processing time, it sounds like enough, but if it includes data converters and local activities, either of which could do I/O, then it is definitely insufficient.

Spikhalskiy requested review from mfateev, Sushisource, cretz, bergundy and mmcshane as code owners April 17, 2022 01:08

Spikhalskiy force-pushed the issue-1146 branch 6 times, most recently from 4337a0b to 0afe14c Compare April 17, 2022 22:36

bergundy reviewed Apr 18, 2022

Spikhalskiy force-pushed the issue-1146 branch from 0afe14c to 782ee18 Compare April 18, 2022 15:05

bergundy reviewed Apr 18, 2022

cretz reviewed Apr 18, 2022

Spikhalskiy force-pushed the issue-1146 branch from 782ee18 to 8e45050 Compare April 18, 2022 17:22

Reimplemented WorkflowRunLockManager to fix design flaw with an unsafe unlock Issue temporalio#1146

e563477

Spikhalskiy force-pushed the issue-1146 branch from 8e45050 to e563477 Compare April 18, 2022 18:09

Rework WorkflowRunLockManager to concurrentHashMap approach

5a26864

Spikhalskiy force-pushed the issue-1146 branch from 07f06d8 to 5a26864 Compare April 18, 2022 20:44

cretz approved these changes Apr 18, 2022

Spikhalskiy merged commit 48bdd4d into temporalio:master Apr 19, 2022

Spikhalskiy deleted the issue-1146 branch April 19, 2022 15:03

Spikhalskiy Apr 17, 2022

mfateev Apr 19, 2022

Spikhalskiy Apr 19, 2022

meiliang86 Apr 20, 2022

Spikhalskiy Apr 20, 2022
