-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Recalculate schedule times from previous action on update #5381
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Actually this replay test is not effective here since it doesn't exercise the bug. I'll remove it and try to create one that does, but we can review and merge in the meantime. |
ast2023
reviewed
Feb 2, 2024
ast2023
reviewed
Feb 2, 2024
ast2023
reviewed
Feb 2, 2024
ast2023
approved these changes
Feb 28, 2024
stephanos
approved these changes
Mar 2, 2024
stephanos
pushed a commit
to stephanos/temporal
that referenced
this pull request
Mar 21, 2024
…#5381) ## What changed? After an update, the schedule workflow goes back to the previous action/update and recomputes next times from there, instead of from the current time. ## Why? Fixes temporalio#4866. Fixes the situation where an update happens in between the nominal time and the jittered time of an action. ## How did you test it? new unit test
dnr
added a commit
to dnr/temporal
that referenced
this pull request
Apr 9, 2024
dnr
added a commit
that referenced
this pull request
Apr 10, 2024
…5698) ## What changed? Activate schedule workflow logic changes. Some of this code is not in 1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to the 1.23 series. ## Why? Fix bugs, support new features, make more efficient. ## How did you test it? existing tests (on those PRs) ## Potential risks schedule workflow determinism errors
dnr
added a commit
that referenced
this pull request
Apr 10, 2024
…5698) ## What changed? Activate schedule workflow logic changes. Some of this code is not in 1.23.0, but we can patch those PRs into 1.23.1 to support downgrades to the 1.23 series. ## Why? Fix bugs, support new features, make more efficient. ## How did you test it? existing tests (on those PRs) ## Potential risks schedule workflow determinism errors
yycptt
pushed a commit
that referenced
this pull request
Apr 27, 2024
## What changed? After an update, the schedule workflow goes back to the previous action/update and recomputes next times from there, instead of from the current time. ## Why? Fixes #4866. Fixes the situation where an update happens in between the nominal time and the jittered time of an action. ## How did you test it? new unit test
dnr
added a commit
that referenced
this pull request
May 31, 2024
## What changed? In the schedule workflow, we should always reprocess from the last action in order to handle getting woken up by a signal in between the nominal time and actual time. This is basically what #5381 should have been in the first place, and applies that logic to all iterations. Note that this adds the logic but doesn't activate it yet. ## Why? Fixes bug: signals (including refresh) in between nominal and actual time could lead to dropped actions. This only happens if the cache runs out or we do a CaN at just the right time, so it's not that easy to reproduce. This has also blocked activating #5799 since that makes this bug more likely. ## How did you test it? Added new unit test and replay test. Reproduced locally by disabling cache and sending frequent signals to try to disrupt a schedule. Verified that new version did not drop actions. ## Potential risks This changes workflow logic, but is pretty easy to see that the old control flow is unaffected. There's one more potential situation which isn't always handled correctly: unpause in between nominal and actual times will usually run the jittered action (this is probably the less surprising behavior), but rarely, if a CaN happens at the same time, it can get skipped because the cache will be regenerated. --------- Co-authored-by: Rodrigo Zhou <rodrigozhou@users.noreply.github.com>
pdoerner
pushed a commit
that referenced
this pull request
May 31, 2024
## What changed? In the schedule workflow, we should always reprocess from the last action in order to handle getting woken up by a signal in between the nominal time and actual time. This is basically what #5381 should have been in the first place, and applies that logic to all iterations. Note that this adds the logic but doesn't activate it yet. ## Why? Fixes bug: signals (including refresh) in between nominal and actual time could lead to dropped actions. This only happens if the cache runs out or we do a CaN at just the right time, so it's not that easy to reproduce. This has also blocked activating #5799 since that makes this bug more likely. ## How did you test it? Added new unit test and replay test. Reproduced locally by disabling cache and sending frequent signals to try to disrupt a schedule. Verified that new version did not drop actions. ## Potential risks This changes workflow logic, but is pretty easy to see that the old control flow is unaffected. There's one more potential situation which isn't always handled correctly: unpause in between nominal and actual times will usually run the jittered action (this is probably the less surprising behavior), but rarely, if a CaN happens at the same time, it can get skipped because the cache will be regenerated. --------- Co-authored-by: Rodrigo Zhou <rodrigozhou@users.noreply.github.com>
lina-temporal
added a commit
that referenced
this pull request
Jun 14, 2024
… update time, respect RemainingActions counter (#6122) ## What changed/why? In #6028/#5381, our scheduler's run loop accounted for update requests that occur between nominal and actual/jittered times. The run loop will correctly skip over action times who are scheduled prior to the schedule's update time (such as when a schedule is recalculated to a new spec). This PR updates `getFutureActionTimes` (used as part of the scheduler's describe and list operations) to account for action times preceding the update time, as well as to properly reflect the number of remaining actions in a schedule. ## How did you test it? - Updated unit tests for both limited actions (`TestLimitedActions`) as well as filtered update times (`TestUpdateNotRetroactive`). `TestUpdateNotRetroactive` is the test that goes red when the [parallel condition in the scheduler](https://github.com/temporalio/temporal/blob/44a6fe777e1dc950dd278cb1b7f6ffc8b3ba4328/service/worker/scheduler/workflow.go#L655-L660) is commented out, so it seemed the appropriate place to also test `getFutureActionTimes` lines up with the fix in the run loop. - Verified both updated tests only pass when new version (`AccurateFutureActionTimes`) is set; verified no other tests break (`make test-unit`) ## Potential risks - We break a use case that relied on speculative future action times/a constant number of action times returned ## Documentation N/A ## Is hotfix candidate? No
stephanos
pushed a commit
that referenced
this pull request
Jun 14, 2024
… update time, respect RemainingActions counter (#6122) ## What changed/why? In #6028/#5381, our scheduler's run loop accounted for update requests that occur between nominal and actual/jittered times. The run loop will correctly skip over action times who are scheduled prior to the schedule's update time (such as when a schedule is recalculated to a new spec). This PR updates `getFutureActionTimes` (used as part of the scheduler's describe and list operations) to account for action times preceding the update time, as well as to properly reflect the number of remaining actions in a schedule. ## How did you test it? - Updated unit tests for both limited actions (`TestLimitedActions`) as well as filtered update times (`TestUpdateNotRetroactive`). `TestUpdateNotRetroactive` is the test that goes red when the [parallel condition in the scheduler](https://github.com/temporalio/temporal/blob/44a6fe777e1dc950dd278cb1b7f6ffc8b3ba4328/service/worker/scheduler/workflow.go#L655-L660) is commented out, so it seemed the appropriate place to also test `getFutureActionTimes` lines up with the fix in the run loop. - Verified both updated tests only pass when new version (`AccurateFutureActionTimes`) is set; verified no other tests break (`make test-unit`) ## Potential risks - We break a use case that relied on speculative future action times/a constant number of action times returned ## Documentation N/A ## Is hotfix candidate? No
ychebotarev
pushed a commit
that referenced
this pull request
Jun 25, 2024
… update time, respect RemainingActions counter (#6122) ## What changed/why? In #6028/#5381, our scheduler's run loop accounted for update requests that occur between nominal and actual/jittered times. The run loop will correctly skip over action times who are scheduled prior to the schedule's update time (such as when a schedule is recalculated to a new spec). This PR updates `getFutureActionTimes` (used as part of the scheduler's describe and list operations) to account for action times preceding the update time, as well as to properly reflect the number of remaining actions in a schedule. ## How did you test it? - Updated unit tests for both limited actions (`TestLimitedActions`) as well as filtered update times (`TestUpdateNotRetroactive`). `TestUpdateNotRetroactive` is the test that goes red when the [parallel condition in the scheduler](https://github.com/temporalio/temporal/blob/44a6fe777e1dc950dd278cb1b7f6ffc8b3ba4328/service/worker/scheduler/workflow.go#L655-L660) is commented out, so it seemed the appropriate place to also test `getFutureActionTimes` lines up with the fix in the run loop. - Verified both updated tests only pass when new version (`AccurateFutureActionTimes`) is set; verified no other tests break (`make test-unit`) ## Potential risks - We break a use case that relied on speculative future action times/a constant number of action times returned ## Documentation N/A ## Is hotfix candidate? No
# for free
to join this conversation on GitHub.
Already have an account?
# to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changed?
After an update, the schedule workflow goes back to the previous action/update and recomputes next times from there, instead of from the current time.
Why?
Fixes #4866. Fixes the situation where an update happens in between the nominal time and the jittered time of an action.
How did you test it?
new unit test