fix: don't unnecessarily restart scheduler #689


@leg100 leg100 commented Oct 18, 2024

The scheduler ensures that at most one run per workspace is running at any time (the exception to this rule is "plan-only" runs, which we won't discuss here). A run begins in the pending state. It cannot transition to the next state, plan_queued, until:

(a) any other pending runs created before it have finished.
(b) its workspace is unlocked.

Once both conditions are true, the scheduler does the following:

(a) locks the workspace
(b) updates the status of the run to plan_queued
(c) sets the run as the "current run" of the workspace
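
The conditions and steps above can be sketched roughly as follows. This is a minimal illustration, not the actual scheduler code: the `Run`, `Workspace` and `queue` types and the `schedule` method are hypothetical names invented for this example, and the queue is assumed to hold pending runs in creation order.

```go
package main

// RunStatus is a hypothetical status type; only the two states named in
// the description are modelled here.
type RunStatus string

const (
	RunPending    RunStatus = "pending"
	RunPlanQueued RunStatus = "plan_queued"
)

// Run is a hypothetical, stripped-down run.
type Run struct {
	ID     string
	Status RunStatus
}

// Workspace is a hypothetical, stripped-down workspace.
type Workspace struct {
	Locked       bool
	CurrentRunID string
}

// queue holds the pending runs of a single workspace, oldest first
// (an assumption of this sketch).
type queue struct {
	ws      *Workspace
	pending []*Run
}

// schedule promotes the oldest pending run, provided the workspace is
// unlocked. It reports whether a run was promoted.
func (q *queue) schedule() bool {
	// A locked workspace means an earlier run is still in progress, so
	// no pending run may proceed yet.
	if q.ws.Locked || len(q.pending) == 0 {
		return false
	}
	// The head of the queue has no older pending runs ahead of it.
	run := q.pending[0]
	q.pending = q.pending[1:]

	// (a) lock the workspace
	q.ws.Locked = true
	// (b) update the status of the run to plan_queued
	run.Status = RunPlanQueued
	// (c) set the run as the "current run" of the workspace
	q.ws.CurrentRunID = run.ID
	return true
}
```

Calling `schedule` a second time while the workspace is still locked returns `false` and leaves the next run in the pending state.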

The scheduler is also responsible for unlocking the workspace once a run completes. One of the bugs identified by the user is a race condition that occurs whenever a run finishes and its workspace is immediately deleted (this is not an unusual scenario, for example when you're testing changes on an "ephemeral" workspace). The scheduler receives the "run completed" event, often after the workspace has already been deleted. Unaware of the deletion, it tries to unlock the workspace and receives an error. This shouldn't be an issue: the error is "workspace not found", and the scheduler should understand that this means the workspace has since been deleted, that no action need be taken, and move on. Instead it erroneously interprets it as a transient error, and backs off and retries. The fix here is clear.
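
A rough sketch of that fix, under stated assumptions: `ErrNotFound`, `ErrUnavailable`, `unlockFunc` and `handleRunCompleted` are hypothetical names for this example and are not taken from the codebase. The point is only that a "not found" error is treated as terminal and benign, while any other error is still returned so the caller can back off and retry.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel standing in for the
// "workspace not found" error.
var ErrNotFound = errors.New("workspace not found")

// ErrUnavailable is a hypothetical example of a genuinely transient error.
var ErrUnavailable = errors.New("database unavailable")

// unlockFunc is a hypothetical stand-in for whatever call unlocks a workspace.
type unlockFunc func(workspaceID string) error

// handleRunCompleted unlocks the workspace of a completed run. It returns
// nil if the workspace no longer exists, because there is then nothing
// left to unlock. Any other error is returned so that the caller can
// treat it as transient and retry.
func handleRunCompleted(unlock unlockFunc, workspaceID string) error {
	err := unlock(workspaceID)
	if errors.Is(err, ErrNotFound) {
		// The workspace was deleted after the run finished: no action
		// needed, and retrying would never succeed.
		return nil
	}
	if err != nil {
		return fmt.Errorf("unlocking workspace %s: %w", workspaceID, err)
	}
	return nil
}
```

Using `errors.Is` rather than a direct comparison means the check still works if the "not found" error has been wrapped on its way up.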

Another race condition occurs when the "run completed" event is received after the "workspace deleted" event. The scheduler processes the latter event and deletes its cached workspace accordingly. It then receives the "run completed" event, tries to look up the run's workspace in its cache, and cannot find it. It reports this as an error to the user, and moves on. The "fix" here is either to accept this as an entirely reasonable race condition and suppress the error message, or to make a change to ensure events are processed in order. In this case I've opted for the former.
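
The chosen approach might look something like the sketch below. The `scheduler` type, its map-based cache and the two handler methods are hypothetical names made up for this illustration; the real cache and event types will differ.

```go
package main

// scheduler is a hypothetical, minimal scheduler that caches the lock
// state of each workspace by ID.
type scheduler struct {
	workspaces map[string]bool // workspace ID -> locked
}

// onWorkspaceDeleted removes the workspace from the cache.
func (s *scheduler) onWorkspaceDeleted(workspaceID string) {
	delete(s.workspaces, workspaceID)
}

// onRunCompleted unlocks the cached workspace of a completed run and
// reports whether it did so. A cache miss is an expected outcome of the
// "workspace deleted" event arriving first, so it is a silent no-op
// rather than an error.
func (s *scheduler) onRunCompleted(workspaceID string) bool {
	if _, ok := s.workspaces[workspaceID]; !ok {
		return false
	}
	s.workspaces[workspaceID] = false
	return true
}
```

The alternative, processing events strictly in order, would avoid the cache miss altogether but is a larger change than suppressing a harmless message.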

#685

@leg100 leg100 merged commit d240965 into master Oct 18, 2024
5 checks passed
@leg100 leg100 deleted the fix-scheduler-unnecessarily-restarting branch October 18, 2024 14:21
leg100 pushed a commit that referenced this pull request Oct 22, 2024
🤖 I have created a release *beep* *boop*
---


## [0.3.0](v0.2.4...v0.3.0) (2024-10-22)


### ⚠ BREAKING CHANGES

* rename --address flag to --url; require scheme
* move to sqlc, tern ([#683](#683))

### refactor

* move to sqlc, tern ([#683](#683))
([878ebfb](878ebfb))
* rename --address flag to --url; require scheme
([3e83474](3e83474))


### Features

* add timeout settings for plans and applies
([#686](#686))
([797902b](797902b))
* allow subscription buffer size to be overridden
([#687](#687))
([d51469d](d51469d))


### Bug Fixes

* avoid hitting Github limit on commit status updates
([#688](#688))
([029e525](029e525))
* don't unnecessarily restart scheduler
([#689](#689))
([d240965](d240965))
* make linting and tests pass
([ebc1e53](ebc1e53))
* pin version of gcp pub-sub emulator docker image
([8048e72](8048e72))
* prevent subsystem failure from stopping otfd
([e5061b0](e5061b0))
* use base58 alphabet for resource IDs
([#680](#680))
([1e7d7a2](1e7d7a2))


### Miscellaneous

* bump go
([5663eab](5663eab))
* trigger new version of agent chart upon deploy
([#690](#690))
([155e026](155e026))
* unarchive
([c954c36](c954c36))
* upgrade dependencies
([59eb979](59eb979))
* use forked sse lib's module path
([fc9b138](fc9b138))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>