feat: pebble check --refresh #577

Open · wants to merge 116 commits into master

Conversation

@IronCore864 (Contributor) commented Feb 26, 2025

Run a check immediately with pebble check --refresh <check>.

A new API endpoint, POST /v1/checks/refresh, is added with the admin access level.
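
For reference, here is a minimal sketch of calling the new endpoint directly over Pebble's unix socket from Go. The socket path and the request body shape are assumptions for illustration (they are not taken from this PR), and since the endpoint has the admin access level, the daemon will only accept suitably privileged callers.

package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
)

func main() {
	// Assumed default socket location; adjust to your $PEBBLE directory.
	socket := "/var/lib/pebble/default/.pebble.socket"
	client := &http.Client{
		Transport: &http.Transport{
			// Dial the unix socket instead of TCP; the URL host is ignored.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}
	// Assumed payload shape: the check to refresh, identified by name.
	body := strings.NewReader(`{"name": "chk1"}`)
	resp, err := client.Post("http://localhost/v1/checks/refresh", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}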

Some manual tests:

$ # check success
$ pebble check chk1
name: chk1
startup: enabled
status: up
failures: 0
threshold: 3
change-id: "1"

$ # check failures > 0
$ pebble check chk2
name: chk2
startup: enabled
status: down
failures: 8
threshold: 3
change-id: "4"
logs: |
    2025-03-05T16:35:02+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:07+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:12+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:17+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:22+08:00 ERROR non-2xx status code 500; Health check failed

$ # refresh success
$ pebble check chk1 --refresh
name: chk1
startup: enabled
status: up
failures: 0
threshold: 3
change-id: "1"

$ # refresh failure
$ pebble check chk2 --refresh
name: chk2
startup: enabled
status: down
failures: 11
threshold: 3
change-id: "4"
error: non-2xx status code 500
logs: |
    2025-03-05T16:35:02+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:07+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:12+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:17+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:22+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:27+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:32+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:35+08:00 ERROR non-2xx status code 500; Health check failed
$ echo $?
0

$ # refresh some other error
$ pebble check chk3 --refresh
name: chk3
startup: enabled
status: up
failures: 2
threshold: 3
change-id: "3"
error: check timed out after 1s
logs: |
    2025-03-05T16:38:32+08:00 ERROR check timed out after 1s

IronCore864 changed the title from "[WIP] feat: pebble check --refresh" to "feat: pebble check --refresh" on Mar 5, 2025
IronCore864 marked this pull request as ready for review on March 5, 2025 at 08:45
IronCore864 requested a review from benhoyt on March 5, 2025 at 08:45
IronCore864 requested a review from benhoyt on March 10, 2025 at 08:54
@benhoyt (Contributor) commented Mar 10, 2025

Hmm, oddly, go test -race is failing in CI (but not locally, at least for me), see failing run. I've re-run it but it still fails, so I think we should investigate more deeply. It doesn't look like it's related to these changes, but maybe it is.

Otherwise, I'm happy with this PR. I do have a design question about the error details in the --refresh case, because when the check is disabled it doesn't go via changes&tasks, so no logs are shown, and the error message is less helpful. We need to figure out a way to make it helpful all the time. I have some ideas I've been playing with locally, but let's chat after daily sync.

@hpidcock (Member) commented Mar 11, 2025

> Hmm, oddly, go test -race is failing in CI (but not locally, at least for me), see failing run. I've re-run it but it still fails, so I think we should investigate more deeply. It doesn't look like it's related to these changes, but maybe it is.
>
> Otherwise, I'm happy with this PR. I do have a design question about the error details in the --refresh case, because when the check is disabled it doesn't go via changes&tasks, so no logs are shown, and the error message is less helpful. We need to figure out a way to make it helpful all the time. I have some ideas I've been playing with locally, but let's chat after daily sync.

Using:

$ cd internals/daemon
$ go install golang.org/x/tools/cmd/stress@latest
$ go test -c -race
$ stress ./daemon.test

I get the race reliably on go 1.24.1.

@hpidcock (Member) left a comment

Looks pretty good, mostly happy with the approach. It is a little strange that a stopped check is processed in a different way. But I guess fundamentally they both serve the same purpose; it's just that in one case the user may be unaware that the check is stopped.

Comment on lines 121 to 129
select {
case info := <-refresh:
	// If refresh requested while running check, send result.
	select {
	case info.result <- err:
	case <-info.ctx.Done():
	}
default:
}
@hpidcock (Member):

This is an interesting optimisation that highlights a potential issue with multiple requests to refresh the same check.

Since all refreshes for the same check are queued, there is a case where multiple requests queue up, causing this to behave rather strangely.

I suggest something like this:

	respondToRequests := func(result error) {
		timeout := time.NewTimer(time.Millisecond)
		for {
			select {
			case <-timeout.C:
				return
			case info := <-refresh:
				select {
				case info.result <- result:
				case <-info.ctx.Done():
				}
			case <-tomb.Dying():
				timeout.Stop()
				return
			}
		}
	}

	for {
		select {
		case info := <-refresh:
			// Reset ticker on refresh.
			ticker.Reset(config.Period.Value)
			shouldExit, err := performCheck()
			select {
			case info.result <- err:
			case <-info.ctx.Done():
			}
			respondToRequests(err)
			if shouldExit {
				return err
			}
		case <-ticker.C:
			shouldExit, err := performCheck()
			respondToRequests(err)
			if shouldExit {
				return err
			}
		case <-tomb.Dying():
			return checkStopped(config.Name, task.Kind(), tomb.Err())
		}
	}

@benhoyt (Contributor):

Interesting idea to empty the refresh request queue (or at least try to for 1ms). I'm not entirely sure it's worth it, though. Realistically there will only ever be one request waiting, because it's likely from a CLI user, and if there are more, the check just gets run again.

But if we do this, what about just looping till the channel is empty, something like this?

emptyRefreshQueue := func(result error) {
	for {
		select {
		case info := <-refresh:
			select {
			case info.result <- result:
			case <-info.ctx.Done():
			}
		default:
			return
		}
	}
}

@hpidcock (Member):

If info.result was buffered instead, then I'd be OK with a potentially infinite loop. But because it is unbuffered, there is a greater possibility that this becomes an infinite loop.

So setting an upper bound based on time is appropriate IMO, or it could be an upper bound on the number of iterations. Either works for me.

But logically, now that I think about it, it is only appropriate to return a result to a request that happens-before the performCheck starts.

So perhaps for now, we go with the simplest approach, which is:

	for {
		select {
		case info := <-refresh:
			// Reset ticker on refresh.
			ticker.Reset(config.Period.Value)
			shouldExit, err := performCheck()
			select {
			case info.result <- err:
			case <-info.ctx.Done():
			}
			if shouldExit {
				return err
			}
		case <-ticker.C:
			shouldExit, err := performCheck()
			if shouldExit {
				return err
			}
		case <-tomb.Dying():
			return checkStopped(config.Name, task.Kind(), tomb.Err())
		}
	}

@benhoyt (Contributor):

I guess that makes sense: they're really asking for a new refresh to start after they asked. Okay, that's simplest anyway -- can you please make that change @IronCore864?

@IronCore864 (Contributor, Author):

I have refactored it according to the above discussion.

Regarding the blocking issue, I don't think it's a big problem because, as Ben mentioned, it's only used from the CLI, and even if there are two refreshes via the CLI, both get the result; it's just that the first request blocks the second, so the second takes longer to get a response. In fact, at the beginning I thought of something like a request ID to support multiple refreshes at the same time, but in the end I don't think it adds much value, so I used an unbuffered channel.
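
For context on the buffered vs unbuffered point above, the value sent over the refresh channel, as implied by the snippets (info.ctx, info.result), looks roughly like this; the struct name, field names, and package are assumptions, not the actual code:

package checkstate // hypothetical package name, for illustration only

import "context"

// refreshInfo sketches the request a refresh caller sends to a check's runner;
// names and fields are inferred from the review snippets above.
type refreshInfo struct {
	ctx    context.Context // lets a cancelled or timed-out requester stop waiting
	result chan error      // unbuffered: the runner blocks until the requester receives
}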

@IronCore864 (Contributor, Author) commented Mar 11, 2025

> Hmm, oddly, go test -race is failing in CI (but not locally, at least for me), see failing run. I've re-run it but it still fails, so I think we should investigate more deeply. It doesn't look like it's related to these changes, but maybe it is.
> Otherwise, I'm happy with this PR. I do have a design question about the error details in the --refresh case, because when the check is disabled it doesn't go via changes&tasks, so no logs are shown, and the error message is less helpful. We need to figure out a way to make it helpful all the time. I have some ideas I've been playing with locally, but let's chat after daily sync.
>
> Using:
>
> $ cd internals/daemon
> $ go install golang.org/x/tools/cmd/stress@latest
> $ go test -c -race
> $ stress ./daemon.test
>
> I get the race reliably on go 1.24.1.

Thanks for the help! According to the log, the race appears in the TestStateChange test, so at first I only stressed that test and related ones and could not reproduce it. If I stress the whole daemon test suite, I can also reproduce it now.

After checking the code, I think the race is because timeNow is a global var accessed by both changeStatus (read, and in a few other functions) and tests (write, in FakeTime) without synchronization. The simplest solution is to add something like timeNowMutex and lock it in FakeTime?
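
A minimal sketch of that mutex idea, assuming a package-level timeNow variable and a FakeTime test helper roughly along these lines (names, signatures, and package are illustrative, not the actual Pebble code):

package state // illustrative package name

import (
	"sync"
	"time"
)

var (
	timeNowMutex sync.Mutex
	timeNow      = time.Now
)

// now is called instead of reading timeNow directly, so readers such as
// changeStatus and writers such as FakeTime are synchronized.
func now() time.Time {
	timeNowMutex.Lock()
	defer timeNowMutex.Unlock()
	return timeNow()
}

// FakeTime makes the package report a fixed time and returns a restore func.
func FakeTime(t time.Time) (restore func()) {
	timeNowMutex.Lock()
	old := timeNow
	timeNow = func() time.Time { return t }
	timeNowMutex.Unlock()
	return func() {
		timeNowMutex.Lock()
		timeNow = old
		timeNowMutex.Unlock()
	}
}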

@benhoyt (Contributor) commented Mar 11, 2025

> If I stress the whole daemon test suite, I can also reproduce it now.
>
> After checking the code, I think the race is because timeNow is a global var accessed by both changeStatus (read, and in a few other functions) and tests (write, in FakeTime) without synchronization. The simplest solution is to add something like timeNowMutex and lock it in FakeTime?

I can't even repro it locally under stress when running the whole daemon test suite. However, that's a clue -- it is because timeNow is a global var being accessed by the tests and by changeStatus (via the overlord loop). The overlord loop is only being started manually by some tests -- see apiSuite.startOverlord. If I add s.startOverlord() right under the d := s.daemon(c) line in TestStateChange, then just running that individual test (not under stress) with -race reproduces the data race pretty reliably for me:

go test -race -v ./internals/daemon/ -check.vv -check.f '^TestStateChange$'

That means that it must be one of the other daemon tests starting the overlord loop interfering with this test. I don't quite understand why, as apiSuite.TearDownTest stops the overlord and waits for the overlord loop to finish. Oh wait, I guess it's because TaskRunner.run fires up another goroutine ... and that's the one that's calling changeStatus, and we're not waiting for that goroutine to finish.

Either way, it'd be nice if we can solve it by having the other tests clean up properly after themselves, rather than wrapping all the timeNow() calls in a mutex.

@IronCore864 (Contributor, Author):

After some investigation, I think other tests shouldn't, in theory, interfere with TestStateChange (and with tests that call FakeTime/write timeNow, for that matter) as long as the overlord loop is started by startOverlord, because in that case a flag is set, and then in apiSuite.TearDownTest, Overlord.Stop is called, which in turn calls TaskRunner.Stop, meaning the cleanup is already done properly.

Adding startOverlord in TestStateChange (in the same test where FakeTime is called) can reproduce the race condition because, in this case, the deferred write to timeNow happens before the teardown where the overlord/taskrunner is stopped, so in theory the taskrunner can still call changeStatus while timeNow is being written.

I did find two test cases in the same package/test suite that start the overlord loop without using startOverlord, and I hope fixing those fixes the issue. I'm not 100% sure about this, though, because the original issue doesn't always occur, even in GitHub Actions. Anyway, with a lot of stress testing (50 min, 6 failures, all unrelated to the race condition) and -race runs (50 of them) locally, plus a few reruns in GitHub Actions, I couldn't reproduce it for now.
