feat: pebble check --refresh #577

Open · wants to merge 116 commits into master

Conversation

@IronCore864 (Contributor) commented Feb 26, 2025

Run a check immediately with pebble check --refresh <check>.

A new API endpoint, POST /v1/checks/refresh, is added with the admin access level.
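
For reference, here is a minimal sketch of calling the new endpoint directly over Pebble's unix socket from Go. The socket path and the request body shape are assumptions for illustration (they are not taken from this PR), and since the endpoint has the admin access level, the daemon will only accept suitably privileged callers.

package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"strings"
)

func main() {
	// Assumed default socket location; adjust to your $PEBBLE directory.
	socket := "/var/lib/pebble/default/.pebble.socket"
	client := &http.Client{
		Transport: &http.Transport{
			// Dial the unix socket instead of TCP; the URL host is ignored.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}
	// Assumed payload shape: the check to refresh, identified by name.
	body := strings.NewReader(`{"name": "chk1"}`)
	resp, err := client.Post("http://localhost/v1/checks/refresh", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}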

Some manual tests:

$ # check success
$ pebble check chk1
name: chk1
startup: enabled
status: up
failures: 0
threshold: 3
change-id: "1"

$ # check failures > 0
$ pebble check chk2
name: chk2
startup: enabled
status: down
failures: 8
threshold: 3
change-id: "4"
logs: |
    2025-03-05T16:35:02+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:07+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:12+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:17+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:22+08:00 ERROR non-2xx status code 500; Health check failed

$ # refresh success
$ pebble check chk1 --refresh
name: chk1
startup: enabled
status: up
failures: 0
threshold: 3
change-id: "1"

$ # refresh failure
$ pebble check chk2 --refresh
name: chk2
startup: enabled
status: down
failures: 11
threshold: 3
change-id: "4"
error: non-2xx status code 500
logs: |
    2025-03-05T16:35:02+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:07+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:12+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:17+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:22+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:27+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:32+08:00 ERROR non-2xx status code 500; Health check failed
    2025-03-05T16:35:35+08:00 ERROR non-2xx status code 500; Health check failed
$ echo $?
0

$ # refresh some other error
$ pebble check chk3 --refresh
name: chk3
startup: enabled
status: up
failures: 2
threshold: 3
change-id: "3"
error: check timed out after 1s
logs: |
    2025-03-05T16:38:32+08:00 ERROR check timed out after 1s

IronCore864 changed the title from "[WIP] feat: pebble check --refresh" to "feat: pebble check --refresh" on Mar 5, 2025
IronCore864 marked this pull request as ready for review on March 5, 2025 at 08:45
IronCore864 requested a review from benhoyt on March 5, 2025 at 08:45
IronCore864 requested a review from benhoyt on March 10, 2025 at 08:54
@benhoyt (Contributor) commented Mar 10, 2025

Hmm, oddly, go test -race is failing in CI (but not locally, at least for me), see failing run. I've re-run it but it still fails, so I think we should investigate more deeply. It doesn't look like it's related to these changes, but maybe it is.

Otherwise, I'm happy with this PR. I do have a design question about the error details in the --refresh case, because when the check is disabled it doesn't go via changes&tasks, so no logs are shown, and the error message is less helpful. We need to figure out a way to make it helpful all the time. I have some ideas I've been playing with locally, but let's chat after daily sync.

@hpidcock (Member) commented Mar 11, 2025

> Hmm, oddly, go test -race is failing in CI (but not locally, at least for me), see failing run. I've re-run it but it still fails, so I think we should investigate more deeply. It doesn't look like it's related to these changes, but maybe it is.
>
> Otherwise, I'm happy with this PR. I do have a design question about the error details in the --refresh case, because when the check is disabled it doesn't go via changes&tasks, so no logs are shown, and the error message is less helpful. We need to figure out a way to make it helpful all the time. I have some ideas I've been playing with locally, but let's chat after daily sync.

Using:

$ cd internals/daemon
$ go install golang.org/x/tools/cmd/stress@latest
$ go test -c -race
$ stress ./daemon.test

I get the race reliably on go 1.24.1.

@hpidcock (Member) left a comment

Looks pretty good, mostly happy with the approach. It is a little strange that a stopped check is processed in a different way. But I guess fundamentally they both serve the same purpose; it's just that in one case the user may be unaware that the check is stopped.

Comment on lines 121 to 129
select {
case info := <-refresh:
	// If refresh requested while running check, send result.
	select {
	case info.result <- err:
	case <-info.ctx.Done():
	}
default:
}
@hpidcock (Member):

This is an interesting optimisation that highlights a potential issue with multiple requests to refresh the same check.

Since all refreshes for the same check are queued, there is a case where multiple requests queue up, causing this to behave rather strangely.

I suggest something like this:

	respondToRequests := func(result error) {
		timeout := time.NewTimer(time.Millisecond)
		for {
			select {
			case <-timeout.C:
				return
			case info := <-refresh:
				select {
				case info.result <- result:
				case <-info.ctx.Done():
				}
			case <-tomb.Dying():
				timeout.Stop()
				return
			}
		}
	}

	for {
		select {
		case info := <-refresh:
			// Reset ticker on refresh.
			ticker.Reset(config.Period.Value)
			shouldExit, err := performCheck()
			select {
			case info.result <- err:
			case <-info.ctx.Done():
			}
			respondToRequests(err)
			if shouldExit {
				return err
			}
		case <-ticker.C:
			shouldExit, err := performCheck()
			respondToRequests(err)
			if shouldExit {
				return err
			}
		case <-tomb.Dying():
			return checkStopped(config.Name, task.Kind(), tomb.Err())
		}
	}

@benhoyt (Contributor):

Interesting idea to empty the refresh request queue (or at least try to for 1ms). I'm not entirely sure it's worth it, though. Realistically there will only ever be one request waiting, because it's likely from a CLI user, and if there are more, the check just gets run again.

But if we do this, what about just looping till the channel is empty, something like this?

emptyRefreshQueue := func(result error) {
	for {
		select {
		case info := <-refresh:
			select {
			case info.result <- result:
			case <-info.ctx.Done():
			}
		default:
			return
		}
	}
}

@hpidcock (Member):

If info.result was buffered instead, then I'd be OK with a potentially infinite loop. But because it is unbuffered, there is a greater possibility that this becomes an infinite loop.

So setting an upper bound based on time is appropriate IMO, or it could be an upper bound on the number of iterations. Either works for me.

But logically, now that I think about it, it is only appropriate to return a result to a request that happens-before the performCheck starts.

So perhaps for now, we go with the simplest approach, which is:

	for {
		select {
		case info := <-refresh:
			// Reset ticker on refresh.
			ticker.Reset(config.Period.Value)
			shouldExit, err := performCheck()
			select {
			case info.result <- err:
			case <-info.ctx.Done():
			}
			if shouldExit {
				return err
			}
		case <-ticker.C:
			shouldExit, err := performCheck()
			if shouldExit {
				return err
			}
		case <-tomb.Dying():
			return checkStopped(config.Name, task.Kind(), tomb.Err())
		}
	}

@benhoyt (Contributor):

I guess that makes sense: they're really asking for a new refresh to start after they asked. Okay, that's simplest anyway -- can you please make that change @IronCore864?

@IronCore864 (Contributor, Author):

I have refactored it according to the above discussion.

Regarding the blocking issue, I don't think it's a big problem because, as Ben mentioned, it's only used from the CLI, and even if there are two refreshes via the CLI, both get the result; it's just that the first request blocks the second, so the second takes longer to get a response. In fact, at the beginning I thought of something like a request ID to support multiple refreshes at the same time, but in the end I don't think it adds much value, so I used an unbuffered channel.
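
For context on the buffered vs unbuffered point above, the value sent over the refresh channel, as implied by the snippets (info.ctx, info.result), looks roughly like this; the struct name, field names, and package are assumptions, not the actual code:

package checkstate // hypothetical package name, for illustration only

import "context"

// refreshInfo sketches the request a refresh caller sends to a check's runner;
// names and fields are inferred from the review snippets above.
type refreshInfo struct {
	ctx    context.Context // lets a cancelled or timed-out requester stop waiting
	result chan error      // unbuffered: the runner blocks until the requester receives
}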

@IronCore864 (Contributor, Author) commented Mar 11, 2025

> Hmm, oddly, go test -race is failing in CI (but not locally, at least for me), see failing run. I've re-run it but it still fails, so I think we should investigate more deeply. It doesn't look like it's related to these changes, but maybe it is.
> Otherwise, I'm happy with this PR. I do have a design question about the error details in the --refresh case, because when the check is disabled it doesn't go via changes&tasks, so no logs are shown, and the error message is less helpful. We need to figure out a way to make it helpful all the time. I have some ideas I've been playing with locally, but let's chat after daily sync.
>
> Using:
>
> $ cd internals/daemon
> $ go install golang.org/x/tools/cmd/stress@latest
> $ go test -c -race
> $ stress ./daemon.test
>
> I get the race reliably on go 1.24.1.

Thanks for the help! According to the log, the race appears in the TestStateChange test, so at first I only stressed that test and related ones and could not reproduce it. If I stress the whole daemon test suite, I can also reproduce it now.

After checking the code, I think the race is because timeNow is a global var accessed by both changeStatus (read, and in a few other functions) and tests (write, in FakeTime) without synchronization. The simplest solution is to add something like timeNowMutex and lock it in FakeTime?
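
A minimal sketch of that mutex idea, assuming a package-level timeNow variable and a FakeTime test helper roughly along these lines (names, signatures, and package are illustrative, not the actual Pebble code):

package state // illustrative package name

import (
	"sync"
	"time"
)

var (
	timeNowMutex sync.Mutex
	timeNow      = time.Now
)

// now is called instead of reading timeNow directly, so readers such as
// changeStatus and writers such as FakeTime are synchronized.
func now() time.Time {
	timeNowMutex.Lock()
	defer timeNowMutex.Unlock()
	return timeNow()
}

// FakeTime makes the package report a fixed time and returns a restore func.
func FakeTime(t time.Time) (restore func()) {
	timeNowMutex.Lock()
	old := timeNow
	timeNow = func() time.Time { return t }
	timeNowMutex.Unlock()
	return func() {
		timeNowMutex.Lock()
		timeNow = old
		timeNowMutex.Unlock()
	}
}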

@benhoyt (Contributor) commented Mar 11, 2025

> If I stress the whole daemon test suite, I can also reproduce it now.
>
> After checking the code, I think the race is because timeNow is a global var accessed by both changeStatus (read, and in a few other functions) and tests (write, in FakeTime) without synchronization. The simplest solution is to add something like timeNowMutex and lock it in FakeTime?

I can't even repro it locally under stress when running the whole daemon test suite. However, that's a clue -- it is because timeNow is a global var being accessed by the tests and by changeStatus (via the overlord loop). The overlord loop is only being started manually by some tests -- see apiSuite.startOverlord. If I add s.startOverlord() right under the d := s.daemon(c) line in TestStateChange, then just running that individual test (not under stress) with -race reproduces the data race pretty reliably for me:

go test -race -v ./internals/daemon/ -check.vv -check.f '^TestStateChange$'

That means that it must be one of the other daemon tests starting the overlord loop interfering with this test. I don't quite understand why, as apiSuite.TearDownTest stops the overlord and waits for the overlord loop to finish. Oh wait, I guess it's because TaskRunner.run fires up another goroutine ... and that's the one that's calling changeStatus, and we're not waiting for that goroutine to finish.

Either way, it'd be nice if we can solve it by having the other tests clean up properly after themselves, rather than wrapping all the timeNow() calls in a mutex.

@IronCore864 (Contributor, Author):

After some investigation, I think other tests shouldn't, in theory, interfere with TestStateChange (and with tests that call FakeTime/write timeNow, for that matter) as long as the overlord loop is started by startOverlord, because in that case a flag is set, and then in apiSuite.TearDownTest, Overlord.Stop is called, which in turn calls TaskRunner.Stop, meaning the cleanup is already done properly.

Adding startOverlord in TestStateChange (in the same test where FakeTime is called) can reproduce the race condition because, in this case, the deferred write to timeNow happens before the teardown where the overlord/taskrunner is stopped, so in theory the taskrunner can still call changeStatus while timeNow is being written.

I did find two test cases in the same package/test suite that start the overlord loop without using startOverlord, and I hope fixing those fixes the issue. I'm not 100% sure about this, though, because the original issue doesn't always occur, even in GitHub Actions. Anyway, with a lot of stress testing (50 min, 6 failures, all unrelated to the race condition) and -race runs (50 of them) locally, plus a few reruns in GitHub Actions, I couldn't reproduce it for now.
