feat: pebble check --refresh #577
Conversation
Hmm, oddly … Otherwise, I'm happy with this PR. I do have a design question about the error details in the …
Using … I get the race reliably on Go 1.24.1.
Looks pretty good, mostly happy with the approach. It is a little strange that a stopped check is processed in a different way. But I guess fundamentally they both serve the same purpose; it's just that in one case the user may be unaware that the check is stopped.
```go
select {
case info := <-refresh:
	// If refresh requested while running check, send result.
	select {
	case info.result <- err:
	case <-info.ctx.Done():
	}
default:
}
```
This is an interesting optimisation that highlights a potential issue with multiple requests to refresh the same check.
Since all refreshes of this same check are queued, multiple requests can pile up, causing this to behave rather strangely.
I suggest something like this:
```go
respondToRequests := func(result error) {
	timeout := time.NewTimer(time.Millisecond)
	for {
		select {
		case <-timeout.C:
			return
		case info := <-refresh:
			select {
			case info.result <- result:
			case <-info.ctx.Done():
			}
		case <-tomb.Dying():
			timeout.Stop()
			return
		}
	}
}

for {
	select {
	case info := <-refresh:
		// Reset ticker on refresh.
		ticker.Reset(config.Period.Value)
		shouldExit, err := performCheck()
		select {
		case info.result <- err:
		case <-info.ctx.Done():
		}
		respondToRequests(err)
		if shouldExit {
			return err
		}
	case <-ticker.C:
		shouldExit, err := performCheck()
		respondToRequests(err)
		if shouldExit {
			return err
		}
	case <-tomb.Dying():
		return checkStopped(config.Name, task.Kind(), tomb.Err())
	}
}
```
Interesting idea to empty the refresh request queue (or at least try to for 1ms). I'm not entirely sure it's worth it, though. It seems like realistically there will only ever be one request waiting, because it's likely from a CLI user, and if there are more, the check just gets run again.
But if we do this, what about just looping until the channel is empty? Something like this:
```go
emptyRefreshQueue := func(result error) {
	for {
		select {
		case info := <-refresh:
			select {
			case info.result <- result:
			case <-info.ctx.Done():
			}
		default:
			return
		}
	}
}
```
If `info.result` was buffered instead, then I'd be OK with a potentially infinite loop. But because it is unbuffered, there is a greater possibility that this loops forever.
So setting an upper bound based on time is appropriate IMO, or it could be an upper bound on the number of iterations. Either works for me.
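For illustration, an iteration-bounded drain along those lines might look like the sketch below. It reuses the `refresh`, `tomb`, and `info` names from the snippets above; the bound itself is an arbitrary parameter, not something from the PR:

```go
// Sketch: respond to at most maxPending queued refresh requests,
// then stop draining. Each send on the unbuffered info.result can
// block until the requester receives or its context is cancelled,
// so the iteration bound keeps this from looping indefinitely.
drainRequests := func(result error, maxPending int) {
	for i := 0; i < maxPending; i++ {
		select {
		case info := <-refresh:
			select {
			case info.result <- result:
			case <-info.ctx.Done():
			}
		case <-tomb.Dying():
			return
		default:
			return
		}
	}
}
```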
But logically, now that I think about it, it is only appropriate to return a result to a request that happens-before `performCheck` starts.
So perhaps for now we go with the simplest approach, which is:
```go
for {
	select {
	case info := <-refresh:
		// Reset ticker on refresh.
		ticker.Reset(config.Period.Value)
		shouldExit, err := performCheck()
		select {
		case info.result <- err:
		case <-info.ctx.Done():
		}
		if shouldExit {
			return err
		}
	case <-ticker.C:
		shouldExit, err := performCheck()
		if shouldExit {
			return err
		}
	case <-tomb.Dying():
		return checkStopped(config.Name, task.Kind(), tomb.Err())
	}
}
```
I guess that makes sense: they're really asking for a new refresh to start after they asked. Okay, that's simplest anyway -- can you please make that change @IronCore864?
I have refactored it according to the above discussion.
Regarding the blocking issue, I don't think it's a big problem because, as Ben mentioned, it's only used from the CLI, and even if there are two refreshes from the CLI, both still get a result; it's just that the first request blocks the second, so the second takes longer to get a response. In fact, at the beginning I considered something like a request ID to support multiple refreshes at the same time, but in the end I don't think it adds much value, so I used an unbuffered channel.
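For context, the requester side of this protocol would look roughly like the sketch below. This is an assumption based on the `info.result` and `info.ctx` fields in the snippets above (imports elided, as in the other snippets); the actual type and channel names in the PR may differ:

```go
// Sketch of the requester side: send a request on the check's
// refresh channel, then block until the checker replies or the
// context is cancelled. Because result is unbuffered and there is
// one refresh channel per check, concurrent requests are simply
// served one after another.
type refreshInfo struct {
	result chan error
	ctx    context.Context
}

func requestRefresh(ctx context.Context, refresh chan<- refreshInfo) error {
	info := refreshInfo{result: make(chan error), ctx: ctx}
	select {
	case refresh <- info:
	case <-ctx.Done():
		return ctx.Err()
	}
	select {
	case err := <-info.result:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}
```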
Thanks for the help! According to the log, the race appears in … After checking the code, I think the race is because …
I can't even repro it locally under stress when running the whole daemon test suite. However, that's a clue -- it is because …
That means it must be one of the other daemon tests starting the overlord loop and interfering with this test. I don't quite understand why, as … Either way, it'd be nice if we can solve it by having the other tests clean up properly after themselves, rather than wrapping all the …
After some investigation, I think other tests, in theory, shouldn't interfere with … Adding … I did find two test cases in the same package/test suite that start the overlord loop without using …
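For what it's worth, the kind of cleanup being suggested might look like this hedged sketch. The test and helper names are hypothetical, and it only assumes the overlord type exposes `Loop` and `Stop` methods:

```go
// Hypothetical sketch: a test that starts the overlord loop should
// also stop it, so its goroutines can't outlive the test and race
// with later tests in the same package.
func (s *daemonSuite) TestWithOverlordLoop(c *C) {
	d := s.newDaemon(c) // hypothetical helper
	d.overlord.Loop()
	defer func() {
		c.Assert(d.overlord.Stop(), IsNil)
	}()
	// ... exercise behaviour that needs the loop running ...
}
```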
Run a check immediately with `pebble check --refresh <check>`. A new API endpoint `/v1/checks/refresh` with `POST` is added with the admin access level.

Some manual tests:
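(The manual test transcript isn't captured in this excerpt. For illustration only, exercising the feature might look like the lines below; the check name, socket path, and request body are assumptions rather than details from the PR.)

```
$ pebble check --refresh chk1
$ curl --unix-socket "$PEBBLE_SOCKET" -X POST \
    -d '{"name": "chk1"}' http://localhost/v1/checks/refresh
```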