Don't fail checkin if agent has upgrade details with no action #3991
This pull request does not have a backport label. Could you fix it @jillguyonnet? 🙏
After chatting about this with @cmacknz, we would like to change this behaviour so that checkin neither fails nor returns a 5xx when this happens. At most, a warning should be logged.
Thanks @jlind23 - I'm not sure how to test this properly yet, but I pushed a commit as it seems like a minimal change.
I would like to at least get @michel-laterman's eyes on this before merging.
We should add a changelog entry for this.
We can also add an e2e test: enrol the agent, then check in with upgrade_details where we specify the action id.
To add some context for why I suggested this, if you run We need the system to keep operating normally if this happens, this should not be a way to block agent from successfully checking in. Taking a larger step back, if an agent says it is updating, and Fleet doesn't expect that, the agent should probably be flagged with an "unexpected update" status or something in the Fleet UI. Most users are not going to look in the Fleet Server logs to see things like this. We could also arguably prevent using
Thanks @michel-laterman - I've added a changelog entry. I've tried adding an e2e test but I'm running into issues so far. I'll reach out to you about this.
Thank you @cmacknz for the added context; it definitely makes sense that a nonexistent agent action should not cause the checkin to fail. On the serverless project where the issue is happening, the agentless agent version was updated as part of a new k8s deployment running a newer version of the agent. IIUC, because this is not a Fleet-managed upgrade process, it could have resulted in agents with upgrade details but no agent actions. I'm waiting to check the proposed fix implementation with @michel-laterman. It sounds like it would be worth capturing your last observation (handling non-Fleet-managed upgrades) somewhere else, perhaps in an ingest-dev issue?
Containers don't trigger the upgrade process, and you can't send upgrade actions to agents in containers because they report themselves (or should) as non-upgradable. You can exec into a container and run I am a little bit hesitant to start forbidding use of the
changelog/fragments/1728652985-Dont-fail-checkin-if-upgrade-action-not-found.yaml
5748a26 to 4b95a61
4b95a61 to de405dd
For the time being I'm using the CI to run my e2e test, as the e2e tests are not running properly locally; this might take a few commits.
Quality Gate passed
lgtm
thanks for adding the e2e test!
Don't fail checkin if agent has upgrade details with no action (cherry picked from commit 9dc43b0)
Should we have this fix also in 8.16?
@pierrehilbert Yes, it would make sense. Does the
No, we already created the 8.16 branch, I'm adding the backport to 8.16, thanks for confirming :-)
What is the problem this PR solves?
Relates https://github.com/elastic/ingest-dev/issues/4223
We're investigating an issue in a serverless project where Fleet Server keeps throwing 500 on agent checkin because the agent has upgrade details but the action id they refer to is not found.
How does this PR solve the problem?
Modify `Checkin.processUpgradeDetails` so that checkin does not fail in this scenario and logs a warning instead.
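The intended behaviour can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual Fleet Server code: `ErrActionNotFound`, `ActionStore`, `mapStore`, the `UpgradeDetails` struct and the function signature are all hypothetical, and treating other lookup errors as failures is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrActionNotFound is a hypothetical sentinel error for this sketch.
var ErrActionNotFound = errors.New("action not found")

// UpgradeDetails is a simplified stand-in for what an agent reports.
type UpgradeDetails struct {
	ActionID string
	State    string
}

// ActionStore is a hypothetical lookup interface for upgrade actions.
type ActionStore interface {
	FindAction(id string) error
}

// mapStore is an in-memory ActionStore used only for illustration.
type mapStore map[string]bool

func (m mapStore) FindAction(id string) error {
	if !m[id] {
		return ErrActionNotFound
	}
	return nil
}

// processUpgradeDetails reports whether a warning was logged. A missing
// action is logged and tolerated; any other lookup error is returned
// (an assumption of this sketch).
func processUpgradeDetails(store ActionStore, agentID string, d *UpgradeDetails) (bool, error) {
	if d == nil {
		return false, nil // nothing to process
	}
	err := store.FindAction(d.ActionID)
	if errors.Is(err, ErrActionNotFound) {
		log.Printf("WARN: agent %s has upgrade details for unknown action %q; continuing checkin", agentID, d.ActionID)
		return true, nil
	}
	if err != nil {
		return false, fmt.Errorf("find upgrade action: %w", err)
	}
	return false, nil
}

func main() {
	store := mapStore{"known-action": true}
	warned, err := processUpgradeDetails(store, "agent-1", &UpgradeDetails{ActionID: "missing", State: "UPG_WATCHING"})
	fmt.Println(warned, err)
}
```

The point of the design is that a missing action is the agent's state being out of sync with Fleet, not a server fault, so it should not surface as a 500.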