-
Notifications
You must be signed in to change notification settings - Fork 643
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Separate linux/windows health checker files. Build health checker plugin for Windows #544
Separate linux/windows health checker files. Build health checker plugin for Windows #544
Conversation
Hi @mcshooter. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc jeremyje |
/sig windows |
@mcshooter Can you edit the description of the change to explain why this change is necessary? |
4e7fa06
to
da5fdc1
Compare
/cc @liyanhui1228 |
@Random-Liu: GitHub didn't allow me to request PR reviews from the following users: liyanhui1228. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one comment.
687d2e0
to
dad581d
Compare
I also added the respective windows health checker config files @jeremyje @liyanhui1228 |
// getUptimeFunc returns the time for which the given service has been running. | ||
func getUptimeFunc(service string) func() (time.Duration, error) { | ||
return func() (time.Duration, error) { | ||
// Using the WinEvent Log Objects to find the Service logs' time when the Service was last stopped/terminated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens for the first run of the service, and no stopped/terminated log exists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will only get executed when the the service detects it is unhealthy. So if the service runs and there are no logs that it detects that do not indicate the node is unhealthy, then this will not get executed. In other words, nothing happens yet, until some issue is detected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Service unhealthy != service goes down.
It is possible that docker is not healthy but still running, in that case will this function keep returning error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So long as docker.exe ps
does not return an error, the uptime function will not get executed. We test the heath of docker by executing docker and running a ps
command. But, as long as docker.exe ps
returns an error, then yes, this function will keep returning an error until it is repaired successfully. Does that answer what you are trying to ask?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uptime
would always be 0?
How could it be greater than hc.coolDownTime
, and will the repair ever happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If for some reason docker fails to execute docker.exe ps and is still actually running, healthchecker will indicate that the it is unhealthy and then yes, the uptime function will always be 0 because it did not find when docker actually stopped. Then in that case, the repair won't happen.
But we want the repair to happen in that case I think. If the repair won't happen in that case, it seems like a bug to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am still not quite sure how I would determine docker has entered that state. Also, I am just porting over the same way it is done in Linux. Would it make sense to create another work item to address this for both linux and windows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The systemd timestamp we use reflects the service start time:
The InactiveExitTimestamp tracks when a particular systemd unit transitions from the Inactive to Active state, which can be used to mark the beginning of systemd’s activation of cloud-init.
You should use the service start time instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. Let me make those changes to get the last start time instead of the last stop/terminate time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if the change completely addresses the original issue concern, but now, if docker.exe ps
doesn't work, but it's still running, when the uptime function queries the time, it will look for the last running start time. So, in that case, it should attempt to repair the service.
dad581d
to
bcba3af
Compare
This is for #461 |
bcba3af
to
5b466c2
Compare
"The underlying systemd service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.") | ||
// Deprecated: For backward compatibility on linux environment. Going forward "service" will be used instead of systemd-service | ||
if runtime.GOOS == "linux" { | ||
fs.MarkDeprecated("systemd-service", "please use --service flag instead") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecated
!= not working any more
.
systemd-service
should still apply value on hco.Service
if specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so need both like the following?
// Deprecated: For backward compatibility on linux environment. Going forward "service" will be used instead of systemd-service
if runtime.GOOS == "linux" {
fs.MarkDeprecated("systemd-service", "please use --service flag instead")
fs.StringVar(&hco.Service, "systemd-service", "",
"The underlying service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.")
}
fs.MarkDeprecated("systemd-service", "please use --service flag instead") | ||
} | ||
fs.StringVar(&hco.Service, "service", "", | ||
"The underlying service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should explain what the service means on linux and windows. And I don't think cri
is a valid option here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean cri is not a valid option? I think the message says to use containerd for cri?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh nvm. I misread the flag description.
// getUptimeFunc returns the time for which the given service has been running. | ||
func getUptimeFunc(service string) func() (time.Duration, error) { | ||
return func() (time.Duration, error) { | ||
// Using the WinEvent Log Objects to find the Service logs' time when the Service was last stopped/terminated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Service unhealthy != service goes down.
It is possible that docker is not healthy but still running, in that case will this function keep returning error?
cd14c8b
to
314c08f
Compare
/retest |
314c08f
to
c4e5400
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mcshooter, Random-Liu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Currently, the HealthChecker plugin only functions for linux OS. This change builds out the windows equivalent functionality of HealthChecker. Hence, when NPD is being ran on Windows nodes, we will be able to detect whether services are down. The HealthChecker will help detect whether docker, kubelet, crictl, and containerd are down and will attempt to repair the services if the HealthChecker is configured to repair the services.