Separate linux/windows health checker files. Build health checker plugin for Windows #544

Merged
merged 1 commit into from
Apr 26, 2021

Conversation

mcshooter
Contributor

@mcshooter mcshooter commented Apr 14, 2021

Currently, the HealthChecker plugin works only on Linux. This change builds the equivalent HealthChecker functionality for Windows, so that when NPD (Node Problem Detector) runs on Windows nodes, it can detect whether services are down. The HealthChecker detects whether docker, kubelet, crictl, and containerd are down and, if it is configured to repair services, attempts to repair them.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 14, 2021
@k8s-ci-robot
Contributor

Hi @mcshooter. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Apr 14, 2021
@mcshooter
Contributor Author

/cc jeremyje

@k8s-ci-robot k8s-ci-robot requested a review from jeremyje April 14, 2021 22:20
cmd/healthchecker/options/options_windows.go Outdated
pkg/healthchecker/health_checker_windows.go Outdated
pkg/healthchecker/health_checker_windows.go Outdated
pkg/healthchecker/health_checker_windows.go Outdated
pkg/healthchecker/health_checker_windows.go Outdated
@jeremyje
Contributor

/sig windows
/ok-to-test
/priority important-soon

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/windows Categorizes an issue or PR as relevant to SIG Windows. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 15, 2021
@jeremyje
Contributor

@mcshooter Can you edit the description of the change to explain why this change is necessary?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 15, 2021
@mcshooter mcshooter changed the title from "Separate linux/windows health checker files. Build out windows Health Checker Plugin" to "Separate linux/windows health checker files. Build health checker plugin for Windows" Apr 15, 2021
@mcshooter mcshooter force-pushed the buildWindowsHealthChecker branch from 4e7fa06 to da5fdc1 Compare April 16, 2021 17:53
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 16, 2021
@Random-Liu
Member

/cc @liyanhui1228

@k8s-ci-robot
Contributor

@Random-Liu: GitHub didn't allow me to request PR reviews from the following users: liyanhui1228.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @liyanhui1228

@liyanhui1228 liyanhui1228 left a comment

LGTM with one comment.

pkg/healthchecker/health_checker_windows.go
@mcshooter mcshooter force-pushed the buildWindowsHealthChecker branch 2 times, most recently from 687d2e0 to dad581d Compare April 19, 2021 23:12
@mcshooter
Contributor Author

I also added the respective windows health checker config files @jeremyje @liyanhui1228

cmd/healthchecker/options/options.go
// getUptimeFunc returns the time for which the given service has been running.
func getUptimeFunc(service string) func() (time.Duration, error) {
	return func() (time.Duration, error) {
		// Using the WinEvent Log Objects to find the Service logs' time when the Service was last stopped/terminated.
Member

What happens for the first run of the service, and no stopped/terminated log exists?

Contributor Author

This only gets executed when the service is detected as unhealthy. If the service is running and none of the logs indicate that the node is unhealthy, this does not get executed. In other words, nothing happens until an issue is detected.

Member

Service unhealthy != service goes down.

It is possible that docker is unhealthy but still running. In that case, will this function keep returning an error?

Contributor Author

So long as docker.exe ps does not return an error, the uptime function will not get executed. We test the health of docker by running docker.exe ps. But as long as docker.exe ps returns an error, then yes, this function will keep returning an error until the service is repaired successfully. Does that answer your question?

Member

uptime would always be 0?

How could it be greater than hc.coolDownTime, and will the repair ever happen?

Member

If for some reason docker fails to execute docker.exe ps and is still actually running, healthchecker will indicate that it is unhealthy, and then yes, the uptime function will always return 0 because it did not find when docker actually stopped. In that case, the repair won't happen.

But we want the repair to happen in that case I think. If the repair won't happen in that case, it seems like a bug to me.

Contributor Author

I am still not quite sure how I would determine docker has entered that state. Also, I am just porting over the same way it is done in Linux. Would it make sense to create another work item to address this for both linux and windows?

Member

@Random-Liu Random-Liu Apr 21, 2021

The systemd timestamp we use reflects the service start time:

The InactiveExitTimestamp tracks when a particular systemd unit transitions from the Inactive to Active state, which can be used to mark the beginning of systemd’s activation of cloud-init.

You should use the service start time instead.

Contributor Author

Ah, you're right. Let me make those changes to get the last start time instead of the last stop/terminate time

Contributor Author

@mcshooter mcshooter Apr 21, 2021

I'm not sure the change completely addresses the original concern, but now, if docker.exe ps fails while docker is still running, the uptime function looks up the service's last start time. So, in that case, it should attempt to repair the service.

pkg/healthchecker/health_checker_windows.go
@mcshooter mcshooter force-pushed the buildWindowsHealthChecker branch from dad581d to bcba3af Compare April 20, 2021 04:21
@jeremyje
Contributor

This is for #461

@mcshooter mcshooter force-pushed the buildWindowsHealthChecker branch from bcba3af to 5b466c2 Compare April 20, 2021 18:53
	"The underlying systemd service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.")
// Deprecated: For backward compatibility on linux environment. Going forward "service" will be used instead of systemd-service
if runtime.GOOS == "linux" {
	fs.MarkDeprecated("systemd-service", "please use --service flag instead")
Member

Deprecated != not working any more.

systemd-service should still set hco.Service if specified.

Contributor Author

@mcshooter mcshooter Apr 20, 2021

Ah, so we need both, like the following?

// Deprecated: For backward compatibility on linux environment. Going forward "service" will be used instead of systemd-service
if runtime.GOOS == "linux" {
	// Define the flag before marking it deprecated: pflag's MarkDeprecated
	// returns an error if the named flag does not exist yet.
	fs.StringVar(&hco.Service, "systemd-service", "",
		"The underlying service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.")
	fs.MarkDeprecated("systemd-service", "please use --service flag instead")
}
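As a self-contained illustration of the point that a deprecated flag should keep working, the sketch below binds both flag names to the same field. It uses Go's standard flag package rather than the flag library in the PR, so there is no MarkDeprecated call here. The options type, the newFlagSet helper, and the goos parameter are all hypothetical and exist only to make the example runnable and testable.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"runtime"
)

// options is a hypothetical stand-in for the health checker options struct.
type options struct {
	Service string
}

// newFlagSet registers --service on every OS and, on Linux only, keeps the
// deprecated --systemd-service spelling bound to the same field.
func newFlagSet(o *options, goos string) *flag.FlagSet {
	fs := flag.NewFlagSet("healthchecker", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	fs.StringVar(&o.Service, "service", "",
		"The underlying service responsible for the component.")
	if goos == "linux" {
		fs.StringVar(&o.Service, "systemd-service", "",
			"Deprecated: please use --service instead.")
	}
	return fs
}

func main() {
	var o options
	fs := newFlagSet(&o, runtime.GOOS)
	// On Linux the old spelling still sets o.Service; elsewhere it is rejected.
	if err := fs.Parse([]string{"--systemd-service=docker"}); err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("service:", o.Service)
}
```

Binding both names to one variable means callers that still pass the old flag get the same behavior as before, while new callers can use the OS-neutral name.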

	fs.MarkDeprecated("systemd-service", "please use --service flag instead")
}
fs.StringVar(&hco.Service, "service", "",
	"The underlying service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.")
Member

We should explain what the service means on linux and windows. And I don't think cri is a valid option here.

Contributor Author

What do you mean cri is not a valid option? I think the message says to use containerd for cri?

Member

Oh nvm. I misread the flag description.

@mcshooter mcshooter force-pushed the buildWindowsHealthChecker branch 2 times, most recently from cd14c8b to 314c08f Compare April 21, 2021 18:22
@mcshooter
Contributor Author

/retest

@mcshooter mcshooter force-pushed the buildWindowsHealthChecker branch from 314c08f to c4e5400 Compare April 26, 2021 21:45
@Random-Liu
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 26, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mcshooter, Random-Liu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 26, 2021
@k8s-ci-robot k8s-ci-robot merged commit 031e658 into kubernetes:master Apr 26, 2021
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/windows Categorizes an issue or PR as relevant to SIG Windows.
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
5 participants