Unhealthy controller manager and scheduler after leaving it running overnight #14036
The docs state that even a small cluster requires nodes with 2 vCPUs and 4GB RAM; please provision your nodes with these specifications and try again. Next time, please include the logs generated by the cluster components so we can take a look at what is happening in your cluster.
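For reference, on an RKE-provisioned node the cluster components run as plain Docker containers, so their logs can usually be pulled straight from the Docker daemon. A minimal sketch of gathering them, assuming the default RKE container names (etcd, kube-apiserver, kube-controller-manager, kube-scheduler) and a kubeconfig with access to the cluster:

```bash
# Check the health reported for the scheduler, controller manager and etcd
# (componentstatuses was still available on the Kubernetes releases in use here).
kubectl get componentstatuses

# On each etcd/control plane node, dump recent logs of the core components.
# Container names assume a default RKE-provisioned node.
for c in etcd kube-apiserver kube-controller-manager kube-scheduler; do
  echo "===== $c ====="
  docker logs --tail 200 --timestamps "$c"
done
```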
You're right, I thought that requirement was for the Rancher server, but the documentation clearly states that it's for the nodes. I upgraded to the 2 vCPU / 4GB droplets, increased the etcd/control plane count to 3 nodes, and it hasn't happened again.
We experienced the same issue, albeit with the required 2 vCPUs and 4GB RAM. This happened yesterday, 2018-06-28, at around 12:00+0200.
@apricote You can try upgrading to Rancher 2.0.4. Rancher 2.0.2 used control plane nodes as worker nodes to schedule workload pods, which could cause the etcd/control plane node to run out of memory.
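To see whether workload pods are actually landing on the etcd/control plane node, one option is to list pods by the node they were scheduled on. A minimal sketch, assuming kubectl access to the downstream cluster and a hypothetical control plane node name `control-1`:

```bash
# List all pods with their node, then filter for the etcd/control plane node
# (hypothetical node name: control-1).
kubectl get pods --all-namespaces -o wide | grep control-1

# If user workloads show up there, a taint keeps new pods off that node
# (newer Rancher/RKE versions apply a similar taint on their own).
kubectl taint nodes control-1 node-role.kubernetes.io/controlplane=true:NoSchedule
```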
Upgrade is scheduled for next week. Edit: Restarting the node solved the issue.
@apricote I restarted all the worker nodes and it worked. No idea when this will happen again.
Same issue here.
I keep having the same issue, although it is not an "overnight" one; it can happen without any apparent reason. Another problem I face is that I couldn't find a way to restart the unhealthy parts (Controller Manager and Scheduler), so I always had to recreate a whole new cluster. Do you know if it is possible to restart only the unhealthy components?
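Restarting only the unhealthy components is usually possible without recreating the cluster: on RKE-provisioned nodes the scheduler and controller manager run as individual Docker containers. A minimal sketch, assuming the default RKE container names and SSH access to the affected control plane node:

```bash
# On the control plane node reporting the unhealthy components,
# restart just the scheduler and controller manager containers.
docker restart kube-scheduler kube-controller-manager

# Confirm both containers are running again; the health alert in the
# Rancher UI should clear shortly afterwards.
docker ps --filter name=kube-scheduler --filter name=kube-controller-manager
```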
Can this be re-opened? I am curious why it was closed with no resolution.
I restarted the etcd container on each host that had the unhealthy error, and that clears the alert.
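For completeness, a minimal sketch of that workaround on an RKE-provisioned etcd node; the container name `etcd` is the RKE default, and the membership check may need the `ETCDCTL_API=3` environment variable and the TLS certificate flags your cluster was provisioned with:

```bash
# On each etcd node showing the unhealthy alert, restart the etcd container.
docker restart etcd

# Optionally check etcd membership from inside the container afterwards;
# depending on the cluster this may require --cacert/--cert/--key flags.
docker exec etcd etcdctl member list
```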
That would be a bad solution, @matfiz. I have the same issue; one node became unhealthy overnight on a cluster I don't use.
This issue was closed because the node specs were not up to par with the requirements. Since then there have been replies indicating the scheduler failing, kube-controller-manager failing, both failing, or the node not responding. Without the info requested in the issue template there is no way to investigate what is going on, or whether it has possibly been fixed in newer versions.

Regarding the scheduler specifically: it does not exit automatically on failure; this will be fixed in k8s 1.16 (kubernetes/kubernetes#81306). There is another issue with using […].

If you encounter issues similar to this, please file a new issue with all the requested info so we can look into it and investigate. In the case of the screenshot above, your smallest node is running the most important components of the cluster; on paper this may seem enough, but depending on the amount of resources in use (deployments, events, etc.) it might not be, and without that info it's hard to determine. Providing full logs of all the components involved usually helps, as it gives an indication of when the issue started to appear; in the case of a resource problem, system logging and metrics will also help the investigation.
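When attaching that information to a new issue, the following sketch shows the kind of system-level data that usually helps pin down a resource problem; it assumes SSH access to the affected node, a systemd-based distribution, and an incident window like the one reported earlier in this thread:

```bash
# Memory and load on the node at the time of the problem.
free -m
uptime

# Per-container resource usage, to see which component is consuming memory.
docker stats --no-stream

# Docker daemon and kernel logs around the time the components became unhealthy
# (adjust the time window to your incident).
journalctl -u docker --since "2018-06-28 11:00" --until "2018-06-28 13:00"
dmesg | grep -i "out of memory"
```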
rancher/rancher: 2.0.2 (single node install)
Provider: Digital Ocean (1 etcd/control plane node, 3 worker nodes, all 1 vCPU, 1GB RAM, 25GB disk, Ubuntu 16.04 x64)
Steps to Reproduce: