Unable to scale workload control plane on vSphere #836

Open
Tracked by #2213
cormachogan opened this issue Jul 5, 2021 · 2 comments

Bug Report

Deployed a dev workload cluster on vSphere - 1 control plane, 1 worker node.
Scaled worker nodes from 1 to 3. Success!
Scaled control plane nodes from 1 to 3. Failure!

What I observed was as follows:

  • The second control plane node is successfully cloned, powered on, and receives an IP address via DHCP.
  • The original control plane node seems to lose its network information (both the VM IP and the VIP for the K8s API server), as observed in the vSphere Client UI (see the kube-vip check sketched after this list).
  • The K8s API server is no longer reachable via kubectl commands.
  • CPU usage on the original control plane node/VM triggers a vSphere alarm (4.774 GHz used).
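
Since the VIP disappears from the original node, one thing worth checking is the kube-vip static pod on that node. This is a hedged sketch: it assumes kube-vip is managing the control plane VIP (as is typical for TKG clusters on vSphere), and the container ID is a placeholder:

% ssh capv@<original-control-plane-ip>
$ sudo crictl ps -a --name kube-vip            # is the kube-vip container still running?
$ sudo crictl logs <kube-vip-container-id>     # look for leader-election / VIP advertisement errors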

Switched to the management cluster context to look at some logs:

% kubectl logs capi-kubeadm-control-plane-controller-manager-5596569b-q6rxz -n capi-kubeadm-control-plane-system manager
I0705 08:24:34.446929       1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=1
I0705 08:24:34.923501       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:24:35.396995       1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=2
I0705 08:24:35.399412       1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "failures"="[machine workload-control-plane-gpgp6 does not have APIServerPodHealthy condition, machine workload-control-plane-gpgp6 does not have ControllerManagerPodHealthy condition, machine workload-control-plane-gpgp6 does not have SchedulerPodHealthy condition, machine workload-control-plane-gpgp6 does not have EtcdPodHealthy condition, machine workload-control-plane-gpgp6 does not have EtcdMemberHealthy condition]"
.
.
.
I0705 08:26:26.708159       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:27:27.066038       1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=2
I0705 08:27:27.066330       1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "failures"="[machine workload-control-plane-kvj7j reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member), machine workload-control-plane-gpgp6 reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member)]"
I0705 08:27:42.486017       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="vcsa06-octoc" "kubeadmControlPlane"="vcsa06-octoc-control-plane" "namespace"="tkg-system"
I0705 08:27:57.078706       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
I0705 08:27:57.117486       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:28:57.168628       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
E0705 08:28:57.187424       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:28:57.188000       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:29:57.225857       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
E0705 08:29:57.227366       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:29:57.227913       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:30:52.225222       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="vcsa06-octoc" "kubeadmControlPlane"="vcsa06-octoc-control-plane" "namespace"="tkg-system"
I0705 08:30:57.267482       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
E0705 08:30:57.268704       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:30:57.269114       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"

To try to regain access to the cluster, I reset the original control plane node/VM via the vSphere Client. After the reboot and a few minutes' wait, the node regained its networking configuration and the API server was reachable again.

However, the control plane is still not reconciled:

% kubectl get nodes -o wide
NAME                            STATUS     ROLES                  AGE     VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION   CONTAINER-RUNTIME
workload-control-plane-gpgp6    NotReady   <none>                 38m     v1.20.4+vmware.1   10.27.51.25   10.27.51.25   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-control-plane-kvj7j    Ready      control-plane,master   2d20h   v1.20.4+vmware.1   10.27.51.61   10.27.51.61   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-md-0-984748884-g5884   Ready      <none>                 2d20h   v1.20.4+vmware.1   10.27.51.63   10.27.51.63   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-md-0-984748884-jtg7q   Ready      <none>                 2d20h   v1.20.4+vmware.1   10.27.51.64   10.27.51.64   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-md-0-984748884-pvlnq   Ready      <none>                 2d20h   v1.20.4+vmware.1   10.27.51.62   10.27.51.62   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
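
For reference, the reconciliation state can also be inspected from the management cluster (a sketch; the object name and namespace are taken from the controller logs above):

% kubectl get kubeadmcontrolplane,machines -n default
% kubectl describe kubeadmcontrolplane workload-control-plane -n default

The describe output should show which preflight/health conditions are failing for each control plane machine.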

The kubelet on the new node also seems to have an issue with the CSI driver, but I cannot tell if this is the root cause:

% ssh capv@10.27.51.25
Last login: Mon Jul  5 08:44:34 2021 from 10.30.3.96
 08:52:54 up 27 min,  0 users,  load average: 0.30, 0.34, 0.19
tdnf update info not available yet!
capv@workload-control-plane-gpgp6 [ ~ ]$ sudo su -
root@workload-control-plane-gpgp6 [ ~ ]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2021-07-05 08:28:52 UTC; 24min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 2941 (kubelet)
    Tasks: 16 (limit: 4714)
   Memory: 46.1M
   CGroup: /system.slice/kubelet.service
           └─2941 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cloud-provider=external --container-runtime=remote --container-runtime-endpoint=/var/run/containerd/containerd.sock --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 --pod-infra-container-image=projects.registry.vmware.com/tkg/pause:3.2

Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.152310    2941 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/csi.vsphere.vmware.com-reg.sock  <nil> 0 <nil>}] <nil> <nil>}
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.152321    2941 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153041    2941 csi_plugin.go:100] kubernetes.io/csi: Trying to validate a new CSI Driver with name: csi.vsphere.vmware.com endpoint: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock versions: 1.0.0
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153070    2941 csi_plugin.go:113] kubernetes.io/csi: Register new plugin with name: csi.vsphere.vmware.com at endpoint: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153104    2941 clientconn.go:106] parsed scheme: ""
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153113    2941 clientconn.go:106] scheme "" not registered, fallback to default scheme
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153146    2941 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153154    2941 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153185    2941 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
Jul 05 08:50:31 workload-control-plane-gpgp6 kubelet[2941]: E0705 08:50:31.970708    2941 nodeinfomanager.go:574] Invalid attach limit value 0 cannot be added to CSINode object for "csi.vsphere.vmware.com"
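
To dig into why the new node stays NotReady (and whether the CSI message matters), a couple of checks run against the workload cluster may help (a sketch; the node name is taken from the output above):

% kubectl describe node workload-control-plane-gpgp6            # NodeReady conditions, taints, recent events
% kubectl get csinode workload-control-plane-gpgp6 -o yaml      # attach limit the kubelet tried to register for csi.vsphere.vmware.com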

I have managed to reproduce this scenario twice, with two different TKG clusters on vSphere.

Expected Behavior

That the control plane would scale seamlessly.

Steps to Reproduce the Bug

  1. Deploy a "dev" workload cluster with a single control plane node
  2. Attempt to scale the control plane from 1 to 3 nodes (an example scale command is sketched below)
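
The scale operation was performed with the tanzu CLI, roughly as follows (a sketch; the cluster name matches the one above, and the flag name should be checked against the tanzu CLI version in use):

% tanzu cluster scale workload --controlplane-machine-count 3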

Environment Details

  • TCE version
    v0.5.0
  • tanzu version
    version: v1.3.0
    buildDate: 2021-06-03
    sha: b261a8b
  • kubectl version
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4+vmware.1", GitCommit:"d475bbd9e7cd66c6db7069cb447766daada65e3b", GitTreeState:"clean", BuildDate:"2021-02-22T22:15:46Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System (client):
    macOS Big Sur version 11.4
@cormachogan (Author)

Seems the CSI issue - Invalid attach limit value 0 cannot be added to CSINode object for "csi.vsphere.vmware.com" - is not related, as this occurs on all nodes, even on a fresh deployment.
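
If it helps, one quick way to confirm whether an attach limit was registered on each node (a sketch using kubectl's jsonpath output):

% kubectl get csinode -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.drivers[*].allocatable.count}{"\n"}{end}'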

@cormachogan (Author)

Seems this issue also impacts the initial deployment of control planes. If I deploy a "dev" control plane with a single node, it comes up immediately. If I deploy a "prod" (multi-node) control plane, it appears to encounter the same issue as scaling.
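
When reproducing with a "prod" deployment, the same stall can be watched from the management cluster while the machines come up (a sketch):

% kubectl get machines -n default -w
% kubectl get events -n default --sort-by=.lastTimestamp | tail -20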

@jpmcb jpmcb transferred this issue from vmware-tanzu/community-edition Oct 11, 2021
@vuil vuil added the kind/bug, area/plugin, and needs-severity labels Oct 19, 2021