Unable to scale workload control plane on vSphere #836

Open
Tracked by #2213
cormachogan opened this issue Jul 5, 2021 · 2 comments

Bug Report

Deployed a dev workload cluster on vSphere - 1 control plane, 1 worker node.
Scaled worker nodes from 1 to 3. Success!
Scaled control plane nodes from 1 to 3. Failure!

What I observed was as follows:

  • The second control plane node is successfully cloned, powered on, and receives an IP address via DHCP.
  • The original control plane node seems to lose its network information (both the VM IP and the VIP for the K8s API server), as observed in the vSphere Client UI (see the kube-vip check sketched after this list).
  • The K8s API server is no longer reachable via kubectl commands.
  • CPU usage on the original control plane node/VM triggers a vSphere alarm (4.774 GHz used).
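
Since the VIP disappears from the original node, one thing worth checking is the kube-vip static pod on that node. This is a hedged sketch: it assumes kube-vip is managing the control plane VIP (as is typical for TKG clusters on vSphere), and the container ID is a placeholder:

% ssh capv@<original-control-plane-ip>
$ sudo crictl ps -a --name kube-vip            # is the kube-vip container still running?
$ sudo crictl logs <kube-vip-container-id>     # look for leader-election / VIP advertisement errors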

Switched to the management cluster context to look at some logs:

% kubectl logs capi-kubeadm-control-plane-controller-manager-5596569b-q6rxz -n capi-kubeadm-control-plane-system manager
I0705 08:24:34.446929       1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=1
I0705 08:24:34.923501       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:24:35.396995       1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=2
I0705 08:24:35.399412       1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "failures"="[machine workload-control-plane-gpgp6 does not have APIServerPodHealthy condition, machine workload-control-plane-gpgp6 does not have ControllerManagerPodHealthy condition, machine workload-control-plane-gpgp6 does not have SchedulerPodHealthy condition, machine workload-control-plane-gpgp6 does not have EtcdPodHealthy condition, machine workload-control-plane-gpgp6 does not have EtcdMemberHealthy condition]"
.
.
.
I0705 08:26:26.708159       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:27:27.066038       1 controller.go:355] controllers/KubeadmControlPlane "msg"="Scaling up control plane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "Desired"=3 "Existing"=2
I0705 08:27:27.066330       1 scale.go:206] controllers/KubeadmControlPlane "msg"="Waiting for control plane to pass preflight checks" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "failures"="[machine workload-control-plane-kvj7j reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-kvj7j reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member), machine workload-control-plane-gpgp6 reports APIServerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports ControllerManagerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports SchedulerPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports EtcdPodHealthy condition is unknown (Failed to get the node which is hosting this component), machine workload-control-plane-gpgp6 reports EtcdMemberHealthy condition is unknown (Failed to get the node which is hosting the etcd member)]"
I0705 08:27:42.486017       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="vcsa06-octoc" "kubeadmControlPlane"="vcsa06-octoc-control-plane" "namespace"="tkg-system"
I0705 08:27:57.078706       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
I0705 08:27:57.117486       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:28:57.168628       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
E0705 08:28:57.187424       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:28:57.188000       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:29:57.225857       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
E0705 08:29:57.227366       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:29:57.227913       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"
I0705 08:30:52.225222       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="vcsa06-octoc" "kubeadmControlPlane"="vcsa06-octoc-control-plane" "namespace"="tkg-system"
I0705 08:30:57.267482       1 controller.go:182] controllers/KubeadmControlPlane "msg"="Could not connect to workload cluster to fetch status" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default" "err"="failed to create remote cluster client: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: context deadline exceeded"
E0705 08:30:57.268704       1 controller.go:257] controller-runtime/controller "msg"="Reconciler error" "error"="cannot get remote client to workload cluster: default/workload: Get https://10.27.51.243:6443/api?timeout=30s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)" "controller"="kubeadmcontrolplane" "name"="workload-control-plane" "namespace"="default"
I0705 08:30:57.269114       1 controller.go:244] controllers/KubeadmControlPlane "msg"="Reconcile KubeadmControlPlane" "cluster"="workload" "kubeadmControlPlane"="workload-control-plane" "namespace"="default"

To try to regain access to the cluster, I reset the original control plane node/VM via the vSphere Client. After the reboot and a few minutes' wait, the node regained its networking configuration and the API server was reachable again.

However, the control plane is still not reconciled:

% kubectl get nodes -o wide
NAME                            STATUS     ROLES                  AGE     VERSION            INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                 KERNEL-VERSION   CONTAINER-RUNTIME
workload-control-plane-gpgp6    NotReady   <none>                 38m     v1.20.4+vmware.1   10.27.51.25   10.27.51.25   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-control-plane-kvj7j    Ready      control-plane,master   2d20h   v1.20.4+vmware.1   10.27.51.61   10.27.51.61   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-md-0-984748884-g5884   Ready      <none>                 2d20h   v1.20.4+vmware.1   10.27.51.63   10.27.51.63   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-md-0-984748884-jtg7q   Ready      <none>                 2d20h   v1.20.4+vmware.1   10.27.51.64   10.27.51.64   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
workload-md-0-984748884-pvlnq   Ready      <none>                 2d20h   v1.20.4+vmware.1   10.27.51.62   10.27.51.62   VMware Photon OS/Linux   4.19.174-5.ph3   containerd://1.4.3
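
For reference, the reconciliation state can also be inspected from the management cluster (a sketch; the object name and namespace are taken from the controller logs above):

% kubectl get kubeadmcontrolplane,machines -n default
% kubectl describe kubeadmcontrolplane workload-control-plane -n default

The describe output should show which preflight/health conditions are failing for each control plane machine.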

The kubelet on the new node also seems to have an issue with the CSI driver, but I cannot tell if this is the root cause:

% ssh capv@10.27.51.25
Last login: Mon Jul  5 08:44:34 2021 from 10.30.3.96
 08:52:54 up 27 min,  0 users,  load average: 0.30, 0.34, 0.19
tdnf update info not available yet!
capv@workload-control-plane-gpgp6 [ ~ ]$ sudo su -
root@workload-control-plane-gpgp6 [ ~ ]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2021-07-05 08:28:52 UTC; 24min ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 2941 (kubelet)
    Tasks: 16 (limit: 4714)
   Memory: 46.1M
   CGroup: /system.slice/kubelet.service
           └─2941 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cloud-provider=external --container-runtime=remote --container-runtime-endpoint=/var/run/containerd/containerd.sock --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384 --pod-infra-container-image=projects.registry.vmware.com/tkg/pause:3.2

Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.152310    2941 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/csi.vsphere.vmware.com-reg.sock  <nil> 0 <nil>}] <nil> <nil>}
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.152321    2941 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153041    2941 csi_plugin.go:100] kubernetes.io/csi: Trying to validate a new CSI Driver with name: csi.vsphere.vmware.com endpoint: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock versions: 1.0.0
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153070    2941 csi_plugin.go:113] kubernetes.io/csi: Register new plugin with name: csi.vsphere.vmware.com at endpoint: /var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153104    2941 clientconn.go:106] parsed scheme: ""
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153113    2941 clientconn.go:106] scheme "" not registered, fallback to default scheme
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153146    2941 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins/csi.vsphere.vmware.com/csi.sock  <nil> 0 <nil>}] <nil> <nil>}
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153154    2941 clientconn.go:948] ClientConn switching balancer to "pick_first"
Jul 05 08:50:26 workload-control-plane-gpgp6 kubelet[2941]: I0705 08:50:26.153185    2941 clientconn.go:897] blockingPicker: the picked transport is not ready, loop back to repick
Jul 05 08:50:31 workload-control-plane-gpgp6 kubelet[2941]: E0705 08:50:31.970708    2941 nodeinfomanager.go:574] Invalid attach limit value 0 cannot be added to CSINode object for "csi.vsphere.vmware.com"
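
To dig into why the new node stays NotReady (and whether the CSI message matters), a couple of checks run against the workload cluster may help (a sketch; the node name is taken from the output above):

% kubectl describe node workload-control-plane-gpgp6            # NodeReady conditions, taints, recent events
% kubectl get csinode workload-control-plane-gpgp6 -o yaml      # attach limit the kubelet tried to register for csi.vsphere.vmware.com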

I have managed to reproduce this scenario twice, with two different TKG clusters on vSphere.

Expected Behavior

That the control plane would scale seamlessly.

Steps to Reproduce the Bug

  1. Deploy a "dev" workload cluster with a single control plane node
  2. Attempt to scale the control plane from 1 to 3 nodes (an example scale command is sketched below)
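
The scale operation was performed with the tanzu CLI, roughly as follows (a sketch; the cluster name matches the one above, and the flag name should be checked against the tanzu CLI version in use):

% tanzu cluster scale workload --controlplane-machine-count 3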

Environment Details

  • TCE version
    v0.5.0
  • tanzu version
    version: v1.3.0
    buildDate: 2021-06-03
    sha: b261a8b
  • kubectl version
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4+vmware.1", GitCommit:"d475bbd9e7cd66c6db7069cb447766daada65e3b", GitTreeState:"clean", BuildDate:"2021-02-22T22:15:46Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Operating System (client):
    macOS Big Sur version 11.4
@cormachogan (Author)

Seems the CSI issue - Invalid attach limit value 0 cannot be added to CSINode object for "csi.vsphere.vmware.com" - is not related, as this occurs on all nodes, even on a fresh deployment.
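
If it helps, one quick way to confirm whether an attach limit was registered on each node (a sketch using kubectl's jsonpath output):

% kubectl get csinode -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.drivers[*].allocatable.count}{"\n"}{end}'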

@cormachogan (Author)

Seems this issue also impacts the initial deployment of control planes. If I deploy a "dev" control plane with a single node, it comes up immediately. If I deploy a "prod" (multi-node) control plane, it appears to encounter the same issue as scaling.
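
When reproducing with a "prod" deployment, the same stall can be watched from the management cluster while the machines come up (a sketch):

% kubectl get machines -n default -w
% kubectl get events -n default --sort-by=.lastTimestamp | tail -20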

@jpmcb jpmcb transferred this issue from vmware-tanzu/community-edition Oct 11, 2021
@vuil vuil added the kind/bug, area/plugin, and needs-severity labels Oct 19, 2021