I planned to upgrade my home-lab Kubernetes cluster from 1.32 to 1.35 using the same shortcut I used last time. Unfortunately, when I took a look at the cluster, which I hadn’t touched for a while, I couldn’t connect to it anymore.
k get nodes
The connection to the server 192.168.1.93:6443 was refused - did you specify the right host or port?
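A quick way to rule out a local kubeconfig problem is to probe the port directly – if the connection is refused at the TCP level, nothing is listening on the API server side. A minimal sketch (the IP and port come from the error above; nc and curl are assumed to be installed):

# probe the API server endpoint directly
nc -zv 192.168.1.93 6443
# or, if netcat isn't around:
curl -k https://192.168.1.93:6443/healthz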
This is inconvenient: I can’t use kubectl to troubleshoot the cluster because the API server itself isn’t responding. I’ll have to drop down a level – to the Linux container level – to see what’s going on. The kube-apiserver always runs as a static pod on the master node, so it’s time to warm up my SSH skills!
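Since the control-plane components run as static pods managed by the kubelet, a reasonable first check on the master (not part of my session below, just standard systemd commands) is whether the kubelet itself is healthy:

# is the kubelet running at all?
systemctl status kubelet --no-pager
# recent kubelet logs often show why a static pod keeps dying
journalctl -u kubelet --since "10 minutes ago" --no-pager | tail -n 50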
ssh kube-master
...
sudo -i
...

# list all Linux containers
crictl ps -a
...
aba9a230c0fa2   1c20c8797e486   24 seconds ago   Exited    kube-apiserver   ...
...

# see the logs of the container
crictl logs $(crictl ps -a --name kube-apiserver -q | head -n 1)
...
F0511 01:56:17.009965       1 instance.go:226] Error creating leases: error creating storage factory: open /etc/kubernetes/pki/apiserver-etcd-client.crt: no such file or directory

# here we go! so the cert is missing for some reason
# here's the kubeadm command to double-check certs
kubeadm certs check-expiration
...
!MISSING! apiserver-etcd-client
...

# add the missing cert
kubeadm init phase certs apiserver-etcd-client
[certs] Generating "apiserver-etcd-client" certificate and key

# restart all containers just in case
systemctl restart kubelet

# check containers again
crictl ps -a
...
f8c8c3465461b   1c20c8797e486   3 seconds ago   Running   kube-apiserver   ...

# oh yeah!
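To double-check that the regenerated cert is actually usable, openssl can show its subject, issuer, and validity window, and verify it against the etcd CA. This is a sanity check I’d add here, not part of the session above; the paths are the standard kubeadm locations:

# inspect the new client cert: who issued it and how long it's valid
openssl x509 -in /etc/kubernetes/pki/apiserver-etcd-client.crt -noout -subject -issuer -dates
# it should verify against the etcd CA
openssl verify -CAfile /etc/kubernetes/pki/etcd/ca.crt /etc/kubernetes/pki/apiserver-etcd-client.crt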
It turns out that a single missing cert can cut off all communication with a Kubernetes cluster 🙂 Luckily, the workloads on the worker nodes weren’t affected.
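To avoid a repeat, a small preventive routine helps: check (or renew) the kubeadm-managed certs before they expire. A sketch of what I’d run on the master from time to time – both are standard kubeadm subcommands:

# list the expiry dates of all kubeadm-managed certs
kubeadm certs check-expiration
# renew everything in one shot, then restart the kubelet so the static pods pick up the new certs
kubeadm certs renew all
systemctl restart kubelet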
