Recently I ran into errors in a GKE cluster where a lot of pods were stuck in the CrashLoopBackOff state, which means the pods couldn't recover on their own. Taking a closer look, I saw errors like:
Caused by: java.net.UnknownHostException: mysql.svc.cluster.local
So it looked like a DNS issue. However, when I created a pod and ran some DNS tests, kube-dns worked just fine. Here are some handy one-liners worth noting (k is short for kubectl):
# run an ubuntu pod on the selected node, doing nothing
k run raytest --image=ubuntu --overrides='{"spec": {"nodeName": "gke-xxxx-xxxx"}}' -- sleep infinity
# get an interactive shell in the pod
k exec -ti raytest -- bash
# inside the pod, run some tests
apt-get update && apt-get install -y dnsutils
dig mysql.svc.cluster.local
So why couldn't the Java app resolve this DNS name? There's one important difference: the Java app had an istio-proxy sidecar but my ubuntu pod didn't. So I looked into the istio-proxy's logs:
{"level":"warn","time":"2024-01-31T23:52:39.155016Z","msg":"cannot reach the Google Instance metadata endpoint dial tcp 169.254.169.254:80: i/o timeout"}
{"level":"error","time":"2024-01-31T23:52:39.157050Z","msg":"failed to initialize envoy agent: failed to generate bootstrap config: unable to process Stackdriver tracer: missing GCP Project"}
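A sidecar log can be noisy, so here is a minimal sketch of how lines like the two above could be picked out. It assumes the sidecar container is named istio-proxy and that it logs one JSON object per line with a "level" field, as in the output above; the function name filter_sidecar_problems is made up for this example.

```shell
# Keep only warn/error entries from JSON log lines read on stdin.
# Assumes each line contains "level":"<level>" exactly as in the logs above.
filter_sidecar_problems() {
  grep -E '"level":"(warn|error)"'
}

# Usage against a real pod (needs cluster access; <pod name> is a placeholder):
#   k logs <pod name> -c istio-proxy | filter_sidecar_problems
```

A plain grep is used here rather than a JSON parser so it also works in images where nothing extra is installed.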
Then I verified this timeout from my ubuntu pod too:
root@raytest:/# curl http://metadata.google.internal
curl: (28) Failed to connect to metadata.google.internal port 80 after 129973 ms: Connection timed out
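Waiting over two minutes for the default timeout gets old quickly. A possible quicker check, as a sketch: ask the metadata server for the project ID (the value istio-proxy complained was missing) with a short timeout. The Metadata-Flavor header and the computeMetadata/v1 path are the standard GCE metadata API; the 5-second limit and the function name check_metadata are my own choices for this example.

```shell
# Returns 0 and prints the project ID if the GCE metadata server answers
# within 5 seconds; returns non-zero on timeout or HTTP error.
check_metadata() {
  curl --silent --fail --max-time 5 \
    -H "Metadata-Flavor: Google" \
    "http://metadata.google.internal/computeMetadata/v1/project/project-id"
}

# Usage, inside a pod on the suspect node:
#   check_metadata || echo "metadata server unreachable"
```

Running the same check from a pod on a healthy node should help confirm whether the problem is specific to one node.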
So the chain of errors looks like: timeout to the metadata service -> istio-proxy can't start -> Java app doesn't have network connectivity. But how to fix this? I have no idea. However, it's much simpler to just delete the problematic node and let a new node replace it.
# this one-liner lists the nodes with the most crashing pods
k get pods -A -o wide | grep Crash | awk '{print $(NF-2)}' | sort | uniq -c | sort -nr
# then drain the node at the top of the list and delete it
k drain --delete-emptydir-data --ignore-daemonsets <node name>
k delete node <node name>
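To see why $(NF-2) picks out the node name, here is the same pipeline run against a made-up sample of wide output. With -o wide, the last three columns are NODE, NOMINATED NODE and READINESS GATES, so the node is always third from the end, no matter how many fields the RESTARTS column takes up. All pod and node names below are invented for illustration.

```shell
# Made-up sample of `kubectl get pods -A -o wide` output.
sample_output() {
cat <<'EOF'
NAMESPACE   NAME    READY   STATUS             RESTARTS       AGE   IP         NODE         NOMINATED NODE   READINESS GATES
default     app-a   1/2     CrashLoopBackOff   12 (2m ago)    1h    10.0.0.5   gke-node-a   <none>           <none>
default     app-b   1/2     CrashLoopBackOff   9 (4m ago)     1h    10.0.0.6   gke-node-a   <none>           <none>
default     app-c   1/2     CrashLoopBackOff   3 (1m ago)     1h    10.0.1.7   gke-node-b   <none>           <none>
default     app-d   2/2     Running            0              1h    10.0.1.8   gke-node-b   <none>           <none>
EOF
}

# Same pipeline as above, reading from stdin instead of kubectl.
count_crashing_by_node() {
  grep Crash | awk '{print $(NF-2)}' | sort | uniq -c | sort -nr
}

sample_output | count_crashing_by_node
```

In this sample gke-node-a comes out on top with 2 crashing pods, followed by gke-node-b with 1. The header row and the Running pod are dropped by the grep.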
🙂