A GKE Issue: Metadata Service Is Unreachable From Nodes


Recently I encountered some errors in a GKE cluster where a lot of pods were stuck in the CrashLoopBackOff state, which means the pods couldn't recover on their own. Taking a closer look, I saw errors like:

Caused by: java.net.UnknownHostException: mysql.svc.cluster.local
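
(For reference, this is roughly how the failing pods and that stack trace can be found; the pod and namespace names below are placeholders.)

# list pods stuck in CrashLoopBackOff
k get pods -A | grep CrashLoopBackOff
# read the logs of the previous (crashed) container of one of them
k logs -n my-namespace my-java-app-xxxx --previous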

So it's a DNS issue. However, when I created a pod on the same node and ran some DNS tests, kube-dns worked just fine. These 1-liners are very handy and worth noting:

# run an ubuntu pod pinned to the suspect node, doing nothing
k run raytest --image=ubuntu --overrides='{"spec": {"nodeName": "gke-xxxx-xxxx"}}' -- sleep infinity

# get an interactive shell of the pod
k exec -ti raytest -- bash

# inside the pod, run some tests
apt-get update && apt-get install -y dnsutils
dig mysql.svc.cluster.local
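
A couple of extra checks can help narrow a DNS problem down further (the kube-dns ClusterIP below is a placeholder; the real one comes from k get svc -n kube-system kube-dns):

# confirm which resolver the pod is actually using
cat /etc/resolv.conf
# query the cluster DNS service directly, bypassing the search path (replace 10.0.0.10 with your kube-dns ClusterIP)
dig mysql.svc.cluster.local @10.0.0.10
# sanity-check an external name too
dig google.com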

So why couldn't the Java app resolve this DNS name? There's one important difference: the Java app had an istio-proxy sidecar but my ubuntu pod didn't. So I looked into istio-proxy's logs

{"level":"warn","time":"2024-01-31T23:52:39.155016Z","msg":"cannot reach the Google Instance metadata endpoint dial tcp 169.254.169.254:80: i/o timeout"}
{"level":"error","time":"2024-01-31T23:52:39.157050Z","msg":"failed to initialize envoy agent: failed to generate bootstrap config: unable to process Stackdriver tracer: missing GCP Project"}

Then I verified the same timeout from my ubuntu pod too

root@raytest:/# curl http://metadata.google.internal
curl: (28) Failed to connect to metadata.google.internal port 80 after 129973 ms: Connection timed out
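
For comparison, on a healthy node/pod the metadata server answers almost instantly; the v1 endpoints just require the Metadata-Flavor header:

# on a healthy pod this prints the GCP project ID right away
curl -s -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/project/project-id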

So the chain of errors looks like: timeout to the metadata service -> istio-proxy can't start -> the Java app has no network connectivity. But how to fix this? I have no idea, but it's much simpler to just delete the problematic node and let a new node replace it.

# this 1-liner lists the nodes with the most crashing pods (with -o wide, the NODE column is the 3rd field from the end)
k get pods -A -o wide | grep Crash | awk '{print $(NF-2)}' | sort | uniq -c | sort -nr
# then drain the node at the top of the list and delete it
k drain --delete-emptydir-data --ignore-daemonsets <node name>
k delete node <node name>
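
Once GKE brings up the replacement node, it's worth a quick sanity check before calling it done:

# confirm the replacement node registered and is Ready
k get nodes
# optionally, run a throwaway pod on the new node (same trick as above) and repeat the metadata/DNS checks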

🙂