Fixed CoreDNS High CPU Issue in a Kubernetes Cluster


The master node got hammered really hard after 8PM
Handy Slack notification

There was a Grafana alert saying that CPU usage was quite high on the master node of my garage Kubernetes cluster. I was watching a movie, so I didn't jump on it right away 🙂 I had a look at the master node today, and this is how I fixed the issue.

With good old SSH, I saw the CoreDNS process on the master node chewing up a big chunk of CPU time. From the CoreDNS pod's logs (klogs is a handy bash function to tail a pod's logs, sketched below):

klogs
.:53
[INFO] plugin/reload: Running configuration MD5 = 33fe3fc0f84bf45fc2e8e8e9701ce653
CoreDNS-1.8.0
linux/arm64, go1.15.3, 054c9ae
[ERROR] plugin/errors: 2 zipkin.istio-system. A: dial tcp 192.168.1.254:53: connect: connection refused
[ERROR] plugin/errors: 2 zipkin.istio-system. A: dial tcp 192.168.1.254:53: connect: connection refused
[ERROR] plugin/errors: 2 zipkin.istio-system. A: dial tcp 192.168.1.254:53: connect: connection refused
[ERROR] plugin/errors: 2 zipkin.istio-system. A: dial tcp 192.168.1.254:53: connect: connection refused
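
The klogs above isn't a standard command, just a small bash helper. A minimal sketch of such a function (the pod-matching logic and the coredns default here are illustrative, not the exact helper) could look like:

klogs() {
  # Follow the logs of the first pod whose name matches the given fragment,
  # defaulting to coredns when no argument is given
  local ns pod
  read -r ns pod < <(kubectl get pods -A --no-headers | grep "${1:-coredns}" | head -n 1 | awk '{print $1, $2}')
  kubectl logs -n "$ns" "$pod" -f --tail=100
}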

From the error messages, something was repeatedly looking up zipkin.istio-system (Istio's Zipkin tracing service), and CoreDNS couldn't reach its upstream DNS server (192.168.1.254). I tested a DNS query on the node with:

dig raynix.info @192.168.1.254
...
;; Query time: 27 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Thu Oct 28 14:45:15 AEDT 2021
;; MSG SIZE  rcvd: 72	

It worked in a single attempt, but it probably failed when there were too many requests at the same time. 192.168.1.254 is just a consumer-grade router, so it's fair to assume it couldn't handle a lot of concurrent requests. To test this assumption, I needed to update CoreDNS' configuration.
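
As a quick sanity check outside Kubernetes, the router theory can also be poked at by firing a burst of parallel lookups from the node and watching for failures; a rough sketch (the domain and the count are arbitrary):

# Fire 100 concurrent lookups at the router; refused or timed-out queries are reported
for i in $(seq 1 100); do
  ( dig +short +time=2 +tries=1 raynix.info @192.168.1.254 > /dev/null 2>&1 || echo "query $i failed" ) &
done
wait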

According to the official documentation, all I needed to do was change the forward directive in the configuration. I changed the default upstream to Google's public DNS:

# kubectl edit cm coredns -n kube-system
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health {
          lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          ttl 30
        }
        prometheus :9153
        # this was forward . /etc/resolv.conf
        forward . 8.8.8.8 8.8.4.4
        cache 60
        loop
        reload
        loadbalance
    }
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system

The running CoreDNS pods won't pick up this change automatically, so I gave them a quick restart:

kubectl rollout restart deploy coredns -n kube-system
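
To make sure the new upstream was actually in use, the rollout can be checked and a lookup run through cluster DNS from a throwaway pod (the dnsutils image below is just the one used in the Kubernetes DNS debugging docs; any image with nslookup would do):

kubectl rollout status deploy coredns -n kube-system
# Resolve an external name via cluster DNS from a one-off pod
kubectl run dnsutils --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  -- nslookup raynix.info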

Boom! Problem fixed. The calico-kube-controllers pod is also no longer restarting. As you can see, when the CPU is flat out, some pods can fail their health checks, and the resulting restarts make the CPU even busier.
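
The restart counts are easy to keep an eye on:

# The RESTARTS column for calico-kube-controllers should stop climbing now
kubectl get pods -n kube-system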

🙂