Context: I use Google Cloud Managed Service for Prometheus (GMP, as I call it for convenience) as the central piece of my observability stack. In a nutshell, it’s Prometheus managed by Google. GMP runs a collector Prometheus pod on each node as a DaemonSet. The collector Prometheus scrapes metrics within its node and forwards them to the managed backend.
Recently, one of the collectors kept crashing constantly, which caused a loss of metrics from the node where that collector runs. So I took a look at it:
$ k get pods
...
collector-dwbnq    1/2    CrashLoopBackOff    16 (78s ago)    57m
...
It’s definitely struggling. To see what’s really going on, the k describe command usually helps.
$ k describe pod collector-dwbnq
...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
So it ran out of memory – or rather, it wanted more memory than its resource limits allowed. From past experience, a Prometheus instance consumes more memory when it scrapes more metrics – seems like a no-brainer, right? But why did this one end up with so many metrics on its plate?
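As a sanity check, the collector container’s memory requests and limits can be read straight off the pod spec – something like the following, where the gmp-system namespace is an assumption about where the managed collectors live in your cluster:

# Print each container's name and its resource requests/limits.
$ k get pod collector-dwbnq -n gmp-system \
    -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources}{"\n"}{end}'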
I listed all the pods running on the same node, and it became pretty obvious – quite a few prometheus-exporter pods happened to be scheduled on that node, and each of them yields thousands of metrics. This node had practically become a metrics hotspot. How to de-hotspot it? I’ve used PodAntiAffinity before to spread pods across the cluster, so this was another good use for it (see the sketch below).
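For the listing part, k get pods -A --field-selector spec.nodeName=<node> does the trick. And here is a minimal sketch of the anti-affinity bit, added to each exporter Deployment’s pod template; the app: prometheus-exporter label is a made-up example, use whatever labels your exporters actually carry:

# Soft anti-affinity: prefer not to co-schedule exporter pods on the same node.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: prometheus-exporter   # hypothetical label
        topologyKey: kubernetes.io/hostname

I went with the preferred (soft) flavor here; the required flavor would refuse to schedule once there are more exporter pods than nodes.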
The problem wasn’t solved immediately once the prometheus-exporter pods were scattered to other nodes, though – the collector Prometheus still kept crashing. Why? I had solved this kind of issue before: Prometheus keeps a WAL (write-ahead log) that it replays after a restart. That was the problem here – the instance went OOM while replaying the WAL from the previous crash… So I had to drain and delete the node to let it start fresh, as deleting the pod alone didn’t help.
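For completeness, the reset was roughly the usual drain-and-delete routine – the node name below is a placeholder, and on GKE you may additionally need to delete the underlying Compute Engine instance so the node pool brings up a genuinely fresh VM:

# Evict everything except DaemonSet pods, then remove the node from the cluster.
$ k drain gke-pool-1-abcd1234-xyz9 --ignore-daemonsets --delete-emptydir-data
$ k delete node gke-pool-1-abcd1234-xyz9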
🙂