Solved: Google Managed Prometheus Kept Crashing


Context: I use Google Cloud Managed Service for Prometheus (I call it GMP for convenience) as the central piece of my observability stack. In a nutshell, it’s Prometheus managed by Google. GMP runs a collector Prometheus pod on each node as a DaemonSet. The collector scrapes metrics within its node and forwards them to Google’s managed backend.

Recently, one of the collectors kept crashing, which caused a loss of metrics from the node it runs on. So I took a look at the collector:

$ k get pods
...
collector-dwbnq                  1/2     CrashLoopBackOff   16 (78s ago)     57m
...

It’s definitely struggling. To see what’s really going on, the k describe command usually helps.

$ k describe pod collector-dwbnq
...
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled

So it was OOMKilled: the container tried to use more memory than its resource limit allowed. From past experience, a Prometheus instance consumes more memory when it scrapes more metrics – seems like a no-brainer, right? But why did this one get so many metrics on its plate?
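To confirm the memory pressure, you can compare the container’s memory limit with its live usage. A quick sketch, assuming GMP’s default gmp-system namespace and the pod name from above (kubectl top needs metrics-server or GKE’s built-in metrics):

```shell
# Read the memory limit declared on the collector's containers
kubectl -n gmp-system get pod collector-dwbnq \
  -o jsonpath='{.spec.containers[*].resources.limits.memory}'

# Show live per-container memory usage for the same pod
kubectl -n gmp-system top pod collector-dwbnq --containers
```

If the usage hovers near the limit right before each restart, the OOMKilled status lines up.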

I listed all the pods running on the same node, and the cause looked pretty obvious – quite a few prometheus-exporter pods happened to be scheduled there, and each of them yields thousands of metrics. In practice, this node had become a metrics hotspot. How to de-hotspot it? I’ve used PodAntiAffinity before to distribute pods across the cluster, and this is another good use for it.
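For reference, a minimal podAntiAffinity sketch on the exporter’s pod template; the app label is an assumption, adjust it to match your exporter Deployment:

```yaml
# Sketch only: "app: prometheus-exporter" is a hypothetical label.
# Prefer (soft rule) not to co-locate exporter pods on the same node.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: prometheus-exporter
          topologyKey: kubernetes.io/hostname
```

Using the preferred (soft) form keeps the pods schedulable even when there are more replicas than nodes; the required form would leave the extras Pending.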
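Listing the pods on a single node is a one-liner with a field selector; the node name below is hypothetical, take yours from the crashing pod’s description:

```shell
# List every pod scheduled on the given node, across all namespaces
kubectl get pods --all-namespaces \
  --field-selector spec.nodeName=gke-my-cluster-pool-1234 -o wide
```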

The problem wasn’t solved immediately once the prometheus-exporter pods were scattered to other nodes, though – the collector Prometheus still kept crashing. Why? I’ve run into this kind of issue before: Prometheus keeps a WAL (write-ahead log) on disk, which it replays after a restart. That was the problem here – the instance went OOM while replaying the WAL left over from before the crash. Deleting the pod didn’t work (the WAL apparently survived on the node), so I had to drain and delete the node to let the collector start fresh.
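The drain-and-delete step looks roughly like this; the node name is hypothetical, and on GKE the deleted node gets replaced by the node pool’s autoscaler or managed instance group:

```shell
# Evict regular pods; DaemonSet pods (like the collector) are ignored,
# and emptyDir-backed data on the node is discarded
kubectl drain gke-my-cluster-pool-1234 \
  --ignore-daemonsets --delete-emptydir-data

# Remove the node object so a fresh replacement comes up
kubectl delete node gke-my-cluster-pool-1234
```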

🙂