Recently I needed to improve an alert defined with Prometheus Alertmanager so that it triggers when fewer than half of the minimum replicas are up, e.g.
number_of_available_replicas / number_of_minimum_replicas < 0.5
I had a look at the metrics already collected by kube-state-metrics in my Kubernetes cluster. The number of healthy replicas can be queried with the kube_pod_status_ready metric, and the minimum number of replicas with the kube_horizontalpodautoscaler_spec_min_replicas metric.
I couldn’t simply use
sum by(some_label) (kube_pod_status_ready{namespace=~"some-pattern"}) / kube_horizontalpodautoscaler_spec_min_replicas{namespace=~"some-pattern"}
because the two metrics don't share the same set of labels, so Prometheus can't pair up the series on the two sides of the division. After some RTFM, here's the expression that finally works, using the on syntax:
sum by(horizontalpodautoscaler) (label_replace(kube_pod_status_ready{namespace=~"some-pattern", condition="true"}, "horizontalpodautoscaler", "$1", "pod", "(.*)(-[^-]+){2}")) / on(horizontalpodautoscaler) kube_horizontalpodautoscaler_spec_min_replicas{namespace=~"some-pattern"} < %s
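To see why the label sets matter, here is a minimal Python sketch of one-to-one vector matching. It is a toy model, not Prometheus code: the divide helper and all label values (my-app, prod) are made up for illustration, and it only covers the simple case where each side has at most one series per key.

```python
# Toy model of PromQL one-to-one vector matching; not Prometheus code.
# A series is a (labels dict, value) pair. All label values are made up.

def divide(left, right, on=None):
    """Divide left by right, pairing series whose match keys are equal.

    on=None  -> compare the full label sets (PromQL's default behaviour)
    on=[...] -> compare only the listed labels (PromQL's on(...))
    """
    def key(labels):
        names = sorted(labels) if on is None else on
        # A missing label is treated like an empty one.
        return tuple((n, labels.get(n, "")) for n in names)

    right_by_key = {key(labels): value for labels, value in right}
    result = []
    for labels, value in left:
        k = key(labels)
        if k in right_by_key:
            result.append((dict(k), value / right_by_key[k]))
    return result

# Left side: ready pods already summed per HPA name.
ready = [({"horizontalpodautoscaler": "my-app"}, 1.0)]
# Right side: the HPA metric carries extra labels such as namespace.
min_replicas = [({"horizontalpodautoscaler": "my-app", "namespace": "prod"}, 4.0)]

print(divide(ready, min_replicas))                                # no pair found
print(divide(ready, min_replicas, on=["horizontalpodautoscaler"]))  # one pair
```

Without on, the two label sets differ, so nothing is paired and the result is empty. With on, only the horizontalpodautoscaler label is compared and the ratio can be computed.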
Some explanation:
- on(label) tells Prometheus to match the series of the two metrics using only the listed labels, which is very useful in this scenario
- label_replace(instant_vector, label_to_create, label_value, from_label, from_regex) adds (not quite sure why it has replace in its name) a new label called label_to_create with the value label_value, built by matching from_regex against the existing label from_label; capture groups from the regex can be referenced in label_value as $1, $2 and so on
- sum by(label) computes a sum grouped by the provided label. kube-state-metrics exposes one kube_pod_status_ready series per pod and condition (true, false, unknown), with the value 1 for the condition that currently holds, so summing the condition="true" series does the trick very well to count all ready pods of a replica set managed by a Horizontal Pod Autoscaler (HPA)
🙂
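The regex passed to label_replace can be tried out locally. The sketch below is only an approximation in Python: the hpa_label helper and the pod names are hypothetical, and re.fullmatch stands in for the fact that Prometheus anchors regexes at both ends. It also shows an assumption the expression relies on: stripping the last two dash-separated suffixes of a Deployment pod name yields the Deployment name, so the HPA has to carry that same name for the match to work.

```python
import re

# Same pattern as in the expression above. Prometheus anchors regexes at
# both ends, which re.fullmatch reproduces here.
POD_NAME = re.compile(r"(.*)(-[^-]+){2}")

def hpa_label(pod):
    """Return what "$1" would expand to, or None if the pattern does not match."""
    m = POD_NAME.fullmatch(pod)
    return m.group(1) if m else None

# Hypothetical pod names: a Deployment-style pod and a StatefulSet-style pod.
for pod in ["my-app-7d9f8c6b5-x2k4q", "db-0"]:
    print(pod, "->", hpa_label(pod))
```

A name with fewer than two suffixes, such as db-0, does not match at all. In that case label_replace returns the series unchanged, so no horizontalpodautoscaler label is added to it.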