A Load Balancer for Cloud SQL Replicas


TL;DR: As of Aug 2022, there’s no out of box load balancing for Google Cloud SQL read replicas. So I built one for MySQL replicas with HAProxy with the ability to dynamically reload when number of replicas has changed.

My design of the HAProxy pod in Kubernetes

It will be quite straight forward to run a few HAProxy pods in Kubernetes as a service. To seamlessly reload HAProxy pods without dropping MySQL connections I came out with the design above, using 2 sidecars and a shared volume to do the job.

The gcloud sidecar

The official gcloud container image(gcr.io/google.com/cloudsdktool/google-cloud-cli) is used with the below entry-point script, to check replica instances and generate config file for HAProxy:

#!/bin/bash
# rather fail than to generate a broken config
set -eu

function generate_config {
# template for HAProxy config
cat <<EOF
global
    stats socket /var/run/haproxy.sock mode 600 expose-fd listeners level user

defaults
    mode tcp
    timeout client 3600s
    timeout connect 5s
    timeout server 3600s

frontend stats
    bind 0.0.0.0:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 10s
    http-request use-service prometheus-exporter if { path /metrics }

frontend mysql
    bind 0.0.0.0:3306
    default_backend mysql_replicas

backend mysql_replicas
    balance leastconn
EOF
    # list cloud sql replicas and format the output as HAProxy backends
    /google-cloud-sdk/bin/gcloud sql instances list -q \
        --filter="labels.host-class=replica" \
        --format='value(name, ipAddresses.ipAddress)' \
            |while read instance_name instance_ip; do \
                echo "    server $instance_name $instance_ip:3306 check"
            done
}

# a loop to check changes continously
while true; do
    echo "Generating a haproxy config file..."
    generate_config > $HAPROXY_CONFIG_FILE
    echo "Waiting for $RELOAD_INTERVAL seconds..."
    sleep $RELOAD_INTERVAL
done

The HAProxy sidecar

I use the official HAProxy container image to check and validate configuration changes, and if all pass then reload the main HAProxy container. Here’s the entry-point script:

#!/bin/sh
# rather to let it fail than doing something strange
set -eu

md5_file=${HAPROXY_CONFIG_FILE}.md5

function test_config_file {
    # if md5 checksum exists and it matche the config file
    # then the config file is considered unchanged
    if [ -f $md5_file ] && diff <(md5sum $HAPROXY_CONFIG_FILE) $md5_file; then
        return 1
    fi
    # otherwise the config file has been updated and need to be validated
    /usr/local/sbin/haproxy -f $HAPROXY_CONFIG_FILE -c
    return $?
}

while true; do
    if test_config_file; then
        echo "Configuration has been updated and validated..."
        # update the checksum
        md5sum $HAPROXY_CONFIG_FILE > $md5_file
        echo "Reloading the HAProxy..."
        pkill -HUP haproxy
    fi
    echo "Waiting for $RELOAD_INTERVAL seconds..."
    sleep $RELOAD_INTERVAL
done

All together

note: I didn’t include the configMap containing the 2 scripts above, nor the service and service account resources.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: cloudsql-haproxy
  name: cloudsql-haproxy
  namespace: cloudsql
spec:
  replicas: 2
  selector:
    matchLabels:
      app: cloudsql-haproxy
  template:
    metadata:
      annotations:
        # for prometheus scraping
        prometheus.io/port: "9102"
        prometheus.io/scrape: "true"
      labels:
        app: cloudsql-haproxy
    spec:
      containers:
      - image: haproxy:2.6.2-alpine
        name: cloudsql-haproxy
        ports:
        - containerPort: 3306
          name: mysql
          protocol: TCP
        - containerPort: 8404
          name: stats
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /stats
            port: 8404
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 500m
            memory: 200Mi
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: config-dir
      - command:
        - /usr/local/bin/haproxy/config-gen-loop.sh
        env:
        - name: HAPROXY_CONFIG_FILE
          value: /usr/local/etc/haproxy/haproxy.cfg
        - name: RELOAD_INTERVAL
          value: "300"
        image: gcr.io/google.com/cloudsdktool/google-cloud-cli:397.0.0-alpine
        name: gcloud
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: config-dir
        - mountPath: /usr/local/bin/haproxy
          name: haproxy-scripts
      - command:
        - /usr/local/bin/haproxy/config-reloader-loop.sh
        env:
        - name: HAPROXY_CONFIG_FILE
          value: /usr/local/etc/haproxy/haproxy.cfg
        - name: RELOAD_INTERVAL
          value: "20"
        image: haproxy:2.6.2-alpine
        name: reloader
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - mountPath: /usr/local/etc/haproxy
          name: config-dir
        - mountPath: /usr/local/bin/haproxy
          name: haproxy-scripts
      # gcloud container runs as root by default, so it's easier to let haproxy run as root as well
      securityContext:
        runAsUser: 0
      # this service account needs the cloudsql.viewer IAM role
      serviceAccount: haproxy
      serviceAccountName: haproxy
      # important line to allow inter-container process signaling
      shareProcessNamespace: true
      volumes:
      - configMap:
          # ensure the scripts are executable
          defaultMode: 0550
          name: haproxy-scripts
        name: haproxy-scripts
      - emptyDir: {}
        name: config-dir

🙂