Don’t Panic When Kubernetes Master Failed

It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.

$ kops rolling-update cluster --yes
...
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
...
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
...
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"

From the AWS console I could see that the new master instance was running and the old one had been terminated. There was one catch though: the IP yy.yy.yy.yy was not the IP of the new master instance!

I manually updated the api and api.internal CNAMEs of the Kubernetes cluster in Route 53 and the issue went away quickly. I assume the DNS update for the new master failed for some reason, but I was happy to see that everything else worked as expected.
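
For the record, the same fix can be scripted with the AWS CLI. This is only a sketch: the hosted zone ID, record name, type and value below are placeholders, so use whatever kops actually created for your cluster.

# placeholders: replace the hosted zone ID, record name, type and value
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.my.kops.domain",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "<new master IP>"}]
      }
    }]
  }'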

🙂

Run Google Lighthouse in Docker Container

Thanks to my colleague Simon’s suggestion, I was introduced to Google Lighthouse, an open-source Node.js tool that uses Google Chrome to audit a website’s performance.

I like Lighthouse because:

  • open source
  • good portability
  • can run as a CLI command or as a Node.js module

Here’s a sample Dockerfile to have a container ready to run Lighthouse with Google Chrome for Linux.

FROM debian:stretch

USER root
WORKDIR /root
ENV CHROME_VERSION="google-chrome-stable"

# system packages
RUN apt update -qqy && \
  apt install -qqy build-essential gnupg wget curl jq

# nodejs 10
RUN curl -sL https://deb.nodesource.com/setup_10.x | bash - && \
  apt install -qqy nodejs && \
  npm install -g lighthouse

# google-chrome
RUN wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - && \
  echo "deb http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google-chrome.list && \
  apt update -qqy && \
  apt install -qqy ${CHROME_VERSION:-google-chrome-stable}

# python3 (optional for metric processing)
RUN apt install -qqy python3 python3-pip && \
  pip3 install influxdb

# lighthouse
RUN useradd -ms /bin/bash lighthouse
USER lighthouse
WORKDIR /home/lighthouse
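
To build the image and get a shell inside a container (the image name here is just an example):

docker build -t lighthouse-audit .
docker run --rm -it lighthouse-audit bash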

Then lighthouse can be executed in the container to audit $url:

CHROME_PATH=$(which google-chrome) lighthouse $url --emulated-form-factor=none --output=json --chrome-flags="--headless --no-sandbox"

The resulting JSON will be sent to stdout, and it can easily be piped to other scripts for post-processing, e.g. parsing the JSON and extracting metrics.
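
For example, the overall performance score can be pulled out with jq, which is already installed in the image above. This is just a sketch and the exact key paths may differ between Lighthouse versions:

CHROME_PATH=$(which google-chrome) lighthouse $url --output=json --chrome-flags="--headless --no-sandbox" | jq '.categories.performance.score'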

🙂

Playing with Kubernetes Ingress Controller

It’s very easy to use Kubernetes (K8s) to provision an external service with an AWS ELB, but there’s one catch (at least for now, in 2018).

An AWS ELB is usually used with an auto scaling group and a launch configuration. With K8s, however, EC2 instances won’t be spun up directly, only pods will, which is called horizontal scaling. K8s issues AWS API calls to update the ELBs, so there’s no need for auto scaling groups or launch configurations.
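
For context, the simple setup is just a Service of type LoadBalancer, for which K8s provisions and manages an ELB automatically. A minimal sketch (all names are made up):

---
apiVersion: v1
kind: Service
metadata:
  name: my-service-elb
  namespace: my-prod
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080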

This worked like a charm until things got busy. There was a brief downtime on one of the ELBs managed by K8s, because all instances behind the ELB were marked as unhealthy, even though they were actually healthy at that moment. With help from the AWS Support team, the culprit seemed to be similar to this case: https://github.com/kubernetes/kubernetes/issues/47067.

Luckily for me, I had a gut feeling that the simple ELB implementation wasn’t best practice, and I had already started to adopt the K8s ingress controller. In this case I believe an ingress can avoid the downtime, because the routing is done internally in the K8s cluster and doesn’t involve AWS API calls. On top of that, an ingress can use one ELB for many apps, which is good because ELBs are expensive.

Here are the steps to deploy an nginx ingress controller as an HTTP (L7) load balancer:

Deploy the mandatory schema. The default replica count for the controller is 2; I changed it to 3 to have one in each availability zone:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/mandatory.yaml
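
The replica count can be changed by editing the downloaded manifest before applying it, or afterwards with kubectl. The deployment and namespace names below are what the manifest used at the time of writing, so double check them in your cluster:

kubectl -n ingress-nginx scale deployment nginx-ingress-controller --replicas=3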

Apply some customisation for an L7 load balancer on AWS; remember to use your own SSL cert if you need HTTPS termination:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/provider/aws/service-l7.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/provider/aws/patch-configmap-l7.yaml
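
The SSL cert goes into the service annotations in service-l7.yaml. The relevant bits look roughly like the following, with the ACM certificate ARN being the placeholder to replace:

metadata:
  name: ingress-nginx
  namespace: ingress-nginx
  annotations:
    # replace with the ARN of your own ACM certificate
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "arn:aws:acm:ap-southeast-2:XXXXXXXXXXXX:certificate/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"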

Then an ingress for an app can be deployed:

$ cat .k8s/prod/ingress.yaml 
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-ingress
  namespace: my-prod
  annotations:
    kubernetes.io/ingress.class: prod
spec:
  rules:
    - host: my.domain.elb
      http:
        paths:
          - path: /
            backend:
              serviceName: my-service
              servicePort: 80
    - host: my.domain.cdn
      http:
        paths:
          - path: /
            backend:
              serviceName: my-service
              servicePort: 80

Notes:

  • my-service is an ordinary NodePort service with port 80 exposed
  • the kubernetes.io/ingress.class annotation is for running multiple ingress controllers in the same K8s cluster, e.g. one for dev and another for prod
  • for now I have to duplicate the host block for each domain, because wildcards and regexes are not supported by the K8s ingress specification
  • lastly, find the ELB this ingress controller created (see the command below), then point my.domain.elb to it; the CDN domain can then use my.domain.elb as its origin.
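
The ELB hostname can be looked up from the controller's service; the service name and namespace below come from the manifests above, so adjust if yours differ:

kubectl -n ingress-nginx get svc ingress-nginx -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'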

🙂