Nicer Deployment with Kubernetes

The default rolling-update strategy for a Kubernetes Deployment removes capacity from the current replica set before it is fully added to the new replica set, which means the app’s total processing power can dip a bit during a deployment.

I’m a bit surprised that the default strategy works this way, but luckily it’s not hard to fine-tune. According to the doc here: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment, only a few lines are needed to change the strategy:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-deploy
  namespace: my-project
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 40%
  revisionHistoryLimit: 3

maxUnavailable: 0 means the total capacity of the deployment will not be reduced at any point, and maxSurge: 40% means up to 40% of extra capacity can be added to the new replica set before the current one becomes the old one and gets drained.
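
To put some numbers on it (the replica count here is just an assumption for illustration): with 10 desired replicas, maxSurge: 40% allows up to 14 pods to exist during a rollout, while maxUnavailable: 0 keeps at least 10 of them available the whole time. The rollout can be watched with something like:

kubectl -n my-project rollout status deployment/my-deploy
kubectl -n my-project get rs   # old and new replica sets side by side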

Not a big improvement, but revisionHistoryLimit: 3 keeps only 3 old replica sets around for rolling back the deployment. The default is unlimited, which is quite over-provisioned from my point of view.

🙂

Don’t Panic When Kubernetes Master Failed

It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.

$ kops rolling-update cluster --yes
...
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
...
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
...
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"

From the AWS console I could see the new master instance running and the old one terminated. There was one catch though: the IP yy.yy.yy.yy was not the IP of the new master instance!

I manually updated the api and api.internal CNAMEs of the Kubernetes cluster in Route 53 and the issue went away quickly. I assume the DNS update for the new master failed for some reason, but I was happy to see that everything else worked as expected.
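
If this happens again, something like the following should confirm whether the DNS record points at the right instance (a sketch: the domain comes from the log above, and the tag filter is my assumption about how kops tags master instances):

# what the API record currently resolves to
dig +short api.my.kops.domain

# the public IP of the running master, found by the kops role tag
aws ec2 describe-instances \
  --filters "Name=tag:k8s.io/role/master,Values=1" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].PublicIpAddress' --output text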

🙂

Upload Limit in Kubernetes Nginx Ingress Controller

According to https://github.com/nginxinc/kubernetes-ingress/issues/21#issuecomment-408618569, this is how to lift the upload limit in the Nginx Ingress Controller for Kubernetes after a recent update to the project:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: test-project-ingress
  namespace: test-project-dev
  annotations:
    kubernetes.io/ingress.class: dev
    nginx.ingress.kubernetes.io/proxy-body-size: 200m
spec:
  rules:
    - host: test-project.dev.com
      http:
        paths:
          - path: /
            backend:
              serviceName: test-project
              servicePort: 80

And for now the nginx pods have to be restarted before this can take effect. Hope this won’t be necessary in the future.
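
One way to bounce them (the namespace and label selector below are assumptions, adjust to however the controller was deployed):

kubectl -n nginx-ingress delete pod -l app=nginx-ingress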

🙂

Auto Scaling in Kubernetes 1.9

I updated my Kubernetes cluster from 1.8 to 1.9 recently. The upgrade process was very smooth, however the auto-scaling part seemed to be failing. Below are some notes on how I troubleshot the issue.

First I made sure that both kops and kubectl on my laptop were upgraded to 1.9:

Install kubectl 1.9:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.10/bin/linux/amd64/kubectl
chmod +x kubectl && sudo mv kubectl /usr/local/bin/

Install kops 1.9: https://github.com/kubernetes/kops/releases/tag/1.9.2

I was doing some load testing and discovered that no matter how busy the pods were, they weren’t scaled out. To see what was happening with the horizontal pod autoscaler (HPA), I used the following command:

kubectl describe hpa
...
  ScalingActive  False   FailedGetResourceMetric  the HPA was unable to compute the replica count: unable to get metrics for resource cpu: unable to fetch metrics from API: the server could not find the requested resource (get pods.metrics.k8s.io)

After some googling around, it turned out that Kubernetes 1.9 uses the new metrics server for HPA, and my cluster didn’t have it. Here’s how to install the metrics server for a Kubernetes cluster: https://github.com/kubernetes-incubator/metrics-server

To make this troubleshooting more interesting, the metrics server ran into an error too! It looked like:

Failed to get kubernetes address: No kubernetes source found.

Bug fix for the metrics server:  https://github.com/kubernetes-incubator/metrics-server/issues/105#issuecomment-412818944

In short, adding an overriding command in `deploy/1.8+/metrics-server-deployment.yaml` got it working:

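        # override the image's default command so metrics-server reads metrics
        # from the kubelet summary API, as suggested in the issue comment above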
        command:
        - /metrics-server
        - --source=kubernetes.summary_api:''
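
After redeploying the metrics server, these should start returning numbers instead of the error above (it can take a minute or two for the first metrics to appear):

kubectl top nodes
kubectl top pods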

Install the cluster autoscaler for the Kubernetes cluster: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-one-asg.yaml. I used `- image: k8s.gcr.io/cluster-autoscaler:v1.1.3` for Kubernetes 1.9. This part had no surprises and worked as expected.
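
To double-check that it picked up the auto scaling group, its logs can be tailed like below (the namespace and label selector are assumptions based on the example manifest):

kubectl -n kube-system logs -l app=cluster-autoscaler --tail=20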

Sample HPA schema:

---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: test-hpa
  namespace: test
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: test-deploy
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
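
Apply it and the HPA should start reporting current CPU utilization against the 50% target (the file name is just an example):

kubectl apply -f test-hpa.yaml
kubectl -n test get hpa test-hpa --watch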

🙂