Don’t Panic When a Kubernetes Master Fails

It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.

$ kops rolling-update cluster --yes
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"

In the AWS console I could see that the new master instance was running and the old one had been terminated. There was one catch, though: yy.yy.yy.yy was not the IP of the new master instance!

I manually updated the api and api.internal CNAMEs of the Kubernetes cluster in Route 53, and the issue went away quickly. I assume the DNS update for the new master failed for some reason, but I was happy to see that everything else worked as expected.
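
The same manual fix can be scripted with the AWS CLI instead of clicking through the Route 53 console. This is only a rough sketch: make_change_batch and upsert_record are hypothetical helper names, and the hosted zone ID, record type, TTL and target value are assumptions you would replace with whatever the existing records in your zone use.

```shell
#!/bin/sh
# Sketch: build a Route 53 change batch that points a record at a new value.
# make_change_batch and upsert_record are hypothetical helper names.

# Usage: make_change_batch NAME TYPE VALUE [TTL]
# Prints the JSON change batch; TTL defaults to 60 seconds (an assumption).
make_change_batch() {
  name="$1"; rtype="$2"; value="$3"; ttl="${4:-60}"
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"%s","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
    "$name" "$rtype" "$ttl" "$value"
}

# Usage: upsert_record HOSTED_ZONE_ID NAME TYPE VALUE
# Requires the AWS CLI and credentials that are allowed to change the zone.
upsert_record() {
  aws route53 change-resource-record-sets \
    --hosted-zone-id "$1" \
    --change-batch "$(make_change_batch "$2" "$3" "$4")"
}
```

After sourcing the file, something like upsert_record "$ZONE_ID" "api.${CLUSTER_NAME}" "$RECORD_TYPE" "$NEW_TARGET" would update one record; repeat it for api.internal. UPSERT is used so the call works whether or not the record already exists.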


Kops: Add Policies for Migrated Apps

When migrating old applications to a Kubernetes (k8s) cluster provisioned by kops, a lot of things might break, and one of them is a missing IAM policy on the nodes’ role.

By default, nodes of a k8s cluster have the following permissions:

 // The following permissions are scoped to AWS Route53 HostedZone used to bootstrap the cluster
 // arn:aws:route53:::hostedzone/$hosted_zone_id
 route53:ChangeResourceRecordSets, ListResourceRecordSets, GetHostedZone

Additional policies can be added to the nodes’ role by editing the cluster spec:

kops edit cluster ${CLUSTER_NAME}

Then add something like the following under spec:

    additionalPolicies:
      node: |
        [
          {
            "Effect": "Allow",
            "Action": ["dynamodb:*"],
            "Resource": ["*"]
          },
          {
            "Effect": "Allow",
            "Action": ["es:*"],
            "Resource": ["*"]
          }
        ]

The change takes effect after:

kops update cluster ${CLUSTER_NAME} --yes

The new policy can be reviewed in the AWS IAM console.
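
If you prefer the command line to the console, the role's inline policies can also be inspected with the AWS CLI. This is a minimal sketch that assumes kops names the node role nodes.<cluster name>; the three function names are hypothetical, so check the actual role name in IAM if yours differs.

```shell
#!/bin/sh
# Sketch: inspect the node role's inline policies from the command line.
# Assumption: the node role is named "nodes.<cluster name>".

# Usage: node_role_name CLUSTER_NAME
node_role_name() {
  printf 'nodes.%s\n' "$1"
}

# Usage: list_node_policies CLUSTER_NAME
# Lists the inline policy names on the node role (needs AWS CLI + credentials).
list_node_policies() {
  aws iam list-role-policies --role-name "$(node_role_name "$1")"
}

# Usage: show_node_policy CLUSTER_NAME POLICY_NAME
# Prints one inline policy document from the node role.
show_node_policy() {
  aws iam get-role-policy \
    --role-name "$(node_role_name "$1")" \
    --policy-name "$2"
}
```

Running list_node_policies "${CLUSTER_NAME}" first shows which policy names exist; pass one of them to show_node_policy to confirm that the dynamodb and es statements made it in.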

Most lines were copied from here: