It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn't.
```
$ kops rolling-update cluster --yes
...
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
...
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
...
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"
```
From the AWS console I could see that the new master instance was running and the old one had been terminated. There was one catch, though: yy.yy.yy.yy was not the IP of the new master instance!
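A quick way to confirm a mismatch like this is to compare what the API DNS name resolves to against the new master's actual IP. This is a sketch, not the exact commands I ran: the domain comes from the log above, and the kops tag filter is an assumption based on common kops tagging.

```shell
# Resolve the API endpoint that cluster validation is hitting.
dig +short api.my.kops.domain

# Look up the public IP of the new master instance.
# The tag key/value below are assumptions; check your instance's tags.
aws ec2 describe-instances \
  --filters "Name=tag:k8s.io/role/master,Values=1" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text
```

If the two IPs differ, the Route 53 record is stale and validation will keep timing out against the terminated instance.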
I manually updated the api.internal CNAMEs of the Kubernetes cluster in Route 53, and the issue went away quickly. I assume the DNS update for the new master failed for some reason, but I was happy to see that everything else worked as expected.
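For completeness, the same manual fix can be applied from the CLI with an UPSERT change batch. Everything below is a placeholder sketch: the hosted-zone ID, record name, TTL, and IP are illustrative, and kops often manages these records as A records pointing at the master IP, so adjust the Type and Value to match what is actually in your zone.

```shell
# Point the internal API record at the new master.
# Zone ID, name, and IP are placeholders, not real values.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.internal.my.kops.domain",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.20"}]
      }
    }]
  }'
```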