Don’t Panic When a Kubernetes Master Fails

It was business as usual while I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.

$ kops rolling-update cluster --yes
...
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
...
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
...
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"

From the AWS console I could see that the new master instance was running and the old one had been terminated. There was one catch though: the IP yy.yy.yy.yy in the error was not the IP of the new master instance!

I manually updated the api and api.internal CNAMEs of the Kubernetes cluster in Route 53 and the issue went away quickly. I assume the DNS update for the new master failed for some reason, but I was happy to see that everything else worked as expected.
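If you would rather not click through the console, the same fix can be prepared as a Route 53 change batch. Below is a minimal sketch, not what I actually ran: the function name, the placeholder IPs and the TTL are all assumptions for illustration, and the record type is left as a parameter, so check which type your hosted zone actually uses for these two records.

```python
import json


def build_change_batch(cluster_domain, api_target, internal_target,
                       record_type="A", ttl=60):
    """Build a Route 53 UPSERT change batch for the api and api.internal records.

    api_target / internal_target are the values the records should point at
    (for example the new master's public and private IPs). All defaults here
    are assumptions; adjust them to match the existing records in your zone.
    """
    targets = {
        f"api.{cluster_domain}": api_target,
        f"api.internal.{cluster_domain}": internal_target,
    }
    changes = [
        {
            "Action": "UPSERT",  # create the record, or overwrite it if it exists
            "ResourceRecordSet": {
                "Name": name,
                "Type": record_type,
                "TTL": ttl,
                "ResourceRecords": [{"Value": value}],
            },
        }
        for name, value in targets.items()
    ]
    return {"Comment": "Point API records at the new master", "Changes": changes}


if __name__ == "__main__":
    # Placeholder values only; substitute the new master's real addresses.
    batch = build_change_batch("my.kops.domain", "203.0.113.10", "10.0.0.10")
    print(json.dumps(batch, indent=2))
```

Save the output to a file and it should be usable with `aws route53 change-resource-record-sets --hosted-zone-id <your zone id> --change-batch file://change.json`.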

🙂

Kops: Add Policies for Migrated Apps

When migrating old applications to a Kubernetes (k8s) cluster provisioned by kops, a lot of things might break. One common cause is that the nodes’ IAM role is missing a policy the application relies on.

By default, the nodes of a kops-provisioned k8s cluster have the following permissions:

ec2:Describe*
ecr:GetAuthorizationToken
ecr:BatchCheckLayerAvailability
ecr:GetDownloadUrlForLayer
ecr:GetRepositoryPolicy
ecr:DescribeRepositories
ecr:ListImages
ecr:BatchGetImage
route53:ListHostedZones
route53:GetChange
// The following permissions are scoped to the AWS Route53 HostedZone used to bootstrap the cluster
// arn:aws:route53:::hostedzone/$hosted_zone_id
route53:ChangeResourceRecordSets
route53:ListResourceRecordSets
route53:GetHostedZone

Additional policies can be added to the nodes’ role by editing the cluster spec:

kops edit cluster ${CLUSTER_NAME}

Then add something like:

spec:
  additionalPolicies:
    node: |
      [
        {
          "Effect": "Allow",
          "Action": ["dynamodb:*"],
          "Resource": ["*"]
        },
        {
          "Effect": "Allow",
          "Action": ["es:*"],
          "Resource": ["*"]
        }
      ]

The change takes effect after running:

kops update cluster ${CLUSTER_NAME} --yes

The new policy can be reviewed in the AWS IAM console.
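Since the policy is a JSON array embedded in YAML, a stray comma is easy to miss. Here is a minimal sketch of a sanity check to run on the JSON before pasting it into the cluster spec. The function name is made up for this post, and the check only covers the simple Effect/Action/Resource shape used above, not everything IAM accepts (it ignores NotAction, Condition and friends).

```python
import json

# Keys every statement in the example above carries. This is an assumption
# about the simple case, not the full IAM policy grammar.
REQUIRED_KEYS = {"Effect", "Action", "Resource"}


def check_statements(policy_json):
    """Return a list of problems found in a JSON array of IAM statements."""
    try:
        statements = json.loads(policy_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(statements, list):
        return ["top level must be a JSON array of statements"]

    problems = []
    for i, stmt in enumerate(statements):
        if not isinstance(stmt, dict):
            problems.append(f"statement {i}: not a JSON object")
            continue
        missing = REQUIRED_KEYS - stmt.keys()
        if missing:
            problems.append(f"statement {i}: missing {sorted(missing)}")
        if "Effect" in stmt and stmt["Effect"] not in ("Allow", "Deny"):
            problems.append(f"statement {i}: Effect must be Allow or Deny")
    return problems


if __name__ == "__main__":
    policy = """
    [
      {"Effect": "Allow", "Action": ["dynamodb:*"], "Resource": ["*"]},
      {"Effect": "Allow", "Action": ["es:*"], "Resource": ["*"]}
    ]
    """
    # An empty list means the statements passed these basic checks.
    print(check_statements(policy))
```

An empty result only means the JSON is well-formed and has the expected keys; whether a wildcard like "dynamodb:*" on every resource is broader than the app needs is still worth a look.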

Most lines were copied from here: https://github.com/kubernetes/kops/blob/master/docs/iam_roles.md

🙂