So it’s been a while since I started this job at a big corporation. I always enjoy new challenges, and now my wish has been granted, though not in a very good way.
Things work quite differently here. There are big silos and layers between teams and departments, so the challenges are not really technical in nature. How unexpected.
Still, there are lots of things that can be improved with technology. Here’s one example. When I was migrating an old web application stack from on-premises infrastructure to AWS, the AWS landing zone had already been provisioned with a dual-VPC setup. I really, really miss the days of working with Kubernetes clusters, when I could just run kubectl exec -ti ... and get a terminal session quickly.
Now things look like the year 2000 and I need to use an SSH ProxyCommand again, though without the old-school static IP addresses. Ansible dynamic inventory is quite handy in most cases, but here it failed due to some unknown corporate firewall rules. I still have bash, aws-cli and jq, so this is my handy bash function to connect to an instance of an auto scaling group via a bastion host (both can be rebuilt and change IPs).
# Print the private IPs of instances in the CloudFormation stack named by $1
stack_ip() {
  aws ec2 describe-instances \
    --filters "Name=tag:aws:cloudformation:stack-name,Values=$1" \
    | jq -r '.Reservations[].Instances[] | select(.PrivateIpAddress != null) | .PrivateIpAddress'
}
Then it’s easy to use this function to get the IPs of the bastion stack and the target stack, for example:
ssh -o ProxyCommand="ssh user@IP_OF_BASTION nc %h %p" user@IP_OF_TARGET
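Putting the two together, a quick session might look like this (the stack names and the user are hypothetical placeholders):
BASTION_IP=$(stack_ip bastion-stack)
TARGET_IP=$(stack_ip app-stack)
# hop through the bastion with nc, exactly as in the ProxyCommand above
ssh -o ProxyCommand="ssh user@$BASTION_IP nc %h %p" user@$TARGET_IP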
The default strategy for a rolling update of a Kubernetes deployment is to reduce the capacity of the current replica set and then add capacity to the new replica set. This means the total processing power of the app can dip a bit during a deployment.
I was a bit surprised to find that the default strategy works this way, but luckily it’s not hard to fine-tune. According to the doc here: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment only a few lines are needed to change the strategy (see the example manifest below):
maxUnavailable: 0 means the total capacity of the deployment will not be reduced at any point during the rollout, and
maxSurge: 40% means the new replica set can grow to 40% above the desired capacity, so new pods come up before the current replica set becomes the old one and gets drained.
Not a big improvement, but revisionHistoryLimit: 3 keeps only 3 old replica sets around for rolling back a deployment. The default is unlimited, which is quite over-provisioned from my point of view.
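Here’s a minimal sketch of a deployment manifest with these settings in place (the name, labels and image are hypothetical placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5
  revisionHistoryLimit: 3   # keep only 3 old replica sets for rollbacks
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0     # never drop below the desired capacity
      maxSurge: 40%         # allow up to 40% extra pods during the rollout
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: example/my-app:latest   # placeholder image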
I use Ansible as a provisioner for Packer to build AMIs, which serve as the base image of our development environment. When Ansible is invoked by Packer, it’s not quite obvious whether it uses the same ansible.cfg as when I run the ansible-playbook command in a terminal.
Here’s how to make sure that Ansible in a Packer session uses the correct ansible.cfg file.
First, an environment variable is supplied in Packer’s template, because an environment variable takes precedence over any other Ansible configuration that can be found:
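A minimal sketch of the relevant provisioner block, assuming the JSON template format and the ansible provisioner’s ansible_env_vars option (the playbook path is a placeholder):
{
  "type": "ansible",
  "playbook_file": "./playbook.yml",
  "ansible_env_vars": ["ANSIBLE_CONFIG=/tmp/ansible.cfg"]
}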
The line with “ANSIBLE_CONFIG=/tmp/ansible.cfg” tells Ansible to use /tmp/ansible.cfg.
With the ansible.cfg file at /tmp and the extra debug switch -vvv, I can see in the output whether the config file is picked up.
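In verbose mode Ansible prints its version banner first, which includes the config file in use; the output should look something like this (version and playbook name are placeholders):
$ ansible-playbook -vvv playbook.yml
ansible-playbook 2.7.0
  config file = /tmp/ansible.cfg
  ...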
It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.
$ kops rolling-update cluster --yes
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
I1024 08:52:50.388672 16009 instancegroups.go:188] Validating the cluster.
I1024 08:58:22.725713 16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749 16009 instancegroups.go:193] Cluster did not validate within 5m0s
error validating cluster after removing a node: cluster did not validate within a duation of "5m0s"
From the AWS console I could see that the new master instance was running and the old one had been terminated. There was one catch though: the IP yy.yy.yy.yy was not the IP of the new master instance!
I manually updated the api.internal CNAMEs of the Kubernetes cluster in Route 53 and the issue went away quickly. I assume the DNS update for the new master had failed for some reason, but I was happy to see that everything else worked as expected.
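For reference, the manual fix can also be scripted with aws-cli. This is only a hedged sketch: the hosted zone ID and the new master’s IP are placeholders, and depending on how kops manages the zone the record may be an A record rather than a CNAME:
# UPSERT the api.internal record so it points at the new master
aws route53 change-resource-record-sets --hosted-zone-id ZONE_ID \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.internal.my.kops.domain.",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "NEW_MASTER_IP"}]
      }
    }]
  }'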