Working with a Big Corporation

So it’s been a while since I started this job at a big corporation. I always enjoy new challenges, and now my wish has been granted, though not in a very good way.

Things work quite differently here. There are big silos and layers between teams and departments, so the challenges are often not technical in nature. How unexpected.

Still, there are lots of things that can be improved with technology, and here’s one example. When I was migrating an old web application stack from on-premises infrastructure to AWS, the AWS landing zone had already been provisioned with a dual-VPC setup. I really miss the days of working with Kubernetes clusters, when I could just run kubectl exec -ti ... and get a terminal session quickly.

Now things look like the year 2000 again and I need to use an SSH ProxyCommand, though without old-school static IP addresses. Ansible dynamic inventory is quite handy in most cases, but here it failed due to some unknown corporate firewall rules. I still have bash, aws-cli and jq, so this is my handy bash script to connect to an instance of an auto scaling group via a bastion host (both can be rebuilt and change their IPs).

#!/bin/bash
# Return the private IP(s) of the instances in a CloudFormation stack,
# looked up via the aws:cloudformation:stack-name tag.
function get_stack_ip(){
  aws ec2 describe-instances \
    --filters "Name=tag:aws:cloudformation:stack-name,Values=$1" \
    | jq '.Reservations[] | select(.Instances[0].PrivateIpAddress != null) | .Instances[0].PrivateIpAddress' \
    | tr -d '"'
}

Then it’s easy to use this function to get the IPs of the bastion stack and the target stack, for example:

IP_BASTION=$(get_stack_ip bastion_stack)
IP_TARGET=$(get_stack_ip target_stack)
# "user" is a placeholder for the actual SSH login on both hosts.
ssh -o ProxyCommand="ssh user@$IP_BASTION nc %h %p" user@$IP_TARGET
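
If the auto scaling group runs more than one instance, get_stack_ip returns several IPs. A small tweak (just a sketch, with "user" still a placeholder) picks the first one for an ad-hoc session:

# Pick one IP when the auto scaling group runs several instances.
IP_TARGET=$(get_stack_ip target_stack | head -n 1)
ssh -o ProxyCommand="ssh user@$IP_BASTION nc %h %p" user@$IP_TARGET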

🙂

Nicer Deployment with Kubernetes

The default strategy for a rolling update of a Kubernetes Deployment is to reduce the capacity of the current replica set and then add capacity to the new replica set. This means the total processing power of the app can be reduced a bit during the deployment.

I was a bit surprised to find that the default strategy works this way, but luckily it’s not hard to fine-tune. According to the doc here: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment only a few lines are needed to change the strategy:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: my-deploy
  namespace: my-project
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 40%
  revisionHistoryLimit: 3

maxUnavailable: 0 means the total capacity of the deployment will never drop below the desired replica count, and maxSurge: 40% means the new replica set can add up to an extra 40% of the desired capacity before the old pods are drained. For example, with 10 desired replicas there will be at most 14 pods and at least 10 available pods at any point during the rollout.

Not a big improvement, but revisionHistoryLimit: 3 keeps only 3 old replica sets around for rolling back the deployment. The default is to keep them all, which is quite over-provisioned from my point of view.
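
To see the strategy and the history limit in action, the rollout can be observed and reverted from the command line; a quick sketch using the deployment name and namespace from the manifest above:

kubectl -n my-project rollout status deployment/my-deploy   # watch the rollout finish
kubectl -n my-project get replicasets                       # at most 3 old replica sets are kept
kubectl -n my-project rollout undo deployment/my-deploy     # roll back to the previous revision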

🙂

Ansible, Packer and Configurations

I use Ansible as a provisioner for Packer to build AMIs, which serve as the base image of our development environment. When Ansible is invoked by Packer, it’s not obvious whether it uses the same ansible.cfg as when I run the ansible-playbook command in a terminal.

Here’s how to make sure Ansible in a Packer session uses the correct ansible.cfg file.

First, an environment variable is supplied in Packer’s template, because the ANSIBLE_CONFIG environment variable takes precedence over any other config file location Ansible could find:

    {
      "type": "ansible",
      "playbook_file": "../../ansible/apps.yml",
      "ansible_env_vars": ["ANSIBLE_CONFIG=/tmp/ansible.cfg"],
      "extra_arguments": [
        "--skip-tags=packer-skip",
        "-vvv"
      ]
    },

The “ANSIBLE_CONFIG=/tmp/ansible.cfg” entry tells Ansible to use /tmp/ansible.cfg.

With the ansible.cfg file placed at /tmp and the extra debug switch -vvv, I can see in the output whether the config file is picked up.
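
For completeness, here’s a minimal sketch of the workflow around it; the ansible.cfg location in the repository and the template file name are assumptions, not from my actual setup:

# Place the config where the Packer template expects it, then build.
cp ../../ansible/ansible.cfg /tmp/ansible.cfg
packer build packer.json

# Outside of Packer, the same environment variable shows which config Ansible loads.
ANSIBLE_CONFIG=/tmp/ansible.cfg ansible-config view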


Don’t Panic When the Kubernetes Master Fails

It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.

$ kops rolling-update cluster --yes
...
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
...
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
...
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duration of "5m0s"

From the AWS console I could see that the new master instance was running and the old one had been terminated. There was one catch though: the IP yy.yy.yy.yy was not the IP of the new master instance!

I manually updated the api and api.internal CNAME records of the Kubernetes cluster in Route 53 and the issue went away quickly. I assume the DNS update for the new master failed for some reason, but I was happy to see that everything else worked as expected.
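
For reference, a sketch like the one below can compare the master’s actual private IP with what Route 53 serves; the hosted zone ID placeholder and the kops master tag used for filtering are assumptions on my part:

# Private IP of the running master instance, filtered by the tag kops puts on masters.
aws ec2 describe-instances \
  --filters "Name=tag:k8s.io/role/master,Values=1" "Name=instance-state-name,Values=running" \
  | jq -r '.Reservations[].Instances[].PrivateIpAddress'

# What the cluster API record currently resolves to in Route 53.
aws route53 list-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXX \
  --query "ResourceRecordSets[?Name=='api.internal.my.kops.domain.']"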

🙂