Grant a Service Account an IAM Role in AWS/GCP

How do you grant a pod running in a Kubernetes cluster the permissions it needs to access cloud resources such as S3 buckets? The most straightforward approach is to save an API key in the pod and use it to authenticate against cloud APIs. But if the cluster is running inside the cloud, an IAM role can instead be bound to a service account in the cluster, which is both more convenient and safer.

I’ll compare how an IAM role is bound to a service account in AWS/EKS and in GCP/GKE.

AWS/EKS

EKS is the managed Kubernetes service in AWS. To bind an EKS service account to an AWS IAM role (a sketch with placeholder names follows the list):

  1. Create an IAM OIDC provider for your cluster
  2. Create an IAM role which can be assumed by the EKS service account
  3. Annotate the EKS service account to assume the IAM role
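
Here is a minimal sketch of those steps using eksctl and kubectl. The cluster name, namespace, service account name, account ID and role name are all placeholders; note that eksctl create iamserviceaccount covers both creating the role and annotating the service account, so the explicit kubectl annotate is only needed if the role was created some other way.

# 1. Create an IAM OIDC provider for the cluster (placeholder cluster name)
eksctl utils associate-iam-oidc-provider --cluster my-cluster --approve

# 2 + 3. Create an IAM role the service account can assume, and annotate the
#        service account with it (policy ARN chosen here just as an example)
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace my-namespace \
  --name my-serviceaccount \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve

# If the role was created elsewhere, the annotation alone looks like this
# (account ID and role name are placeholders)
kubectl annotate serviceaccount my-serviceaccount \
  --namespace my-namespace \
  eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/my-irsa-role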

GCP/GKE

GKE is the managed Kubernetes service in GCP. In GCP this feature is called Workload Identity (WLI); in a nutshell it binds a GKE service account to a GCP IAM service account, so it’s a bit different from the one above. The full instructions are here, but in short (a sketch with placeholder names follows the list):

  1. Enable WLI for the GKE cluster
  2. Create or update node-pool to enable WLI
  3. Create IAM service account and assign roles with necessary permissions
  4. Allow IAM service account to be impersonated by a GKE service account
  5. Annotate the GKE service account to impersonate the GCP service account
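
A minimal sketch of these steps with gcloud and kubectl; the project, region, cluster, node pool, namespace and both service account names below are placeholders, and the role granted in step 3 is just an example.

# 1. Enable WLI on the cluster
gcloud container clusters update my-cluster \
  --region australia-southeast1 \
  --workload-pool=my-project.svc.id.goog

# 2. Enable WLI metadata on an existing node pool
gcloud container node-pools update my-pool \
  --cluster my-cluster \
  --region australia-southeast1 \
  --workload-metadata=GKE_METADATA

# 3. Create the IAM service account and grant it the roles it needs
gcloud iam service-accounts create my-gsa
gcloud projects add-iam-policy-binding my-project \
  --member "serviceAccount:my-gsa@my-project.iam.gserviceaccount.com" \
  --role "roles/storage.objectViewer"

# 4. Allow the GKE service account to impersonate the IAM service account
gcloud iam service-accounts add-iam-policy-binding \
  my-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"

# 5. Annotate the GKE service account with the IAM service account
kubectl annotate serviceaccount my-ksa \
  --namespace my-namespace \
  iam.gke.io/gcp-service-account=my-gsa@my-project.iam.gserviceaccount.com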

🙂

Working with a Big Corporation

So it’s been a while since I started this job at a big corporation. I always enjoy new challenges, and now my wish has been granted, though not in a very good way.

Things work quite differently here. There are big silos and layers between teams and departments, so the challenges are not really technical in nature. How unexpected.

Still, there are lots of things that can be improved with technology; here’s one example. When I was migrating an old web application stack from on-premises infrastructure to AWS, the AWS landing zone had already been provisioned with a dual-VPC setup. I really, really miss the days of working with Kubernetes clusters, when I could just run kubectl exec -ti ... and get a terminal session quickly.

Now things look like the year 2000 again and I need to use an SSH proxy command, though without old-school static IP addresses. Ansible dynamic inventory is quite handy in most cases, but it failed here due to some unknown corporate firewall rules. I still have bash, aws-cli and jq, so here is my handy bash script to connect to one instance of an auto scaling group via a bastion host (both can be rebuilt and change IP):

#!/bin/bash
# Print the private IP of an instance belonging to the given CloudFormation stack,
# matching on the aws:cloudformation:stack-name tag and taking the first result.
function get_stack_ip(){
  aws ec2 describe-instances \
    --filters "Name=tag-key,Values=aws:cloudformation:stack-name" "Name=tag-value,Values=$1" \
    | jq -r '.Reservations[] | select(.Instances[0].PrivateIpAddress != null) | .Instances[0].PrivateIpAddress' \
    | head -n 1
}

Then it’s easy to use this function to get the IPs of the bastion stack and the target stack, for example (user below is a placeholder login name):

IP_BASTION=$(get_stack_ip bastion_stack)
IP_TARGET=$(get_stack_ip target_stack)
ssh -o ProxyCommand="ssh user@$IP_BASTION nc %h %p" user@$IP_TARGET

🙂

Don’t Panic When Kubernetes Master Failed

It was business as usual when I was upgrading our Kubernetes cluster from 1.9.8 to 1.9.10, until it wasn’t.

$ kops rolling-update cluster --yes
...
node "ip-10-xx-xx-xx.ap-southeast-2.compute.internal" drained
...
I1024 08:52:50.388672   16009 instancegroups.go:188] Validating the cluster.
...
I1024 08:58:22.725713   16009 instancegroups.go:246] Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes: Get https://api.my.kops.domain/api/v1/nodes: dial tcp yy.yy.yy.yy:443: i/o timeout.
E1024 08:58:22.725749   16009 instancegroups.go:193] Cluster did not validate within 5m0s

error validating cluster after removing a node: cluster did not validate within a duration of "5m0s"

From the AWS console I could see that the new instance for the master was running and the old one had been terminated. There was one catch though: the IP yy.yy.yy.yy was not the IP of the new master instance!

I manually updated the api and api.internal CNAMEs of the Kubernetes cluster in Route 53 and the issue went away quickly. I assume the DNS update for the new master had failed for some reason, but I’m happy to see that everything else worked as expected.
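
For reference, the same fix can be done from the CLI. This is only a sketch with a placeholder hosted zone ID and master IP, and it assumes an A record for api.internal.my.kops.domain; since kops may have created CNAMEs instead (as mentioned above), the Type and Value need to match whatever record actually exists.

NEW_MASTER_IP=zz.zz.zz.zz   # placeholder: the new master's IP
aws route53 change-resource-record-sets \
  --hosted-zone-id ZXXXXXXXXXXXX \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.internal.my.kops.domain",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "'"$NEW_MASTER_IP"'"}]
      }
    }]
  }'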

🙂

Manage AWS EBS Snapshot Life Cycle with Lambda

The timing is not so great: AWS Data Lifecycle Manager has been announced, but I can’t wait for its release. So I decided to use AWS Lambda to do some snapshot lifecycle management.

First, a role for Lambda with full access to snapshots can be created via the console (a CLI sketch follows).
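
If you prefer the CLI, a rough equivalent looks like the sketch below. The role and policy names are placeholders, the permissions are narrowed to the EC2 snapshot actions the functions below actually call, and the AWSLambdaBasicExecutionRole managed policy is attached so the functions can write CloudWatch logs.

# Placeholder role name; the trust policy lets Lambda assume the role
aws iam create-role --role-name ebs-snapshot-lambda \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "lambda.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

# Inline policy covering the snapshot operations used by the functions below
aws iam put-role-policy --role-name ebs-snapshot-lambda \
  --policy-name ebs-snapshot-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:CreateTags"
      ],
      "Resource": "*"
    }]
  }'

# Managed policy for CloudWatch logging from Lambda
aws iam attach-role-policy --role-name ebs-snapshot-lambda \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole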

To create snapshots with a Python 3.6 Lambda in AWS:

from datetime import datetime, timedelta

import boto3

def get_tag(tags, tag_name):
    for t in tags:
        if t['Key'] == tag_name:
            return t['Value']
    return 'None'
    
def get_delete_date():
    # Keep Monday snapshots for 4 weeks, all others for 1 week
    today = datetime.today()
    if today.weekday() == 0:  # Monday
        retention = 28
    else:
        retention = 7
    return (today + timedelta(days=retention)).strftime('%Y-%m-%d')
    
def snapshot_tags(instance, volume):
    # Copy the volume's attachment details into tags, then record the instance
    # name and the date on which the snapshot should be deleted
    tags = [{'Key': k, 'Value': str(v)} for k, v in volume.attachments[0].items()]
    tags.append({'Key': 'InstanceName', 'Value': get_tag(instance.tags, 'Name')})
    tags.append({'Key': 'DeleteOn', 'Value': get_delete_date()})
    return tags

def lambda_handler(event, context):
    ec2 = boto3.resource('ec2')
    # Snapshot every volume attached to the instances whose Name tag matches the pattern
    for instance in ec2.instances.filter(Filters=[{'Name': 'tag:Name', 'Values': ['AFLCDWH*']}]):
        for volume in instance.volumes.all():
            snapshot = ec2.create_snapshot(
                VolumeId=volume.id,
                Description="Snapshot for volume {0} on instance {1}".format(
                    volume.id, get_tag(instance.tags, 'Name')))
            snapshot.create_tags(Resources=[snapshot.id], Tags=snapshot_tags(instance, volume))
    return 'done'

To recycle the snapshots that are due for deletion today:

from datetime import datetime

import boto3

def lambda_handler(event, context):
    today = datetime.today().strftime('%Y-%m-%d')
    ec2 = boto3.resource('ec2')
    # Delete every snapshot whose DeleteOn tag matches today's date
    for snapshot in ec2.snapshots.filter(Filters=[{'Name': 'tag:DeleteOn', 'Values': [today]}]):
        print(snapshot.id)
        snapshot.delete()
    return 'done'

Lastly, these functions can’t finish in 3 seconds, so the default 3-second timeout will kill them. I lifted the timeout to 1 minute.
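
The timeout can also be lifted from the CLI; a quick sketch, with placeholder function names:

# Raise the timeout from the 3-second default to 60 seconds (function names are placeholders)
aws lambda update-function-configuration --function-name create-ebs-snapshots --timeout 60
aws lambda update-function-configuration --function-name delete-ebs-snapshots --timeout 60

🙂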