A Few Handy Linux Tools

I've been busy lately and haven't blogged for a few months. Here are a few small tools I've found very handy recently; all of them, of course, are Linux-related.

Mutt is a command-line mail client; I mainly use it to send email from scripts. Installing mutt is simple, just run

apt-get install mutt

and that's it. Mutt's settings can live in the current user's home directory:

cat ~/.mutt/muttrc
set smtp="smtp://my.server.com:25/"
set from="No reply <[email protected]>"

The command for sending mail is quite simple, which makes it easy to chain together with other software:

echo "This is the message" | mutt -s "This is the subject" -c "[email protected]" -a /tmp/attachment.txt /tmp/att2.txt -- [email protected]

HTTPie is a great helper for debugging HTTP by hand, and it's much, much simpler than curl. Can you remember all of curl's switches? I certainly can't.
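
For example, a GET with a query parameter and a POST with a JSON body look roughly like this (the URL, token, and field names are just placeholders):

# query parameters use ==, headers use Header:value, JSON body fields use = (strings) or := (numbers)
http GET https://api.example.com/items page==2 Authorization:"Bearer token123"
http POST https://api.example.com/items name=foo price:=9.99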

Pigz is a parallel implementation of gzip. I haven't benchmarked it yet, but it shouldn't be any slower 🙂
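
A typical use is as a drop-in replacement for gzip at the end of a tar pipeline, something like this (the directory name and thread count are placeholders; by default pigz uses all available cores):

tar -cf - some_dir | pigz -p 8 > some_dir.tar.gz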

Pendulum (a well-chosen name) is a Python date/time library: a more human-friendly take on the standard library functions, with syntax closer to English.
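
A small sketch of what "closer to English" means, assuming a recent pendulum release (the timezone is arbitrary):

import pendulum

now = pendulum.now('Asia/Shanghai')
tomorrow = now.add(days=1)
print(now.to_iso8601_string())     # ISO 8601 timestamp with timezone
print(tomorrow.diff_for_humans())  # human-readable difference, something like "1 day from now"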

P.S. Linux is 25 years old!


Distribute cron jobs across hours/minutes with Ansible

This is a handy trick to run a batch of cron jobs on different hour/minute combinations, so they won't all fire at the same time and put unnecessary pressure on the server.

The key is to use `with_indexed_items` and Jinja2 math:

- name: ansible daily cronjob {{ item.1 }}
  cron: user=ansible name=ansible-daily-{{ item.1 }} hour={{ item.0 % 24 }} minute={{ (item.0 * 5) % 60 }} job="/usr/local/run.sh {{ item.1 }}"
  with_indexed_items:
    - job_foo
    - job_bar

So `job_foo` will run at hour 0, minute 0, `job_bar` at hour 1, minute 5, and so on.
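
If you then look at the ansible user's crontab, the generated entries should look roughly like this (Ansible's cron module adds the `#Ansible:` marker comments):

#Ansible: ansible-daily-job_foo
0 0 * * * /usr/local/run.sh job_foo
#Ansible: ansible-daily-job_bar
5 1 * * * /usr/local/run.sh job_bar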

🙂

MySQL/Aurora to Google BigQuery

Google BigQuery (BQ) is Google's columnar storage service; it's super fast when you need to query over many gigabytes of data.

BQ defaults to utf8, so it makes sense to keep the tables in MySQL/Aurora utf8 encoded as well.
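
One quick way to check which collation your tables actually use (the database name is a placeholder):

SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'db';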

I use Python and SQLAlchemy to load data into BQ. It's a waste of SA's ORM features, but the raw connection in SA is nearly as fast as a naked MySQL connection, and I'm already familiar with SA.

First, make sure the MySQL URL has the correct encoding, e.g. "db?charset=utf8". It took me hours to figure out why some Russian characters became ??? without the charset set, while some Spanish characters were fine.
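
A minimal sketch of the connection setup, assuming the default MySQLdb driver (user, password, host, and database name are placeholders):

from sqlalchemy import create_engine

# charset=utf8 at the end of the URL is the important part
engine = create_engine('mysql://user:password@my.server.com/db?charset=utf8')
connection = engine.connect()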

There are a number of ways to load data into BQ; I have tried the REST API and Google Cloud Storage. Using Google Storage is about 10x faster than REST, no surprise, just like the Amazon S3 + Redshift combo. As for the data source format, JSON is modern and CSV is history, because according to BQ's documentation a CSV file with quoted newlines is limited to 4GB.

BQ supports gzipped JSON, which is sweet considering you only need to pay for about 1/10 of the data traffic. This is how I stream MySQL output to gzipped JSON using the ujson and gzip libraries:

import gzip
import ujson

# one JSON object per line (newline-delimited JSON), written straight into a gzip file
with gzip.open(temp_dir + json_file, 'wb') as gzfile:
  for row in a_list_of_dictionaries:
    ujson.dump(row, gzfile, ensure_ascii=False)
    gzfile.write('\n')

Below is a snippet I used to map MySQL’s data types to BQ:

import re
import ujson
from sqlalchemy import text

# Build a list of {"name": ..., "type": ...} dicts from information_schema,
# then translate MySQL data types into BQ types
schema_sql = """select CONCAT('{"name": "', COLUMN_NAME, '","type":"', DATA_TYPE, '"}') as json from information_schema.columns where TABLE_SCHEMA = :database AND TABLE_NAME = :table;"""
fields = []
results = connection.execute(text(schema_sql), {'database': database, 'table': table})
for row in results:
  field = ujson.loads(getattr(row, 'json'))
  # BQ column names cannot start with a digit, so prefix those
  if re.match('^[0-9]', field['name']):
    field['name'] = bq_prefix + field['name']
  if re.match('(bool|boolean)', field['type']):
    field['type'] = 'BOOLEAN'
  elif re.match('.*int.*', field['type']):
    field['type'] = 'INTEGER'
  elif re.match('.*(float|decimal|double).*', field['type']):
    field['type'] = 'FLOAT'
  elif re.match('.*(datetime|timestamp).*', field['type']):
    field['type'] = 'TIMESTAMP'
  elif re.match('.*binary.*', field['type']):
    field['type'] = 'BYTES'
  else:
    field['type'] = 'STRING'
  fields.append(field)

Uploading the file with gsutil is very simple, similar to s3cmd:

from subprocess import call
call(['gsutil', 'cp', json_file, bucket])

Later, when I load the dumped JSON, I can use these fields to rebuild the schema in BQ:

gbq_schema = ','.join([ "{0}:{1}".format(col['name'], col['type']) for col in fields ])
call(['bq', 'load', '--source_format', 'NEWLINE_DELIMITED_JSON', '--replace', gbq_table, gs_source, gbq_schema])

That’s about it. 🙂


We're all puppets, Laurie. I'm just a puppet who can see the strings.