MySQL/ Aurora to Google BigQuery

Google BigQuery(BQ) is Google’s column storage, it’s super fast when you need to query over many gigabytes of data.

BQ defaults to utf8, so it makes sense the tables in MySQL/ Aurora are also utf8 encoded.

I use Python and SQL Alchemy to load data to BQ. It’s a waste for SA’s ORM features but the raw connection in SA is nearly as fast as a naked MySQL connection, also I’m already familiar with SA.

First, make sure the MySQL URL has the correct encoding, eg. “db?charset= utf8”. It took my hours to figure out why some Russian characters became ??? without the charset set while some Spanish characters were fine.

There’s a number of ways to load data to BG, I have tried using REST API, or Google Storage. Using Google Storage is like 10x faster than REST, no surprise. Just like the Amazon S3 + Redshift combo. As for the data source format, JSON is modern and CSV is history because if there’s new line quoted in CSV, it will be limited to 4GB according to BQ’s documents.

BQ supports gzipped JSON which is sweet consider you only need to pay for 1/10 of the data traffic. This is how I stream MySQL output to gzipped JSON using ujson and gzip libraries.

with gzip.open(temp_dir + json_file, 'wb') as gzfile:
  for row in a_list_of_dictionaries:
    ujson.dump(row, gzfile, ensure_ascii=False)
    gzfile.write('\n')

Below is a snippet I used to map MySQL’s data types to BQ:

schema_sql = """select CONCAT('{"name": "', COLUMN_NAME, '","type":"', DATA_TYPE, '"}') as json from information_schema.columns where TABLE_SCHEMA = :database AND TABLE_NAME = :table;"""
fields = [] 
results = connection.execute(text(schema_sql), {'database':database, 'table': table} )
for row in results:
  field = ujson.loads(getattr(row, 'json'))
  if re.match('^[0-9]', field['name']):
    field['name'] = bq_prefix + field['name']
  if re.match('(bool|boolean)', field['type']):
    field['type'] = 'BOOLEAN'
  elif re.match('.*int.*', field['type']):
    field['type'] = 'INTEGER'
  elif re.match('.*(float|decimal|double).*', field['type']):
    field['type'] = 'FLOAT'
  elif re.match('.*(datetime|timestamp).*', field['type']):
    field['type'] = 'TIMESTAMP'
  elif re.match('.*binary.*', field['type']):
    field['type'] = 'BYTES'
  else:
    field['type'] = 'STRING'
  fields.append(field)

Using gsutil is very simple, similar to s3cmd.

call(['gsutil', 'cp', json_file, bucket])

And later when I load the dumped JSON I can use these fields to rebuild the schema in BQ:

gbq_schema = ','.join([ "{0}:{1}".format(col['name'], col['type']) for col in fields ])
call(['bq', 'load', '--source_format', 'NEWLINE_DELIMITED_JSON', '--replace', gbq_table, gs_source, gbq_schema])

That’s about it. :)

 

Change Ganeti’s Network Configuration

Ganeti is a cluster virtual server management software tool built on top of existing virtualization technologies such as Xen or KVM and other open source software. ”

This is how I changed the secondary network configuration using Ganeti command line tools.

1, First, say I need to change the network from 10.0.0.0/24 to 10.1.0.0/16, I first will remove all 10.0.0.0/24 NICs from all instances. The following command will remove the last NIC from the instance:

gnt-instance modify --net remove <instance>

2, After all 10.0.0.0/24 NICs have been removed from all instances, the network can be disconnected:

gnt-network disconnect <network>

3, Next is to remove the network CIDR from the cluster:

 gnt-network remove <network>

4, Re-add the network with the new CIDR:

gnt-network add --network=10.1.0.0/16 <network>

5, Re-connect the network to the cluster:

gnt-network connect<network> bridged <bridge>

6, Re-add the NIC from the new network to every instance:

 gnt-instance modify --net add:network=<network>,ip=pool <instance>

7, The new NIC won’t be effected until the instance is rebooted by Ganeti:

 gnt-instance reboot <instance>

8, Not sure if there’s a way to pickup the IP automatically, at least I can assign the IP manually by editing /etc/ network/ interfaces with the new IP.

9, Execute `ifup` to bring up the NIC. That’s it!  :)

望子成龙

貌似我只是一名很普通的家长,随着笑笑渐渐长大,我也不自觉的开始有各种期望:希望她学习好,希望我们感兴趣的事情她也一样感兴趣,希望我没做到的事情她将来能够做到,等等。

有了期望,就难免有失望。因为总要有一些优秀的“别人家的孩子”,让家长们不安和恐惧。这个我也不例外,偶尔会给笑笑施加压力,让她更“努力”。有一天,我看到了Linkin Park的歌曲 Numb(麻木)的MTV,故事里的女孩子喜欢哥特风格,酷爱绘画,在数学课上偷着画画结果被老师训斥,然后在家里她的妈妈不停的向她抱怨。我很容易就对号入座了,因为笑笑也很喜欢绘画,很可能她也会在数学课上偷偷画画,因为她的算术一直不是很好。很难说我没有抱怨甚至训斥过笑笑。
wp-1458301679310.jpg

也许这就是本能吧。然而本能都是对的吗?如果是的,那教育也就没必要存在了,不是吗?

希望在将来的日子里,我们能有一个平和的心态,多鼓励,少训斥。我们做的不好的方面,很可能我们的孩子也不会很强。如果一个酒鬼爸爸训斥儿子酗酒,那是不是很理智呢?

至于成功,我当然希望孩子是成功的。但成功的标准是什么?身心健康又功成名就自然最好,若是没这个天份,只能选择其一的话,那还是身心健康吧。

When some see coincidence, I see consequence