Saving Images in Chrome with Selenium

Here’s how to save images in the Chrome browser using Selenium. The API has an element.screenshot_as_png() method, but apparently it’s not implemented at the moment.

With some minor changes to this answer I can save an image via the browser: http://stackoverflow.com/questions/13832322/how-to-capture-the-screenshot-of-a-specific-element-rather-than-entire-page-usin

from selenium import webdriver
from PIL import Image

#chrome_options = ...
#chrome = webdriver.Chrome(chrome_options=chrome_options)
#element = chrome.find_element_by_id('some_id')

def save_image(chrome, element, save_path):
  # in case the image isn't in the view yet
  location = element.location_once_scrolled_into_view
  size = element.size

  # saves a screenshot of the entire page
  chrome.save_screenshot(save_path)

  # uses the PIL library to open the screenshot in memory
  image = Image.open(save_path)
  left = location['x']
  top = location['y']
  right = location['x'] + size['width']
  bottom = location['y'] + size['height']
  image = image.crop((left, top, right, bottom))  # defines crop points
  image.save(save_path, 'png')  # saves new cropped image

🙂

A Simple Python IO Trick

I thought using Python’s gzip would be quite straightforward; however, the IO performance almost doubled once I wrapped it in an extra BufferedWriter:

import gzip
import io

import ujson

# temp_dir and stream_table_data() are defined elsewhere
def export_to_json_gz(schema):
  json_file = "test.json.gz"
  # the BufferedWriter batches the many small writes below into
  # fewer, larger chunks before they hit the gzip stream
  with io.BufferedWriter(gzip.open(temp_dir + json_file, 'wb')) as gzfile:
    for row in stream_table_data(schema):
      ujson.dump(row, gzfile, ensure_ascii=False)
      gzfile.write('\n')

Now the bottleneck is ujson. How can I make that part faster?
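One idea I may try (an untested sketch, reusing temp_dir and stream_table_data from the snippet above): serialize rows in batches with ujson.dumps and write each batch in a single call, so less time is spent in per-row Python overhead:

import gzip
import io

import ujson

def export_to_json_gz_batched(schema, batch_size=1000):
  # batch_size is an arbitrary guess; tune it against real data
  json_file = "test.json.gz"
  with io.BufferedWriter(gzip.open(temp_dir + json_file, 'wb')) as gzfile:
    batch = []
    for row in stream_table_data(schema):
      batch.append(ujson.dumps(row, ensure_ascii=False))
      if len(batch) >= batch_size:
        gzfile.write('\n'.join(batch) + '\n')
        batch = []
    if batch:  # flush the remainder
      gzfile.write('\n'.join(batch) + '\n')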

🙂

MySQL/Aurora to Google BigQuery

Google BigQuery (BQ) is Google’s columnar storage service; it’s super fast when you need to query over many gigabytes of data.

BQ defaults to utf8, so it makes sense for the tables in MySQL/Aurora to be utf8 encoded as well.

I use Python and SQLAlchemy (SA) to load data into BQ. SA’s ORM features are wasted here, but its raw connection is nearly as fast as a bare MySQL connection, and I’m already familiar with SA.

First, make sure the MySQL URL has the correct encoding, e.g. “db?charset=utf8”. It took me hours to figure out why some Russian characters became ??? when the charset wasn’t set, while some Spanish characters were fine.
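For reference, a minimal sketch of that setup (the credentials, host, and database name are placeholders):

from sqlalchemy import create_engine

# the charset parameter is the important part; without it some
# non-latin characters came back as ???
engine = create_engine('mysql://user:password@host/db?charset=utf8')
connection = engine.connect()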

There are a number of ways to load data into BQ; I have tried the REST API and Google Cloud Storage. Loading through Google Storage is about 10x faster than REST, no surprise: it’s just like the Amazon S3 + Redshift combo. As for the source data format, JSON is modern and CSV is history, because a CSV with quoted newlines is limited to 4GB according to BQ’s documentation.

BQ supports gzipped JSON, which is sweet considering you only need to pay for 1/10 of the data traffic. This is how I stream MySQL output to gzipped JSON using the ujson and gzip libraries:

import gzip

import ujson

with gzip.open(temp_dir + json_file, 'wb') as gzfile:
  for row in a_list_of_dictionaries:
    ujson.dump(row, gzfile, ensure_ascii=False)
    gzfile.write('\n')

Below is a snippet I used to map MySQL’s data types to BQ:

import re

import ujson
from sqlalchemy import text

schema_sql = """
    select CONCAT('{"name": "', COLUMN_NAME, '","type":"', DATA_TYPE, '"}') as json
    from information_schema.columns
    where TABLE_SCHEMA = :database AND TABLE_NAME = :table;
"""
fields = []
results = connection.execute(text(schema_sql), {'database': database, 'table': table})
for row in results:
  field = ujson.loads(getattr(row, 'json'))
  # BQ column names can't start with a digit, so prefix those
  # (bq_prefix is defined elsewhere)
  if re.match('^[0-9]', field['name']):
    field['name'] = bq_prefix + field['name']
  if re.match('(bool|boolean)', field['type']):
    field['type'] = 'BOOLEAN'
  elif re.match('.*int.*', field['type']):
    field['type'] = 'INTEGER'
  elif re.match('.*(float|decimal|double).*', field['type']):
    field['type'] = 'FLOAT'
  elif re.match('.*(datetime|timestamp).*', field['type']):
    field['type'] = 'TIMESTAMP'
  elif re.match('.*binary.*', field['type']):
    field['type'] = 'BYTES'
  else:
    field['type'] = 'STRING'
  fields.append(field)

Uploading with gsutil is very simple, similar to s3cmd:

from subprocess import call

call(['gsutil', 'cp', json_file, bucket])

Later, when loading the dumped JSON, I can use these fields to rebuild the schema in BQ:

gbq_schema = ','.join([ "{0}:{1}".format(col['name'], col['type']) for col in fields ])
call(['bq', 'load', '--source_format', 'NEWLINE_DELIMITED_JSON', '--replace', gbq_table, gs_source, gbq_schema])

That’s about it. 🙂

 

Learning TurboGears 2 Also Means Learning Genshi

Genshi is a fairly well-known HTML/XML/... templating framework for Python. As the default template component of TurboGears 2.1, Genshi has received a lot of praise. In fact, it was because TG glues together such excellent components that I decided to learn TG.

A complete explanation of master.html, the main template file in the TG2.1 sample project: http://www.turbogears.org/2.1/docs/main/master_html.html#master-html
Genshi template basics: http://genshi.edgewall.org/wiki/Documentation/templates.html

First, in TG2 you specify a Genshi template for a view (the V in MVC) via the @expose decorator. Without a template, it looks like this:

@expose()
def hello(self):
    return "Hello world from controller."

With the default development setup, visiting http://127.0.0.1:8080/hello returns the plain text

Hello world from controller.

Of course, this has little practical use beyond testing. Here is the version with a template:

@expose('example.templates.index')
def new_hello(self):
    return dict(hello="New text from controller + template ", page="new_hello")

TG then looks for example/templates/index.html under the example project as the template for new_hello, and passes two keys to the template: hello and page. Inside the index template, these keys can be used freely:

<h1 py:content="hello">old Hello, world.</h1>
Now Viewing: <span py:replace="page"/>

The difference: the content "old Hello, world." inside the <h1> is replaced by the value of hello, whereas the <span> itself does not appear at all and is entirely replaced by the value of page. Also, ${text} in a template is similar to <?php echo $text; ?> in PHP.
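For completeness, a tiny sketch of that ${...} expression syntax, reusing the same example keys:

<p>${hello} Now Viewing: ${page}</p>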