Use Fluentd and Elasticsearch to Analyse Squid Proxy Traffic

TL;DR This is a quick guide to set up Fluentd + Elasticsearch integration to analyse Squid Proxy traffic. In the example below, Fluentd td-agent is installed on the same host as Squid Proxy, and Elasticsearch is installed on another host. The OS is Ubuntu 20.04.

Useful links:
– Fluentd installation: https://docs.fluentd.org/installation/install-by-deb
– Elasticsearch installation: https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html

The Squid logs need to be readable by td-agent, which can be done by adding the td-agent user to the proxy group:

$ sudo usermod --groups proxy -a td-agent
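
A quick sanity check that td-agent can now read the log (sudo -u starts a fresh process, so the new group membership applies without waiting for a service restart):

$ id td-agent
$ sudo -u td-agent head -n 1 /var/log/squid/access.log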

The configuration for td-agent looks like this:

<source>
  @type tail
  @id squid_tail
  <parse>
    @type regexp
    expression /^(?<timestamp>[0-9]+)[\.0-9]* +(?<elapsed>[0-9]+) (?<userIP>[0-9\.]+) (?<action>[A-Z_]+)\/(?<statusCode>[0-9]+) (?<size>[0-9]+) (?<method>[A-Z]+) (?<URL>[^ ]+) (?<rfc931>[^ ]+) (?<peerStatus>[^ ]+)/(?<peerIP>[^ ]+) (?<mime>[^ ]+)/
    time_key timestamp
    time_format %s
  </parse>
  path /var/log/squid/access.log
  tag squid.access
</source>
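
Two optional additions worth considering: a pos_file inside the <source> block above, so td-agent remembers its read position across restarts (the path below is my own choice, any td-agent-writable location works), and a dry-run to validate the config before restarting:

  pos_file /var/log/td-agent/squid-access.log.pos

$ sudo td-agent --dry-run -c /etc/td-agent/td-agent.conf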

<match squid.access>
  @type elasticsearch
  host <elasticsearch server IP>
  port 9200
  logstash_format true
  flush_interval 10s
  index_name fluentd
  type_name fluentd
  include_tag_key true
  user elastic
  password <elasticsearch password>
</match>

The key is to get the regular expression to match the Squid access log, which looks like this:

1598101487.920 240256 192.168.10.111 TCP_TUNNEL/200 1562 CONNECT www.google.com.au:443 - HIER_DIRECT/142.250.66.163 -
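
Since Fluentd regular expressions are Ruby regular expressions, the expression can be checked against a real log line before restarting td-agent, e.g. with a Ruby one-liner (prints the MatchData with all named captures, or nil if the expression doesn't fit):

$ head -n 1 /var/log/squid/access.log | ruby -ne 'p $_.match(%r{^(?<timestamp>[0-9]+)[\.0-9]* +(?<elapsed>[0-9]+) (?<userIP>[0-9\.]+) (?<action>[A-Z_]+)/(?<statusCode>[0-9]+) (?<size>[0-9]+) (?<method>[A-Z]+) (?<URL>[^ ]+) (?<rfc931>[^ ]+) (?<peerStatus>[^ ]+)/(?<peerIP>[^ ]+) (?<mime>[^ ]+)})'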

Then I can use the fields defined in the regex, such as userIP or URL, in Elasticsearch queries.
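
For example, a quick search from the command line (the logstash-* index pattern comes from logstash_format true in the match section; the IP is just the one from the sample entry above):

$ curl -u elastic:<elasticsearch password> \
    'http://<elasticsearch server IP>:9200/logstash-*/_search?q=userIP:192.168.10.111&pretty'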

🙂

Install Fluentd with Ansible

Fluentd has been a popular open source log aggregation framework for a while now. I'll try to give it a spin with Ansible. There are quite a few existing Ansible playbooks to install Fluentd out there, but I would like to do it from scratch, just to understand how it works.

From the installation guide page, I can grab the script and its dependencies and then translate them into Ansible tasks:

---
# roles/fluentd-collector/tasks/install-debian.yml
- name: install os packages
  package:
    name: '{{ item }}'
    state: latest
  with_items:
    - libcurl4-gnutls-dev
    - build-essential

- name: install fluentd on debian/ubuntu
  raw: "curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent2.sh | sh"

Then it can be included by the main task:

# roles/fluentd-collector/tasks/main.yml
# (incomplete)
- include: install-debian.yml
  when: ansible_os_family == 'Debian'

On the log-collecting end, I need to configure /etc/td-agent/td-agent.conf to let fluentd (the stable release is called td-agent) receive syslog, tail other logs, and then forward the data to the central collector end. Here's some sample configuration with Jinja2 template placeholders:

<match *.**>
  type forward
  phi_threshold 100
  hard_timeout 60s
  <server>
    name mycollector
    host {{ fluent_server_ip }}
    port {{ fluent_server_port }}
    weight 10
  </server>
</match>
<source>
  type syslog
  port 42185
  tag {{ inventory_hostname }}.system
</source>

{% for tail in fluentd.tails %}
<source>
  type tail
  format {{ tail.format }}
  time_format {{ tail.time_format }}
  path {{ tail.file }}
  pos_file /var/log/td-agent/pos.{{ tail.name }}
  tag {{ inventory_hostname }}.{{ tail.name }}
</source>
{% endfor %}
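
The syslog <source> above only listens on UDP port 42185; each host's own rsyslog still needs to forward to it, e.g. with a drop-in like this (the file name is my own convention; a single @ means UDP forwarding in rsyslog):

# /etc/rsyslog.d/90-fluentd.conf
*.* @127.0.0.1:42185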

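The tails loop expects a fluentd.tails variable; the structure below is what I inferred from the template, shown as a group_vars example:

# group_vars/all.yml (assumed structure)
fluentd:
  tails:
    - name: nginx_access
      file: /var/log/nginx/access.log
      format: nginx
      time_format: '%d/%b/%Y:%H:%M:%S %z'
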
At the aggregator’s end, a sample configuration can look like:

<source>
  type forward
  port {{ fluentd_server_port }}
</source>

<match *.**>
  @type elasticsearch
  logstash_format true
  flush_interval 10s
  index_name fluentd
  type_name fluentd
  include_tag_key true
  user {{ es_user }}
  password {{ es_pass }}
</match>

Then fluentd/td-agent aggregates all the logs from its peers and forwards them to Elasticsearch in Logstash format.
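
To tie it together, the templated config can be deployed by the role with a template task plus a restart handler (file names below are my own convention):

# roles/fluentd-collector/tasks/main.yml (continued)
- name: deploy td-agent configuration
  template:
    src: td-agent.conf.j2
    dest: /etc/td-agent/td-agent.conf
  notify: restart td-agent

# roles/fluentd-collector/handlers/main.yml
- name: restart td-agent
  service:
    name: td-agent
    state: restarted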

🙂