20 Oct 2016, 08:58

Telegraf & Prometheus Swiss Army Knife for Metrics

There are a lot of different solutions when it comes to collecting metrics, I found myself happy with this hybrid solution.

Telegraf is an agent written in Go for collecting metrics from the system it’s running on.
It’s developed by Influxdata the people behind InfluxDB, but Telegraf has a lot of outputs plugins and can be used without InfluxDB.
Many different platform (FreeBSD, Linux, x86, Arm …) are offered and only one single static binary (Thanks to Golang) is needed to deploy an agent.

Prometheus is a time series database for your metrics, with an efficient storage.
It’s easy to deploy, no external dependencies, it’s gaining traction in the community because it’s a complete solution, for example capable of discovering your targets inside a Kubernete cluster.

Here is a simple configuration to discover both products.

You can deploy the agent on every hosts you want to monitor but need only one Prometheus running.

Install Prometheus

Download Prometheus for your platform and edit a config file named prometheus.yml.

  - job_name: 'telegraf'
    scrape_interval: 10s
      - targets: ['mynode:9126']

Prometheus is a special beast in the monitoring world, the agents are not connecting to the server, it’s the opposite the server is scrapping the agents.
In this config we are creating a job called telegraf to be scrapped every 10s connecting to mynode host on port 9126.

That’s all you need to run a Prometheus server, start it by specifying a path to store the metrics and the path of the config file:

prometheus -storage.local.path /opt/local/var/prometheus -config.file prometheus.yml

The server will listen on port 9090 for the HTTP console.

Install Telegraf

Download Telegraf agent for your platform and edit telegraf.conf.

    listen = ""

# Read metrics about cpu usage
  ## Whether to report per-cpu stats or not
  percpu = true
  ## Whether to report total system cpu stats or not
  totalcpu = true
  ## If true, collect raw CPU time metrics.
  collect_cpu_time = false

# Read metrics about memory usage

# Read metrics about network interface usage
  ## By default, telegraf gathers stats from any up interface (excluding loopback)
  ## Setting interfaces will tell it to gather these explicit interfaces,
  ## regardless of status.
  interfaces = ["en2"]

Remember this is the node agent “client” but since Prometheus server will connect it, you are providing a listening endpoint.
Starts the agent with telegraf -config telegraf.conf

There are many more inputs plugins for telegraf for example you can monitor all your Docker instance.

# Read metrics about docker containers
  endpoint = "unix:///var/run/docker.sock"
  ## Only collect metrics for these containers, collect all if empty
  container_names = []
  ## Timeout for docker list, info, and stats commands
  timeout = "5s"

  ## Whether to report for each container per-device blkio (8:0, 8:1...) and
  ## network (eth0, eth1, ...) stats or not
  perdevice = false
  ## Whether to report for each container total blkio and network stats or not
  total = false

It’s also capable of monitoring third parties product like MySQL, Cassandra

Read Metrics

Prometheus is provided with a visual HTTP console & query tool available on port 9090.

The query language is described here.

The console can’t be really used as a dashboard, you can use Grafana which can speak directly to prometheus.

Install or run Grafana with docker

docker run -i -p 3000:3000 -e "GF_SECURITY_ADMIN_PASSWORD=mypassword"  grafana/grafana

Add a Prometheus data source and point the host to your Prometheus server port 9090.

Then create your dashboard, here are some queries to display the telegraf agents:

  • CPU
    cpu_usage_idle{host="myhost", cpu="cpu-total"}
    cpu_usage_user{host="myhost", cpu="cpu-total"}
    cpu_usage_system{host="myhost", cpu="cpu-total"}

  • Memory

  • Docker Memory
    Legend format {{container_name}}

  • Docker CPU
    Legend format {{container_name}}

Graph prometheus


You can easily instrument your own development using the client libraries.
There is also a Prometheus gateway for the short lived jobs, so you can batch to the gateway between the scrap period.

It’s a simple setup but capable of handling a lot of data in different contexts, system monitoring & instrumentation.