There are a lot of different solutions when it comes to collecting metrics, I found myself happy with this hybrid solution.
Telegraf is an agent written in Go for collecting metrics from the system it’s running on.
It’s developed by Influxdata the people behind InfluxDB, but Telegraf has a lot of outputs plugins and can be used without InfluxDB.
Many different platform (FreeBSD, Linux, x86, Arm …) are offered and only one single static binary (Thanks to Golang) is needed to deploy an agent.
Prometheus is a time series database for your metrics, with an efficient storage.
It’s easy to deploy, no external dependencies, it’s gaining traction in the community because it’s a complete solution, for example capable of discovering your targets inside a Kubernete cluster.
Here is a simple configuration to discover both products.
You can deploy the agent on every hosts you want to monitor but need only one Prometheus running.
Download Prometheus for your platform and edit a config file named
scrape_configs: - job_name: 'telegraf' scrape_interval: 10s static_configs: - targets: ['mynode:9126']
Prometheus is a special beast in the monitoring world, the agents are not connecting to the server, it’s the opposite the server is scrapping the agents.
In this config we are creating a job called
telegraf to be scrapped every 10s connecting to mynode host on port
That’s all you need to run a Prometheus server, start it by specifying a path to store the metrics and the path of the config file:
prometheus -storage.local.path /opt/local/var/prometheus -config.file prometheus.yml
The server will listen on port
9090 for the HTTP console.
Download Telegraf agent for your platform and edit
[[outputs.prometheus_client]] listen = "127.0.0.1:9126" # Read metrics about cpu usage [[inputs.cpu]] ## Whether to report per-cpu stats or not percpu = true ## Whether to report total system cpu stats or not totalcpu = true ## If true, collect raw CPU time metrics. collect_cpu_time = false # Read metrics about memory usage [[inputs.mem]] # Read metrics about network interface usage [[inputs.net]] ## By default, telegraf gathers stats from any up interface (excluding loopback) ## Setting interfaces will tell it to gather these explicit interfaces, ## regardless of status. interfaces = ["en2"]
Remember this is the node agent “client” but since Prometheus server will connect it, you are providing a listening endpoint.
Starts the agent with
telegraf -config telegraf.conf
There are many more inputs plugins for telegraf for example you can monitor all your Docker instance.
# Read metrics about docker containers [[inputs.docker]] endpoint = "unix:///var/run/docker.sock" ## Only collect metrics for these containers, collect all if empty container_names =  ## Timeout for docker list, info, and stats commands timeout = "5s" ## Whether to report for each container per-device blkio (8:0, 8:1...) and ## network (eth0, eth1, ...) stats or not perdevice = false ## Whether to report for each container total blkio and network stats or not total = false
It’s also capable of monitoring third parties product like MySQL, Cassandra …
Prometheus is provided with a visual HTTP console & query tool available on port
The query language is described here.
The console can’t be really used as a dashboard, you can use Grafana which can speak directly to prometheus.
Install or run Grafana with docker
docker run -i -p 3000:3000 -e "GF_SECURITY_ADMIN_PASSWORD=mypassword" grafana/grafana
Point your browser to the port
Add a Prometheus data source and point the host to your Prometheus server port
Then create your dashboard, here are some queries to display the telegraf agents:
You can easily instrument your own development using the client libraries.
There is also a Prometheus gateway for the short lived jobs, so you can batch to the gateway between the scrap period.
It’s a simple setup but capable of handling a lot of data in different contexts, system monitoring & instrumentation.