Prometheus: Discovering Services with Consul

In my previous post, I detailed moving my home monitoring over to Prometheus. I’ve gained huge insights into my home network (and a few external services I rely on), and have been very happy with it.

Adding new endpoints has been pretty straightforward. I have been using Ansible to generate the prometheus.yml configuration file, using variables to generate each section of the scrape configuration. This has worked equally well for both services exposing native Prometheus endpoints (e.g. cAdvisor or Traefik) and for the numerous exporters I am running.

The issue with this approach is that it requires reloading the Prometheus configuration every time I add a service and/or an endpoint. It also requires a central point of configuration management to decide what is monitored, rather than each host announcing that it has a Prometheus-compatible metrics endpoint.

Enter Service Discovery

Service Discovery

As Wikipedia describes: -

Service discovery is the automatic detection of devices and services offered by these devices on a computer network.

Put another way, it allows an application to dynamically discover services, rather than the services being statically defined in the application’s configuration.

For Prometheus, there are a number of methods it can use for service discovery. These range from talking to cloud provider APIs (like AWS, Azure, GCE) and DNS-based discovery (using SRV records) to querying the Kubernetes API for running services.

I have chosen Consul, a configuration and service store by HashiCorp (who also created Terraform, Packer and Vagrant).

Consul

Consul uses a Server/Client-style approach. The recommendation for production usage is a minimum of three servers (for quorum); in a home environment, however, you can run as few as one.
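To make the quorum recommendation concrete, here is a small sketch of the arithmetic. It assumes the usual majority rule (more than half of the servers must agree); the function names are mine, purely for illustration.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a cluster with the given number of servers."""
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return servers - quorum(servers)


for n in (1, 3, 5):
    print(f"{n} server(s): quorum {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

With a single server the quorum is one and no failure is tolerated, which is acceptable at home but is the reason three is the suggested minimum elsewhere.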

Agents register with the server(s), and will supply a list of services that are running on them. Adding additional services can be done via the Consul CLI, the API, or you can do it using files that are in the Consul configuration directory.

The HashiCorp documentation on setting up a cluster is very good, so I would advise reading it and following it if you want to set up a cluster of your own. I made a couple of changes to suit my environment, as detailed below.

Server Configuration

My setup on the server is configured as such: -

/etc/systemd/system/consul.service

[Unit]
Description="HashiCorp Consul - A service mesh solution"
Documentation=https://www.consul.io/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/consul.d/consul.hcl

[Service]
Type=notify
User=consul
Group=consul
ExecStart=/usr/local/bin/consul agent -config-dir=/etc/consul.d/
ExecReload=/usr/local/bin/consul reload
KillMode=process
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

The above runs the Consul binary, and looks to the /etc/consul.d/ directory for configuration.

/etc/consul.d/consul.hcl

datacenter = "noisepalace"
data_dir = "/opt/consul"
encrypt = "${CONSUL_ENCRYPTION_KEY}"
retry_join = ["192.168.0.7"]
bind_addr = "192.168.0.7"

performance {
  raft_multiplier = 1
}

The bind_addr is statically set here because the server also runs a number of Docker containers, so Consul doesn’t know which interface to bind to without it. The retry_join parameter is used to discover the server. In this case it is discovering itself, but the Consul server also runs the client, so it still needs to know how to contact the server.

To create the encryption key, use consul keygen. This key will be used by all your nodes in the “datacenter” (in this case, the confines of my house).
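If you are curious what such a key looks like, the sketch below generates something of the same shape. It assumes the key is simply 32 random bytes, base64-encoded, which is my understanding of what recent Consul versions produce (older versions used shorter keys). The function name is hypothetical, and consul keygen remains the tool to use for real.

```python
import base64
import secrets


def generate_gossip_key(num_bytes: int = 32) -> str:
    """Return num_bytes of cryptographically secure randomness, base64-encoded.

    Assumption: this mirrors the shape of a `consul keygen` key; it is not
    taken from the Consul source.
    """
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


key = generate_gossip_key()
print(key)
```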

/etc/consul.d/server.hcl

server = true
bootstrap_expect = 1
ui = true
bind_addr = "192.168.0.7"
client_addr = "0.0.0.0"

In the above, the ui variable is used to enable the Consul Web UI. The bootstrap_expect variable is used to say how many servers are required to form the first Consul cluster. I have set it to 1, to allow a cluster with a single server.

The client_addr variable is used to say what address the API and UI listen on. In a production environment you will want to lock this down to one IP.

Client Configuration

The Client configuration is exactly the same as described in the Server Configuration section, except without the server.hcl file. You will want to change the bind_addr variable to the IP of the host it is running on (or remove it entirely if the host only has one interface with an IP address).

Verification

After you have configured the cluster, you should be able to see something like this on the server: -

$ consul members         
Node                        Address             Status  Type    Build  Protocol  DC           Segment
meshuggah                   192.168.0.7:8301    alive   server  1.6.1  2         noisepalace  <all>
config-01                   192.168.0.220:8301  alive   client  1.6.1  2         noisepalace  <default>
db-01                       192.168.0.229:8301  alive   client  1.6.1  2         noisepalace  <default>
exodus                      192.168.0.252:8301  alive   client  1.6.1  2         noisepalace  <default>
git-01                      192.168.0.223:8301  alive   client  1.6.1  2         noisepalace  <default>
ns-03                       192.168.0.221:8301  alive   client  1.6.1  2         noisepalace  <default>
teramaze                    192.168.0.236:8301  alive   client  1.6.1  2         noisepalace  <default>
testament                   192.168.0.253:8301  alive   client  1.6.1  2         noisepalace  <default>
vanhalen                    192.168.0.3:8301    alive   client  1.6.1  2         noisepalace  <default>
vektor                      192.168.0.251:8301  alive   client  1.6.1  2         noisepalace  <default>
vpn-01                      192.168.0.222:8301  alive   client  1.6.1  2         noisepalace  <default>
vps-shme                    192.168.100.1:8301  alive   client  1.6.1  2         noisepalace  <default>
vyos-01                     192.168.0.225:8301  alive   client  1.6.1  2         noisepalace  <default>

Excuse the mixed naming scheme, as I’m halfway between everything being named after bands (physical machines) and purpose (virtual machines). I will standardize at some point…

The Consul UI should also be available at this point, which you’ll be able to see at http://${YOUR-SERVER-IP}:8500/ui/ (plain HTTP, as no TLS is configured in the files above).

Consul Interface

Where are my services?

Once Consul is set up, it needs to know about the services you want to expose on each client. As noted, these can be added via the Consul CLI (using the consul services register command), via the API, or using files in the Consul configuration directory.

To add a service via a file, it needs to be formatted something like the below: -

{
  "service": {
    "name": "node_exporter",
    "tags": ["node_exporter", "prometheus"],
    "port": 9100
  }
}

The tags are optional, but they are useful in identifying services. They can also be used to filter what services Prometheus will use (which I’ll explain later in this post).
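If you have more than a handful of services, generating these files is less error-prone than writing them by hand. Below is a minimal sketch in Python; the service_definition helper is hypothetical, and it only builds and prints the JSON rather than writing into /etc/consul.d for you.

```python
import json


def service_definition(name, port, tags=None):
    """Build a Consul service definition in the same shape as the file above."""
    service = {"name": name, "port": port}
    if tags:
        # Tags are optional, so only include the key when there are some.
        service["tags"] = list(tags)
    return {"service": service}


definition = service_definition("node_exporter", 9100, ["node_exporter", "prometheus"])
print(json.dumps(definition, indent=2))
```

Redirect the output to a file in the Consul configuration directory and the agent will pick it up on its next reload.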

If the above is added in your /etc/consul.d directory on your agents, you can then run consul reload for the new service to be picked up. The agent will then inform the Server that this service exists on this node.

$ consul catalog services                      
consul
node_exporter

$ consul catalog nodes -service=node_exporter     
Node                        ID        Address        DC
config-01                   6652b349  192.168.0.220  noisepalace
db-01                       b403eac7  192.168.0.229  noisepalace
exodus.noisepalace.home     e378b4ed  192.168.0.252  noisepalace
git-01                      9f30a62e  192.168.0.223  noisepalace
glap                                  192.168.0.234  noisepalace
meshuggah                   2d6e78b5  192.168.0.7    noisepalace
ns-03                       e300899f  192.168.0.221  noisepalace
teramaze.noisepalace.home   c10a3be9  192.168.0.236  noisepalace
testament.noisepalace.home  0b1c0103  192.168.0.253  noisepalace
tpap                                  192.168.0.245  noisepalace
vanhalen.noisepalace.home   b74d9bd5  192.168.0.3    noisepalace
vektor                      ad374347  192.168.0.251  noisepalace
vpn-01                      15c80eaa  192.168.0.222  noisepalace
vps-shme                    9a29d4ca  192.168.100.1  noisepalace
vyos-01                     15659201  192.168.0.225  noisepalace

As the Agent is informing the Server of what services it has, rather than the Server defining what services exist on the agents, this forms the basis of automatic service discovery. It is no longer dependent on what a Server has configured.
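The same catalog is also available over the HTTP API, which is what Prometheus will end up reading. As a rough sketch, the snippet below turns catalog entries into host:port targets. It assumes the /v1/catalog/service/<name> endpoint and the Address, ServiceAddress and ServicePort fields it returns; the helper names are mine, and the sample data is trimmed down so the example runs without a Consul server.

```python
import json
import urllib.request


def catalog_service_url(server, service):
    """URL of the catalog endpoint listing every node that runs a service."""
    return f"http://{server}/v1/catalog/service/{service}"


def targets_from_catalog(entries):
    """Turn catalog entries into host:port strings.

    The service-level address wins when it is set; otherwise the node
    address is used.
    """
    targets = []
    for entry in entries:
        host = entry.get("ServiceAddress") or entry["Address"]
        targets.append(f"{host}:{entry['ServicePort']}")
    return targets


def discover(server, service):
    """Query a live Consul server (not called in this example)."""
    with urllib.request.urlopen(catalog_service_url(server, service), timeout=5) as response:
        return targets_from_catalog(json.load(response))


# Offline sample, shaped like (a subset of) a real catalog response.
sample = [
    {"Node": "config-01", "Address": "192.168.0.220", "ServiceAddress": "", "ServicePort": 9100},
    {"Node": "db-01", "Address": "192.168.0.229", "ServiceAddress": "", "ServicePort": 9100},
]
print(targets_from_catalog(sample))
```

Against my setup, discover("192.168.0.7:8500", "node_exporter") would be the live equivalent.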

Services without an agent

Not all services you will monitor can run an agent. For example, you can install a Node Exporter on OpenWRT (written in Lua rather than Go), but OpenWRT does not support running Consul.

To add external services, I have found adding them through the API is the easiest method.

Create a file that looks like the following: -

{
  "Node": "tpap",
  "Address": "192.168.0.245",
  "NodeMeta": {
    "external-node": "true",
    "external-probe": "true"
  },
  "Service": {
    "ID": "node_exporter",
    "Service": "node_exporter",
    "Tags": ["node_exporter", "prometheus"],
    "Port": 9100
  },
  "Checks": [
    {
      "Name": "http-check",
      "status": "passing",
      "Definition": {
        "http": "http://192.168.0.245:9100",
        "interval": "30s"
      }
    }
  ]
}

The above defines the External Node (in this case, an OpenWRT router), the service running on it, and a basic health check.

To apply this to Consul, run curl --request PUT --data @external.json localhost:8500/v1/catalog/register. This needs to run on one of your Consul servers.
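If you would rather script this than use curl, the same request can be built with the Python standard library. This is only a sketch: build_register_request is a hypothetical helper, the payload is the node and service part of the JSON above (health check omitted for brevity), and the call that actually sends the request is left commented out.

```python
import json
import urllib.request


def build_register_request(payload, consul="localhost:8500"):
    """Build (but do not send) a PUT to the catalog register endpoint."""
    return urllib.request.Request(
        url=f"http://{consul}/v1/catalog/register",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


external_node = {
    "Node": "tpap",
    "Address": "192.168.0.245",
    "NodeMeta": {"external-node": "true", "external-probe": "true"},
    "Service": {
        "ID": "node_exporter",
        "Service": "node_exporter",
        "Tags": ["node_exporter", "prometheus"],
        "Port": 9100,
    },
}

request = build_register_request(external_node)
print(request.get_method(), request.full_url)

# Uncomment to send it to a running Consul agent:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```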

Afterwards, the service will appear in your Consul catalog.

Prometheus Integration

To start making use of Consul with Prometheus, the prometheus.yml file will need updating with the details of your Consul server(s). An example configuration is below: -

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'consul'
    consul_sd_configs:
      - server: '192.168.0.7:8500'
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,prometheus,.*
        action: keep
      - source_labels: [__meta_consul_service]
        target_label: job

The first job is the standard Prometheus endpoint. The second job, however, talks to Consul and retrieves the services.

As noted earlier, you can use Tags to filter what services are used. By default Consul adds the consul service into its catalog of existing services, which does not expose a Prometheus-compatible endpoint natively (it can be enabled, but it requires some changes to the default endpoint).

By using tags, you can filter out the Consul service. You can also have different scrape configurations for different kinds of jobs, while still using Consul for discovery of the services.
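To see why the regex in the consul job has a comma on either side of the tag, here is a small simulation of the keep rule. It assumes Prometheus joins the tags with a comma and also puts a comma at the start and end of __meta_consul_tags, and that relabel regexes must match the whole value. The helper names are mine.

```python
import re


def consul_tags_label(tags, separator=","):
    """Mimic __meta_consul_tags: tags joined by, and wrapped in, the separator."""
    return separator + separator.join(tags) + separator


# Same pattern as in the relabel_configs section of the consul job.
KEEP = re.compile(r".*,prometheus,.*")


def is_kept(tags):
    """True when the 'keep' action would retain a target with these tags."""
    return KEEP.fullmatch(consul_tags_label(tags)) is not None


print(is_kept(["node_exporter", "prometheus"]))  # True
print(is_kept([]))                               # False (e.g. the consul service)
print(is_kept(["prometheus_old"]))               # False (no partial tag matches)
```

The surrounding commas are what stop a tag such as prometheus_old from matching by accident.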

Different kinds of jobs?

A good example of where you might want to have different scrape configurations, while still using the Consul cluster, is something like the Blackbox Exporter. The exporter itself runs on a server, but it is effectively a proxy for HTTP(S), ICMP, DNS and TCP probes to an arbitrary list of endpoints (e.g. Google DNS, your ISP-provided home router, a Roku smart TV device etc.).

The Blackbox Exporter configuration in Prometheus requires relabelling the endpoint you’re targeting to be proxied via the exporter itself, e.g.

  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [icmp_ipv4]
    static_configs:
      - targets:
        - 192.168.0.1
        - 192.168.0.3
        - 192.168.0.7
        - 192.168.0.40
        - 192.168.0.42
        - 192.168.0.220
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

This style of configuration is different from what you’d require for the Node Exporter, hence you would use a different scrape job to get the endpoints from Consul.
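The relabelling in the blackbox job is easier to follow when you trace what happens to a single target. The sketch below applies the three rules, in order, to a plain dictionary of labels. It is an illustration of the idea rather than how Prometheus implements relabelling, and the function name is hypothetical.

```python
def blackbox_relabel(labels, exporter="127.0.0.1:9115"):
    """Apply the three relabel rules from the blackbox job, in order."""
    relabeled = dict(labels)
    # 1. The original address becomes the ?target= query parameter.
    relabeled["__param_target"] = relabeled["__address__"]
    # 2. The instance label keeps the name of the endpoint being probed.
    relabeled["instance"] = relabeled["__param_target"]
    # 3. The scrape itself is redirected to the Blackbox Exporter.
    relabeled["__address__"] = exporter
    return relabeled


result = blackbox_relabel({"__address__": "192.168.0.1"})
print(result)

# The request Prometheus ends up making for this target:
url = f"http://{result['__address__']}/probe?module=icmp_ipv4&target={result['__param_target']}"
print(url)
```

So Prometheus scrapes the exporter on 127.0.0.1:9115, while the resulting metrics are still labelled with the device that was actually probed.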

Prometheus Targets

Looking at the targets and service discovery sections in the Prometheus UI, you’ll see the following once it discovers services from Consul: -

Prometheus Consul Service Discovery

Prometheus Consul Targets

Now, whenever a new machine is added to my network (running Node Exporter and the Consul Agent), Prometheus will pick it up on its next scrape of Consul.

Services Discovered!

I’m still in the process of moving my home Prometheus setup to use Consul, but already I’m benefiting from it.

I’m also deploying Consul and the services via Salt rather than Ansible (although I’ll save that for another blog post…), meaning the moment a machine is added to the Salt master, it will soon be monitored by Prometheus. Perfect!