Prometheus+Grafana in docker

guide-by-example

WORK IN PROGRESS
Loki and caddy monitoring parts are not finished yet

Purpose

Monitoring of the host and the running containers.

Most of the stuff here is based off the magnificent stefanprodan/dockprom.
So maybe just go play with that.

Chapters

Core prometheus+grafana - to get nice dashboards with metrics from docker host and containers
Pushgateway - how to use it to push metrics into prometheus from anywhere
Alertmanager - how to use it for notifications
Loki - how to do the above things but for logs, not just metrics
Caddy monitoring - adding dashboard for reverse proxy info

Overview

Good youtube overview of Prometheus.

Prometheus is an open source system for monitoring and alerting, written in golang.
It periodically collects metrics from configured targets, makes these metrics available for visualization, and can trigger alerts.
Prometheus is a relatively young project and it uses pull type monitoring.

Glossary.

Prometheus Server is the core of the system, responsible for
- pulling new metrics
- storing the metrics in a database and evaluating them
- making metrics available through PromQL API
Targets - machines, services, applications that are monitored.
These need to have an exporter.
- exporter - a script or a service that gathers metrics on the target, converts them to prometheus server format, and exposes them at an endpoint so they can be pulled
Alertmanager - responsible for handling alerts from Prometheus Server, and sending notifications through email, slack, pushover,.. In this setup ntfy webhook will be used.
Grafana comes with its own alerts, but grafana kinda feels... b-tier
pushgateway - allows push type of monitoring. Should not be overused as it goes against the pull philosophy of prometheus. Most commonly it is used to collect data from batch jobs, or from services that have short execution time. Like a backup script.
Grafana - for web UI visualization of the collected metrics

Files and directory structure

/home/
 └── ~/
     └── docker/
         └── prometheus/
             ├──── grafana_data/
             ├──── prometheus_data/
             ├── docker-compose.yml
             ├── .env
             └── prometheus.yml

grafana_data/ - a directory where grafana stores its data
prometheus_data/ - a directory where prometheus stores its database and data
.env - a file containing environment variables for docker compose
docker-compose.yml - a docker compose file, telling docker how to run the containers
prometheus.yml - a configuration file for prometheus

The three files must be provided.
The directories are created by docker compose on the first run.

docker-compose

Prometheus - Container with some extra commands run at the start up. Setting stuff like storage, data retention (500 hours, roughly 20 days)... Bind mounted prometheus_data for persistent storage and prometheus.yml for some basic configuration.
Grafana - Container, bind mounted directory for persistent data storage
NodeExporter - an exporter for linux machines, in this case gathering the metrics of the linux machine running docker, like uptime, cpu load, memory use, network bandwidth use, disk space,...
Also bind mount of some system directories to have access to required info.
cAdvisor - an exporter for gathering docker containers metrics, showing cpu, memory, network use of each container
Runs in privileged mode and has some bind mounts of system directories to have access to required info.
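
As a quick check of the retention value mentioned for the Prometheus container, 500 hours works out to a bit under 21 days. A throwaway sketch of that arithmetic, where hours_to_days is a made-up helper name and not part of any tool:

```shell
# Convert a retention period in hours to whole days plus leftover hours.
# hours_to_days is a hypothetical helper, used only for illustration.
hours_to_days() {
  echo "$(( $1 / 24 )) days $(( $1 % 24 )) hours"
}

# 500h is the value passed to --storage.tsdb.retention.time
hours_to_days 500
```

This prints 20 days 20 hours.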

Note - ports are only exposed, not published, since the expectation is that a reverse proxy is used and the services are accessed by hostname, not by ip and port.

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    user: root
    restart: unless-stopped
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=500h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    expose:
      - 9090
    labels:
      org.label-schema.group: "monitoring"

  # WEB BASED UI VISUALISATION OF METRICS
  grafana:
    image: grafana/grafana:9.3.6
    container_name: grafana
    hostname: grafana
    user: root
    restart: unless-stopped
    env_file: .env
    volumes:
      - ./grafana_data:/var/lib/grafana
    expose:
      - 3000
    labels:
      org.label-schema.group: "monitoring"

  # HOST LINUX MACHINE METRICS EXPORTER
  nodeexporter:
    image: prom/node-exporter:v1.5.0
    container_name: nodeexporter
    hostname: nodeexporter
    restart: unless-stopped
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    expose:
      - 9100
    labels:
      org.label-schema.group: "monitoring"

  # DOCKER CONTAINERS METRICS EXPORTER
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.1
    container_name: cadvisor
    hostname: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
    expose:
      - 8080
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

.env

# GENERAL
MY_DOMAIN=example.com
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava

# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# GRAFANA EMAIL
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa

All containers must be on the same network, which is named in the .env file.
If one does not exist yet: docker network create caddy_net
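
A small sketch that creates the network only when it is missing, so it can be rerun without errors. The function name ensure_network is an assumption for illustration; docker network inspect and docker network create are the standard subcommands.

```shell
# Create the docker network only if it does not exist yet.
# ensure_network is a hypothetical helper name.
ensure_network() {
  docker network inspect "$1" >/dev/null 2>&1 || docker network create "$1"
}

# Usage: ensure_network caddy_net
```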

prometheus.yml

Official documentation.

Contains the bare minimum setup of targets from where metrics are to be pulled.
Stefanprodan sets custom, shorter scrape intervals, but I feel that's not really necessary.

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nodeexporter'
    static_configs:
      - targets: ['nodeexporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
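
Once the containers are running, the config can be sanity checked with promtool, which ships inside the prom/prometheus image. A minimal sketch, assuming the container is named prometheus as in the compose file; check_prometheus_config is a made-up helper name.

```shell
# Validate prometheus.yml from inside the running container.
# check_prometheus_config is a hypothetical helper; "prometheus" is the
# container_name from the compose file.
check_prometheus_config() {
  docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
}

# Usage: check_prometheus_config
```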

Reverse proxy

Caddy v2 is used, details here.

Caddyfile

graf.{$MY_DOMAIN} {
  reverse_proxy grafana:3000
}

prom.{$MY_DOMAIN} {
  reverse_proxy prometheus:9090
}
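
To check that both hostnames answer through Caddy, the health endpoints can be queried: /-/healthy on Prometheus and /api/health on Grafana. A sketch under the assumption that DNS and certificates are already working; health_url and check_health are made-up helper names.

```shell
# Build the health-check URL for a service behind Caddy.
# health_url and check_health are hypothetical helpers; prom and graf are
# the subdomains from the Caddyfile.
health_url() {
  case "$1" in
    prom) echo "https://prom.$2/-/healthy" ;;
    graf) echo "https://graf.$2/api/health" ;;
    *)    return 1 ;;
  esac
}

# check_health <prom|graf> <domain>; fails on a non-2xx response
check_health() {
  url=$(health_url "$1" "$2") || return 1
  curl --fail --silent --show-error "$url"
}

# Usage: check_health prom example.com && check_health graf example.com
```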

First run and Grafana configuration

login admin/admin to graf.example.com, change the password
add Prometheus as a Data source in configuration
set URL to http://prometheus:9090
import dashboards from json files in this repo
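
The data source step can also be done without clicking, through Grafana's HTTP API (POST /api/datasources). This is only a sketch, not how this guide sets it up: the helper names are made up, and the credentials are whatever was set after the first login.

```shell
# Build the JSON body for Grafana's POST /api/datasources endpoint.
# datasource_json and add_datasource are hypothetical helpers.
datasource_json() {
  printf '{"name":"%s","type":"prometheus","access":"proxy","url":"%s"}\n' "$1" "$2"
}

# add_datasource <grafana-base-url> <user:password>
add_datasource() {
  datasource_json "Prometheus" "http://prometheus:9090" |
    curl --fail --silent --show-error \
      --user "$2" \
      --header 'Content-Type: application/json' \
      --data @- \
      "$1/api/datasources"
}

# Usage: add_datasource https://graf.example.com 'admin:<new password>'
```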

These dashboards are the preconfigured ones from stefanprodan/dockprom with few changes.
docker_host.json did not show free disk space, it needed fstype changed from aufs to ext4. Also a fix for host network monitoring not showing traffic. And in all of them the time interval is set to show last 1h instead of last 15m.

docker_host.json - dashboard showing linux host machine metrics
docker_containers.json - dashboard showing docker containers metrics, except the ones labeled as monitoring in the compose file
monitoring_services.json - dashboard showing docker containers metrics of containers that are labeled monitoring

Pushgateway

The setup and real world use of pushgateway, along with small steps when learning it are in the repo - Veeam Prometheus Grafana
Including pushing information from windows powershell.
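
For a rough idea of what a push looks like: a metric in the Prometheus text format is sent with an HTTP POST to /metrics/job/<job_name> on the pushgateway. The sketch below assumes a pushgateway container reachable as pushgateway:9091 (9091 is its default port), which is not part of the compose file in this guide; metric_line, push_metric and PUSHGATEWAY_URL are made-up names.

```shell
# Format one sample in the Prometheus text exposition format.
# metric_line and push_metric are hypothetical helpers.
metric_line() {
  printf '%s %s\n' "$1" "$2"
}

# push_metric <job> <metric_name> <value>
# PUSHGATEWAY_URL is an assumption; override it to match the real setup.
push_metric() {
  metric_line "$2" "$3" |
    curl --fail --silent --show-error --data-binary @- \
      "${PUSHGATEWAY_URL:-http://pushgateway:9091}/metrics/job/$1"
}

# Usage, for example at the end of a backup script:
# push_metric backup backup_last_success_timestamp_seconds "$(date +%s)"
```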

Alertmanager

Several changes are needed

New container - alertmanager added to the compose file.
New file - alertmanager.yml bind mounted in the alertmanager container.
This file contains configuration about where and how to deliver alerts.
A selfhosted ntfy webhook is used that gets alerts to a phone app.
New file - alert.rules mounted in to prometheus container
This file defines when a value of some metric becomes an alert event.
Changed file - prometheus.yml added alerting section and the path to the rule_files

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    user: root
    restart: unless-stopped
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=500h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules:/etc/prometheus/rules/alert.rules
    expose:
      - 9090
    labels:
      org.label-schema.group: "monitoring"

  # WEB BASED UI VISUALISATION OF METRICS
  grafana:
    image: grafana/grafana:9.3.6
    container_name: grafana
    hostname: grafana
    user: root
    restart: unless-stopped
    env_file: .env
    volumes:
      - ./grafana_data:/var/lib/grafana
    expose:
      - 3000
    labels:
      org.label-schema.group: "monitoring"

  # HOST LINUX MACHINE METRICS EXPORTER
  nodeexporter:
    image: prom/node-exporter:v1.5.0
    container_name: nodeexporter
    hostname: nodeexporter
    restart: unless-stopped
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    expose:
      - 9100
    labels:
      org.label-schema.group: "monitoring"

  # DOCKER CONTAINERS METRICS EXPORTER
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.1
    container_name: cadvisor
    hostname: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
    expose:
      - 8080
    labels:
      org.label-schema.group: "monitoring"

  # ALERT MANAGEMENT BY PROMETHEUS
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    hostname: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager.yml
      - ./alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager.yml'
      - '--storage.path=/alertmanager'
    expose:
      - 9093
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

alertmanager.yml

route:
  receiver: 'email'

receivers:
  - name: "ntfy"
    webhook_configs:
    - url: 'https://ntfy.example.com/alertmanager'
      send_resolved: true
      
  - name: 'email'
    email_configs:
    - to: 'whoever@example.com'
      from: 'alertmanager@example.com'
      smarthost: smtp-relay.sendinblue.com:587
      auth_username: '<registration_email@gmail.com>'
      auth_identity: '<registration_email@gmail.com>'
      auth_password: '<long ass generated SMTP key>'

alert.rules

groups:
  - name: host
    rules:
      - alert: DiskSpaceLow
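        # note: as written this fires whenever more than 19 bytes are free, so practically always;
        # handy for testing delivery, use a < comparison with a real threshold for actual monitoring
        expr: sum(node_filesystem_free_bytes{fstype="ext4"}) > 19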
        for: 10s
        labels:
          severity: critical
        annotations:
          description: "Diskspace is low!"

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nodeexporter'
    static_configs:
      - targets: ['nodeexporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets: 
      - 'alertmanager:9093'

rule_files:
  - '/etc/prometheus/rules/alert.rules'

Test - send a dummy alert to alertmanager:
curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"blabla"}}]' https://alert.example.com/api/v1/alerts

Reload rules:
curl -X POST http://admin:admin@<host-ip>:9090/-/reload
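
Before reloading, the rules file can be validated with promtool inside the container, so a typo does not leave Prometheus with a rejected config. A sketch with a made-up helper name reload_prometheus; the reload endpoint only works because --web.enable-lifecycle is set in the compose file.

```shell
# Validate alert.rules inside the container, then ask Prometheus to reload.
# reload_prometheus is a hypothetical helper; pass the base URL under which
# Prometheus is reachable.
reload_prometheus() {
  docker exec prometheus promtool check rules /etc/prometheus/rules/alert.rules || return 1
  curl --fail --silent --show-error --request POST "$1/-/reload"
}

# Usage: reload_prometheus https://prom.example.com
```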

Loki

Loki is made by the grafana team, and they call it Prometheus for logs.
It is a push type monitoring. Either an agent, promtail, pushes logs to a Loki instance, or loki-docker-driver is installed on the docker host and log pushing is set either globally in /etc/docker/daemon.json or per container in compose files.

What this example will set out to do is monitor logs of a minecraft server. Have some dashboards about logs and have an alert when a player joins.
It's just a proof of concept, as there are prometheus exporters for minecraft, so digging through logs to get an alert is less optimal.

What needs to be done:

New container - loki added to the compose file. Loki stores logs and makes them available for grafana to visualize.
New file - loki-docker-config.yml bind mounted in the loki container.
The file is all default, except for alertmanager url.
install loki-docker-driver
docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

add a logging section to the compose files of the containers that should be monitored, like so:

services:
  whoami:
    image: "containous/whoami"
    container_name: "whoami"
    hostname: "whoami"
    ports:
     - 88:80
    logging:
      driver: "loki"
      options:
        loki-url: "http://localhost:3100/loki/api/v1/push"

Promtail can also be used to push logs to loki; an older version of this readme has details - here.

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    user: root
    restart: unless-stopped
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=500h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert.rules:/etc/prometheus/rules/alert.rules
    expose:
      - 9090
    labels:
      org.label-schema.group: "monitoring"

  # WEB BASED UI VISUALISATION OF METRICS
  grafana:
    image: grafana/grafana:9.3.6
    container_name: grafana
    hostname: grafana
    user: root
    restart: unless-stopped
    env_file: .env
    volumes:
      - ./grafana_data:/var/lib/grafana
    expose:
      - 3000
    labels:
      org.label-schema.group: "monitoring"

  # HOST LINUX MACHINE METRICS EXPORTER
  nodeexporter:
    image: prom/node-exporter:v1.5.0
    container_name: nodeexporter
    hostname: nodeexporter
    restart: unless-stopped
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    expose:
      - 9100
    labels:
      org.label-schema.group: "monitoring"

  # DOCKER CONTAINERS METRICS EXPORTER
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.1
    container_name: cadvisor
    hostname: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
    expose:
      - 8080
    labels:
      org.label-schema.group: "monitoring"

  # ALERT MANAGEMENT BY PROMETHEUS
  alertmanager:
    image: prom/alertmanager:v0.25.0
    container_name: alertmanager
    hostname: alertmanager
    restart: unless-stopped
    volumes:
      - ./alertmanager.yml:/etc/alertmanager.yml
      - ./alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager.yml'
      - '--storage.path=/alertmanager'
    expose:
      - 9093
    labels:
      org.label-schema.group: "monitoring"

  # LOG MANAGEMENT WITH LOKI
  loki:
    image: grafana/loki:2.7.3
    container_name: loki
    hostname: loki
    user: root
    restart: unless-stopped
    volumes:
      - ./loki_data:/loki
      - ./loki-docker-config.yml:/etc/loki-docker-config.yml
    command:
      - '-config.file=/etc/loki-docker-config.yml'
    ports:
      - 3100:3100
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

loki-docker-config.yml

auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

analytics:
  reporting_enabled: false
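
To check that Loki accepts logs before wiring up any containers, a single test line can be sent to its push endpoint, /loki/api/v1/push. A sketch assuming port 3100 is published on localhost as in the compose file; loki_payload and push_test_log are made-up helper names and the job label test is arbitrary.

```shell
# Build a Loki push payload carrying a single log line.
# loki_payload <job-label> <timestamp-in-nanoseconds> <message>
# The message is inserted as-is, so it must not contain quotes or backslashes.
loki_payload() {
  printf '{"streams":[{"stream":{"job":"%s"},"values":[["%s","%s"]]}]}\n' "$1" "$2" "$3"
}

# push_test_log <message>; Loki expects the timestamp in nanoseconds,
# so nine zeros are appended to the unix time in seconds.
push_test_log() {
  loki_payload "test" "$(date +%s)000000000" "$1" |
    curl --fail --silent --show-error \
      --header 'Content-Type: application/json' \
      --data-binary @- \
      "http://localhost:3100/loki/api/v1/push"
}

# Usage: push_test_log "hello from curl"
```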

Logs in grafana

Now with the driver installed, config files in place, compose edited,..

In grafana, loki needs to be added as a datasource.
If everything works as it should, there should be no red notice at the bottom left, only a green one: Data source connected and labels found.
In Explore section, input set to Builder, picking from dropdown filter menu container_name = minecraft, and hitting run query.. this should result in seeing minecraft logs and their volume/time graph.
To recreate this Explore view as a dashboard, as a practice...
- new dashboard, panel
- datasource - Loki
- query - count_over_time({compose_service="minecraft"} |= `` [1m])
  switch from builder to code is needed to paste it
- Query options - Min interval=1m
- transformation - Rename by regex - (.*) - Logs
- time series graph
- Title - Logs volume
- transparent background
- legend off
- graph styles - bar
- graph styles - bar - point size = 10
- stack series - normal
- color scheme - single color

Alerts for Loki

Alerts section in grafana
Alert rules, new alert
- 1. Set a query and alert condition
  - A
  - now-5min to now
  - container_name=minecraft
  - line contains=joined the game
  - B
  - Reduce - because alerts can only work with single number
  - Function=Last
  - Input=A
  - Mode=Strict
  - C - the actual condition for alert
  - Input=B
  - is above 0
  - click - Make this the alert condition
- 2. Alert evaluation behavior
  - evaluate every 5m for 0s
  - Configure no data and error handling
  - Alert state if no data or all values are null=OK
- 3. Add details for your alert
  - Rule name=Minecraft-player-joined
  - Folder, add new, "Alerts"
  - Group, add new, "Docker"
- 4. Notifications
  - nothing
- Save and exit
Contact points
- New contact point
- Name = ntfy
- Contact point type = Webhook
- URL = https://ntfy.example.com/grafana
- Test; Save
Notification policies
- edit default
- Default contact point = ntfy

Caddy monitoring

Update

Manual image update:

docker-compose pull
docker-compose up -d
docker image prune

Backup and restore

Backup

Using borg, which makes a daily snapshot of the entire directory.

Restore

stop the prometheus containers: docker-compose down
delete the entire prometheus directory
copy the prometheus directory back from the backup
start the containers: docker-compose up -d
