
Prometheus + Grafana + Loki in docker

guide-by-example

logo

Purpose

Monitoring of a docker host and containers.

Monitoring in this case means gathering and showing information on how services or machines or containers are running.
Can be cpu, io, ram, disk use... can be number of http requests, errors, results of backups, or a world map showing location of IP addresses that access your services.
Prometheus deals with metrics. Loki deals with logs. Grafana is there to show the data on dashboards.

Most of the prometheus stuff here is based off the magnificent stefanprodan/dockprom.

Chapters

dashboards_pic

Overview

Good youtube overview of Prometheus.

Prometheus is an open source system for monitoring and alerting, written in golang.
It periodically collects metrics from configured targets, makes these metrics available for visualization, and can trigger alerts.
Prometheus is a relatively young project, and it is a pull type of monitoring.

Glossary.

  • Prometheus Server is the core of the system, responsible for
    • pulling new metrics
    • storing the metrics in a database and evaluating them
    • making metrics available through PromQL API
  • Targets - machines, services, applications that are monitored.
    These need to have an exporter.
    • exporter - a script or a service that gathers metrics on the target, converts them to prometheus server format, and exposes them at an endpoint so they can be pulled
  • Alertmanager - responsible for handling alerts from Prometheus Server, and sending notifications through email, slack, pushover, etc. In this setup an ntfy webhook will be used.
  • pushgateway - allows a push type of monitoring, meaning a machine anywhere in the world can push data into your prometheus. Should not be overused as it goes against the pull philosophy of prometheus.
  • Grafana - for web UI visualization of the collected metrics

prometheus components

Files and directory structure

/home/
 └── ~/
     └── docker/
         └── monitoring/
             ├── 🗁 grafana_data/
             ├── 🗁 prometheus_data/
             ├── 🗋 docker-compose.yml
             ├── 🗋 .env
             └── 🗋 prometheus.yml
  • grafana_data/ - a directory where grafana stores its data
  • prometheus_data/ - a directory where prometheus stores its database and data
  • .env - a file containing environment variables for docker compose
  • docker-compose.yml - a docker compose file, telling docker how to run the containers
  • prometheus.yml - a configuration file for prometheus

The three files must be provided.
The directories are created by docker compose on the first run.

docker-compose

  • Prometheus - The official image is used. A few extra commands pass configuration; of note is the 240 hours (10 days) retention policy.
  • Grafana - The official image is used. Bind mounted directory for persistent data storage. User is set as root, as it solves issues I am too lazy to investigate, likely caused by me editing some files as root.
  • NodeExporter - An exporter for linux machines, in this case gathering the metrics of the docker host, like uptime, cpu load, memory use, network bandwidth use, disk space,...
    Also bind mounts of some system directories to have access to the required info.
  • cAdvisor - An exporter gathering docker containers metrics, showing cpu, memory and network use of each container.
    Runs in privileged mode and has some bind mounts of system directories to have access to the required info.

Note - ports are only exposed, not published, since the expectation is the use of a reverse proxy and accessing the services by hostname, not by ip and port.

docker-compose.yml

services:

  # MONITORING SYSTEM AND THE METRICS DATABASE
  prometheus:
    image: prom/prometheus:v2.42.0
    container_name: prometheus
    hostname: prometheus
    user: root
    restart: unless-stopped
    depends_on:
      - cadvisor
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=240h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus_data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    expose:
      - "9090"
    labels:
      org.label-schema.group: "monitoring"

  # WEB BASED UI VISUALISATION OF METRICS
  grafana:
    image: grafana/grafana:9.4.3
    container_name: grafana
    hostname: grafana
    user: root
    restart: unless-stopped
    env_file: .env
    volumes:
      - ./grafana_data:/var/lib/grafana
    expose:
      - "3000"
    labels:
      org.label-schema.group: "monitoring"

  # HOST LINUX MACHINE METRICS EXPORTER
  nodeexporter:
    image: prom/node-exporter:v1.5.0
    container_name: nodeexporter
    hostname: nodeexporter
    restart: unless-stopped
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    expose:
      - "9100"
    labels:
      org.label-schema.group: "monitoring"

  # DOCKER CONTAINERS METRICS EXPORTER
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.1
    container_name: cadvisor
    hostname: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro #doesn't work on MacOS only for Linux
    expose:
      - "3000"
    labels:
      org.label-schema.group: "monitoring"

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

.env

# GENERAL
DOCKER_MY_NETWORK=caddy_net
TZ=Europe/Bratislava

# GRAFANA
GF_SECURITY_ADMIN_USER=admin
GF_SECURITY_ADMIN_PASSWORD=admin
GF_USERS_ALLOW_SIGN_UP=false
# GRAFANA EMAIL
GF_SMTP_ENABLED=true
GF_SMTP_HOST=smtp-relay.sendinblue.com:587
GF_SMTP_USER=example@gmail.com
GF_SMTP_PASSWORD=xzu0dfFhn3eqa

All containers must be on the same docker network, which is named in the .env file.
If one does not exist yet, create it: docker network create caddy_net
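
Once the network exists, bringing the stack up and checking that it runs could look like this - a minimal sketch assuming the files from this section:

docker-compose up -d                 # start prometheus, grafana, nodeexporter, cadvisor
docker-compose ps                    # all four containers should be Up
docker-compose logs -f prometheus    # watch prometheus logs for startup errors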

prometheus.yml

Official documentation.

Contains the bare minimum settings of targets from where metrics are to be pulled.

prometheus.yml

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'nodeexporter'
    static_configs:
      - targets: ['nodeexporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

Reverse proxy

Caddy v2 is used, details here.

Caddyfile

graf.{$MY_DOMAIN} {
  reverse_proxy grafana:3000
}

prom.{$MY_DOMAIN} {
  reverse_proxy prometheus:9090
}

First run and Grafana configuration

  • Login admin/admin to graf.example.com, change the password.
  • Add Prometheus as a Data source in Configuration
    Set URL to http://prometheus:9090
  • Import dashboards from json files in this repo

These dashboards are the preconfigured ones from stefanprodan/dockprom with a few changes.
The Docker host dashboard did not show free disk space for me; I had to change fstype from aufs to ext4. Also included is a fix for host network monitoring not showing traffic. In all of them the default time interval is set to 1h instead of 15m.

  • docker_host.json - dashboard showing linux docker host metrics
  • docker_containers.json - dashboard showing docker containers metrics, except the ones labeled as monitoring in the compose file
  • monitoring_services.json - dashboard showing docker containers metrics of containers that are labeled monitoring

interface-pic



PromQL basics

My understanding of this shit..

  • Prometheus stores metrics, each metric has a name, like cpu_temp.
  • the metric's values are stored as a time series - simple timestamped values
    [43 @1684608467][41 @1684608567][48 @1684608667].
  • A metric can also have labels [name="server-19", state="idle", city="Frankfurt"].
    These allow far better targeting of the data, or as they say - multidimensionality.

Queries to retrieve metrics.

  • cpu_temp - simple query will show values over whatever time period is selected in the interface.
  • cpu_temp{state="idle"} - will narrow down results by applying a label.
    cpu_temp{state="idle", name="server-19"} - multiple labels narrow down results.

A query can return various data types; a kinda tricky concept is the difference between:

  • instant vector - the query returns a single value with a single timestamp. It is simple and intuitive. All the above examples are instant vectors.
    Of note, there is no thinking about the time range here. That happens a few layers above; whether one picks last 1h or last 7 days plays no role here. This is a query response datatype, and it is still an instant vector - a single value at a point in time.

  • range vector - returns multiple values with a single timestamp.
    This is needed by some query functions, but on its own it is useless.
    A useless example would be cpu_temp[10m]. This query first looks at the latest timestamp, then takes all data points within the previous 10m before that one timestamp, and returns all those values. This collection has a single timestamp.
    This functionality allows the use of various functions that can do complex tasks.
    An actually useful example of a range vector would be changes(cpu_temp[10m]), where the function changes() takes that range vector, looks at those 10 min of data, and returns a single value telling how many times the value of that metric changed in those 10 min.
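
To make the difference concrete, a few hedged query examples - cpu_temp is the made-up metric from above, node_network_receive_bytes_total is a real counter provided by nodeexporter:

cpu_temp                                      # instant vector - the latest value per series
cpu_temp{state="idle", name="server-19"}      # instant vector, narrowed down by labels
cpu_temp[10m]                                 # range vector - raw samples of the last 10 minutes, not graphable on its own
changes(cpu_temp[10m])                        # instant vector again - how many times the value changed in those 10 minutes
rate(node_network_receive_bytes_total[5m])    # typical real world pattern - per second average over the last 5 minutes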

Links



Pushgateway

Gives freedom to push information into prometheus from anywhere.

Be aware that it should not be abused to turn prometheus into push type monitoring. It is only intended for specific situations.

The setup

To add pushgateway functionality to the current stack:

  • New container pushgateway added to the compose file.

    docker-compose.yml
    services:
    
      # PUSHGATEWAY FOR PROMETHEUS
      pushgateway:
        image: prom/pushgateway:v1.5.1
        container_name: pushgateway
        hostname: pushgateway
        restart: unless-stopped
        command:
          - '--web.enable-admin-api'
        expose:
          - "9091"
    
    networks:
      default:
        name: $DOCKER_MY_NETWORK
        external: true
    
  • Adding pushgateway to the Caddyfile of the reverse proxy so that it can be reached at https://push.example.com

    Caddyfile
    push.{$MY_DOMAIN} {
        reverse_proxy pushgateway:9091
    }
    
  • Adding pushgateway as a scrape point to prometheus.yml.
    Of note is honor_labels set to true, which makes sure that conflicting labels, like job, set during the push are kept over the labels set in prometheus.yml for the scrape job. Docs.

    prometheus.yml
    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: 'pushgateway-scrape'
        scrape_interval: 60s
        honor_labels: true
        static_configs:
          - targets: ['pushgateway:9091']
    

The basics

push-web

To test pushing some metric, execute in linux:

  • echo "some_metric 3.14" | curl --data-binary @- https://push.example.com/metrics/job/blabla/instance/whatever
  • Visit push.example.com and see the metric there.
  • In Grafana > Explore > query for some_metric and see its value there.

In that command you can see the metric itself: some_metric and its value: 3.14.
But there are also labels being set as part of the url. One label named job is required, but after that it's whatever you want. They just need to be in pairs - label name and label value.

The metrics sit on the pushgateway forever, unless deleted or the container shuts down. Prometheus will not remove the metrics after scraping; it will keep scraping the pushgateway every X seconds and store whatever value sits there, with the timestamp of the scrape.

To wipe the pushgateway clean
curl -X PUT https://push.example.com/api/v1/admin/wipe
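
To delete just one metrics group instead of wiping everything, target the same grouping labels that were used during the push - a sketch using the job and instance from the example above:
curl -X DELETE https://push.example.com/metrics/job/blabla/instance/whatever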

The real world use

Linked above is a guide-by-example with more info on pushgateway setup.
A real world use is monitoring backups, along with pushing metrics from windows in powershell.

veeam-dash



Alertmanager

To send a notification when some metric breaches some preset condition.
Notification channels used here are email and ntfy.

Note
I myself am not planning on using alertmanager. Grafana can do alerts for both logs and metrics.

alert

The setup

To add alertmanager to the current stack:

  • New file - alertmanager.yml to be bind mounted in alertmanager container.
    This is the configuration on how and where to deliver alerts.
    Correct smtp or ntfy info needs to be filled out.

    alertmanager.yml
    route:
      receiver: 'email'
    
    receivers:
      - name: 'ntfy'
        webhook_configs:
        - url: 'https://ntfy.example.com/alertmanager'
          send_resolved: true
    
      - name: 'email'
        email_configs:
        - to: 'whoever@example.com'
          from: 'alertmanager@example.com'
          smarthost: smtp-relay.sendinblue.com:587
          auth_username: '<registration_email@gmail.com>'
          auth_identity: '<registration_email@gmail.com>'
          auth_password: '<long ass generated SMTP key>'
    
  • New file - alert.rules to be bind mounted in to prometheus container
    This file defines at what value a metric becomes an alert event.

    alert.rules
    groups:
      - name: host
        rules:
          - alert: DiskSpaceLow
            expr: sum(node_filesystem_free_bytes{fstype="ext4"}) > 19
            for: 10s
            labels:
              severity: critical
            annotations:
              description: "Diskspace is low!"
    
  • Changed prometheus.yml. Added an alerting section that points to the alertmanager container, and also set the path to a rules file.

    prometheus.yml
    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: 'nodeexporter'
        static_configs:
          - targets: ['nodeexporter:9100']
    
      - job_name: 'cadvisor'
        static_configs:
          - targets: ['cadvisor:8080']
    
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
    
    alerting:
      alertmanagers:
      - scheme: http
        static_configs:
        - targets: 
          - 'alertmanager:9093'
    
    rule_files:
      - '/etc/prometheus/rules/alert.rules'
    
  • New container - alertmanager added to the compose file, and the prometheus container gets a bind mount of the rules file.

    docker-compose.yml
    services:
    
      # MONITORING SYSTEM AND THE METRICS DATABASE
      prometheus:
        image: prom/prometheus:v2.42.0
        container_name: prometheus
        hostname: prometheus
        user: root
        restart: unless-stopped
        depends_on:
          - cadvisor
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--storage.tsdb.retention.time=240h'
          - '--web.enable-lifecycle'
        volumes:
          - ./prometheus_data:/prometheus
          - ./prometheus.yml:/etc/prometheus/prometheus.yml
          - ./alert.rules:/etc/prometheus/rules/alert.rules
        expose:
          - "9090"
        labels:
          org.label-schema.group: "monitoring"
    
      # ALERT MANAGEMENT FOR PROMETHEUS
      alertmanager:
        image: prom/alertmanager:v0.25.0
        container_name: alertmanager
        hostname: alertmanager
        user: root
        restart: unless-stopped
        volumes:
          - ./alertmanager.yml:/etc/alertmanager.yml
          - ./alertmanager_data:/alertmanager
        command:
          - '--config.file=/etc/alertmanager.yml'
          - '--storage.path=/alertmanager'
        expose:
          - "9093"
        labels:
          org.label-schema.group: "monitoring"
    
    networks:
      default:
        name: $DOCKER_MY_NETWORK
        external: true
    
  • Adding alertmanager to the Caddyfile of the reverse proxy so that it can be reached at https://alert.example.com. Not necessary, but useful as it allows sending alerts from anywhere, not just from prometheus or other containers on the same docker network.

    Caddyfile
    alert.{$MY_DOMAIN} {
        reverse_proxy alertmanager:9093
    }
    

The basics

alert

Once the above setup is done, an alert about low disk space should fire and a notification email should arrive.
In alertmanager.yml a switch from email to ntfy can be done.

Useful

  • alert from anywhere using curl:
    curl -H 'Content-Type: application/json' -d '[{"labels":{"alertname":"blabla"}}]' https://alert.example.com/api/v1/alerts
  • reload rules:
    curl -X POST https://prom.example.com/-/reload

stefanprodan/dockprom has a more detailed section on alerting that is worth checking out.



Loki-logo

Loki

Loki is a log aggregation tool, made by the grafana team. Sometimes called a Prometheus for logs, it is push type monitoring.

It uses LogQL for queries, which is similar to PromQL in its use of labels.
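
A few hedged LogQL examples, assuming a job label like the caddy_access_log one configured later in this guide:

{job="caddy_access_log"}                          # all log lines of that job
{job="caddy_access_log"} |= "error"               # only lines containing the string "error"
{job="caddy_access_log"} |= "" | json             # parse json log lines into labels on the fly
count_over_time({job="caddy_access_log"}[5m])     # turn logs into a metric - number of lines per 5 minutes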

The official documentation overview

There are two ways to push logs to Loki from a docker container.

  • Loki-docker-driver installed on the docker host, with log pushing set either globally in /etc/docker/daemon.json or per container in compose files (examples of both are shown below).
    It's the simpler, easier way, but it lacks fine control over the logs being pushed.
  • Promtail deployed as another container, with a bind mount of the logs it should scrape and a bind mount of its config file. This config file is very powerful, giving a lot of control over how logs are processed and pushed.
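
For the global option, a minimal /etc/docker/daemon.json sketch - it assumes the loki-docker-driver plugin is already installed (see the setup below), and docker needs a restart afterwards to pick it up:

/etc/docker/daemon.json
{
  "log-driver": "loki",
  "log-opts": {
    "loki-url": "http://localhost:3100/loki/api/v1/push"
  }
}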

loki_arch

Loki setup

  • New container - loki added to the compose file.
    Note that port 3100 is actually mapped to the host, allowing localhost:3100 from the driver to work.

    docker-compose.yml
    services:
    
      # LOG MANAGEMENT WITH LOKI
      loki:
        image: grafana/loki:main-0295fd4
        container_name: loki
        hostname: loki
        user: root
        restart: unless-stopped
        volumes:
          - ./loki_data:/loki
          - ./loki-config.yml:/etc/loki-config.yml
        command:
          - '-config.file=/etc/loki-config.yml'
        ports:
          - "3100:3100"
        labels:
          org.label-schema.group: "monitoring"
    
    networks:
      default:
        name: $DOCKER_MY_NETWORK
        external: true
    
  • New file - loki-config.yml bind mounted in the loki container.
    The config here comes from the official example with some changes.

    • URL changed for this setup.
    • Compactor section is added, to have control over data retention.
    • Fixing error - "too many outstanding requests", discussion here.
      It turns off parallelism, both split by time interval and shards split.
    loki-config.yml
    auth_enabled: false
    
    server:
      http_listen_port: 3100
    
    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
          rules_directory: /loki/rules
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory
    
    # --- disable splitting to fix "too many outstanding requests"
    
    query_range:
      parallelise_shardable_queries: false
    
    # ---  compactor to have control over length of data retention
    
    compactor:
      working_directory: /loki/compactor
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
    
    limits_config:
      retention_period: 240h
      split_queries_by_interval: 0  # part of disable splitting fix
    
    # -------------------------------------------------------
    
    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h
    
    ruler:
      alertmanager_url: http://alertmanager:9093
    
    analytics:
      reporting_enabled: false
    
  • loki-docker-driver

    • Install loki-docker-driver
      docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions
      To check if it's installed and enabled: docker plugin ls

    • Containers that should be monitored using loki-docker-driver need a logging section in their compose file.

      docker-compose.yml
      services:
      
        whoami:
          image: "containous/whoami"
          container_name: "whoami"
          hostname: "whoami"
          logging:
            driver: "loki"
            options:
              loki-url: "http://localhost:3100/loki/api/v1/push"
      
  • promtail

    • Containers that should be monitored with promtail need it added to their compose file, making sure it has access to the log files.

      minecraft-docker-compose.yml
      services:
      
        minecraft:
          image: itzg/minecraft-server
          container_name: minecraft
          hostname: minecraft
          restart: unless-stopped
          env_file: .env
          tty: true
          stdin_open: true
          ports:
            - 25565:25565     # minecraft server players connect
          volumes:
            - ./minecraft_data:/data
      
        # LOG AGENT PUSHING LOGS TO LOKI
        promtail:
          image: grafana/promtail
          container_name: minecraft-promtail
          hostname: minecraft-promtail
          restart: unless-stopped
          volumes:
            - ./minecraft_data/logs:/var/log/minecraft:ro
            - ./promtail-config.yml:/etc/promtail-config.yml
          command:
            - '-config.file=/etc/promtail-config.yml'
      
      networks:
        default:
          name: $DOCKER_MY_NETWORK
          external: true
      
      caddy-docker-compose.yml
      services:
      
        caddy:
          image: caddy
          container_name: caddy
          hostname: caddy
          restart: unless-stopped
          env_file: .env
          ports:
            - "80:80"
            - "443:443"
            - "443:443/udp"
          volumes:
            - ./Caddyfile:/etc/caddy/Caddyfile
            - ./caddy_config:/data
            - ./caddy_data:/config
            - ./caddy_logs:/var/log/caddy
      
        # LOG AGENT PUSHING LOGS TO LOKI
        promtail:
          image: grafana/promtail
          container_name: caddy-promtail
          hostname: caddy-promtail
          restart: unless-stopped
          volumes:
            - ./caddy_logs:/var/log/caddy:ro
            - ./promtail-config.yml:/etc/promtail-config.yml
          command:
            - '-config.file=/etc/promtail-config.yml'
      
      networks:
        default:
          name: $DOCKER_MY_NETWORK
          external: true
      
      
    • Generic config file for promtail; it needs to be bind mounted.

      promtail-config.yml
      clients:
        - url: http://loki:3100/loki/api/v1/push
      
      scrape_configs:
        - job_name: blablabla
          static_configs:
            - targets:
                - localhost
              labels:
                job: blablabla_log
                __path__: /var/log/blablabla/*.log
      

First Loki use in Grafana

  • In grafana, loki needs to be added as a datasource, http://loki:3100
  • In Explore section, switch to Loki as source
    • if loki-docker-driver then filter by container_name or compose_project
    • if promtail then filter by job name set in promtail config in the labels section

If all was set correctly logs should be visible in Grafana.

query

Minecraft Loki example

What can be seen in this example:

  • How to monitor logs of a docker container, a minecraft server.
  • How to visualize the logs in a dashboard.
  • How to set an alert when a specific pattern appears in the logs.
  • How to extract information from log to include it in the alert notification.
  • Basics of grafana alert templates, so that notifications actually look good, and show only relevant info.

Requirements - grafana, loki, minecraft.

logo-minecraft

The objective and overview

The main objective is to get an alert when a player joins the server.
The secondary one is to have a place where recent happenings on the server can be seen.

Initially loki-docker-driver was used to get logs to Loki, and it was simple and worked nicely. But during the alert stage I could not figure out how to extract a string from the logs and include it in the alert notification. Specifically, to not just say that "a player joined", but to have the name of the player that joined in there.
Switching to promtail solved this, with the use of its pipeline_stages, which was surprisingly simple and elegant.

The Setup

A Promtail container is added to minecraft's compose, with bind mount access to minecraft's logs.

minecraft-docker-compose.yml
services:

  minecraft:
    image: itzg/minecraft-server
    container_name: minecraft
    hostname: minecraft
    restart: unless-stopped
    env_file: .env
    tty: true
    stdin_open: true
    ports:
      - 25565:25565     # minecraft server players connect
    volumes:
      - ./minecraft_data:/data

  # LOG AGENT PUSHING LOGS TO LOKI
  promtail:
    image: grafana/promtail
    container_name: minecraft-promtail
    hostname: minecraft-promtail
    restart: unless-stopped
    volumes:
      - ./minecraft_data/logs:/var/log/minecraft:ro
      - ./promtail-config.yml:/etc/promtail-config.yml
    command:
      - '-config.file=/etc/promtail-config.yml'

networks:
  default:
    name: $DOCKER_MY_NETWORK
    external: true

Promtail's config is similar to the generic config in the previous section.
The only addition is a short pipeline stage with a regex that runs against every log line before sending it to Loki. When a line matches, a label player is added to that log line. The value of that label comes from the named capture group that's part of the regex; the syntax is: (?P<name>group)
This label will be easy to use later in the alert stage.

promtail-config.yml
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: minecraft
    static_configs:
      - targets:
          - localhost
        labels:
          job: minecraft_logs
          __path__: /var/log/minecraft/*.log
    pipeline_stages:
      - regex:
          expression: .*:\s(?P<player>.*)\sjoined the game$
      - labels:
          player:

Here's a regex101 of it, with some data to show how it works and a bit of explanation.
Here's the stackoverflow answer that is the source for that config.
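
Once the player label reaches Loki, it can be used directly in queries. A hedged example for a dashboard panel, counting joins per player over the selected time range - it assumes the job and label names from the config above:

sum by (player) (count_over_time({job="minecraft_logs", player!=""} [$__range]))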

regex

In Grafana

  • If Loki is not yet added, it needs to be added as a datasource, http://loki:3100
  • In the Explore section, filter by job = minecraft_logs, hit the Run query button... this should result in seeing minecraft logs and their volume/time graph.

This Explore view will be recreated as a dashboard.

Dashboard for minecraft logs

dashboard-minecraft

  • New dashboard, new panel
    • Graph type - Time series
    • Data source - Loki
    • Switch from builder to code
    • query - count_over_time({job="minecraft_logs"} |= `` [1m])
    • Query options - Min interval=1m
    • Transform - Rename by regex Match - (.*) Replace - Logs
    • Title - Logs volume
    • Transparent background
    • Legend off
    • Graph styles - bar
    • Fill opacity - 50
    • Color scheme - single color
    • Save
  • Add another panel
    • Graph type - Logs
    • Data source - Loki
    • Switch from builder to code
      query - {job="minecraft_logs"} |= ""
    • Title - empty
    • Deduplication - Signature or Exact
    • Save

This should create a similar dashboard to the one in the picture above.

Performance tips for grafana loki queries

Alerts in Grafana for Loki

alert-labels

When a player joins the minecraft server, a log line appears: "Bastard joined the game".
An alert will be set to detect the string "joined the game" and send a notification when it occurs.

Now might be a good time to brush up on PromQL / LogQL and the data types they return when a query happens - that instant vector and range vector thingie - as grafana will scream when using a range vector.

Create alert rule

  • 1 Set an alert rule name
    • Rule name = Minecraft-player-joined-alert
  • 2 Set a query and alert condition
    • A - Switch to Loki; set Last 5 minutes
      • switch from builder to code
      • count_over_time({job="minecraft_logs"} |= "joined the game" [5m])
    • B - Reduce
      • Function = Last
      • Input = A
      • Mode = Strict
    • C - Threshold
      • Input = B
      • is above 0
      • Make this the alert condition
  • 3 Alert evaluation behavior
    • Folder = "Alerts"
    • Evaluation group (interval) = "five-min"
    • Evaluation interval = 5m
    • For 0s
    • Configure no data and error handling
      • Alert state if no data or all values are null = OK
  • 4 Add details for your alert rule
    • Here is where the label player that was set in promtail is used
      Summary = {{ $labels.player }} joined the Minecraft server.
    • Can also pass values from expressions by targeting A/B/C/.. from step2
      Description = Number of players that joined in the last 5 min: {{ $values.B }}
  • 5 Notifications
    • nothing
  • Save and exit

Contact points

  • New contact point
  • Name = ntfy
  • Integration = Webhook
  • URL = https://ntfy.example.com/grafana
    or if grafana-to-ntfy is already setup then http://grafana-to-ntfy:8080
    but also credentials need to be set.
  • Title = {{ .CommonAnnotations.summary }}
  • Message = I put in an empty space unicode character
  • Disable resolved message = check
  • Test
  • Save

Notification policies

  • Edit default
  • Default contact point = ntfy
  • Save

After all this, there should be notification coming when a player joins.

grafana-to-ntfy

For alerts one can use ntfy, but on their own, alerts from grafana are just plain text json.
Here's how to set up grafana-to-ntfy, to make alerts look good.

ntfy



Templates

Not really used here, but they are a pain, so here's some info, as it took embarrassingly long to find that {{ .CommonAnnotations.summary }} for the title.

  • Testing should be done in the contact point when editing; there is a useful Test button that allows you to send alerts with custom values.
  • To define a template.
  • To call a template.
  • My big mistake when playing with this was missing a dot.
    In Contact point, in Title/Message input box.
    • correct one - {{ template "test" . }}
    • the one I had - {{ template "test" }}
  • So yeah, the dot is important here. It represents the data and context passed to a template. It can represent the global context, or when used inside {{ range }} it represents the iteration loop value.
  • This json structure is what an alert looks like. Notice alerts being an array and commonAnnotations being an object. For an array there's a need to loop over it to get access to the values in it. For objects one just needs to target the name from the global context, using a dot at the beginning.
  • To iterate over alerts array.
  • To just access a value - {{ .CommonAnnotations.summary }}
  • Then there are conditional things one can do in golang templates, but I am not going to dig that deep...
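
To tie it together, a small sketch of a template definition and its call - test is just an arbitrary template name, and the fields used are the ones mentioned above:

{{ define "test" }}
{{ .CommonAnnotations.summary }}
{{ range .Alerts }}{{ .Labels.alertname }} - {{ .Status }}
{{ end }}{{ end }}

Then in a Contact point's Title or Message field, call it with the dot:

{{ template "test" . }}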

Templates resources



Caddy reverse proxy monitoring

What can be seen in this example:

  • Use of Prometheus to monitor a docker container - caddy.
  • How to import a dashboard to grafana.
  • Use of Loki to monitor logs of a docker container.
  • How to set Promtail to push only certain values and label logs.
  • How to use geoip part of Promtail.
  • How to create dashboard in grafana from data in Loki.

Requirements - grafana, prometheus, loki, caddy.

logo-caddy

A reverse proxy is kinda the linchpin of a selfhosted setup, as it is in charge of all the http/https traffic that goes in. So a focus on monitoring this keystone makes sense.

Caddy - Metrics - Prometheus

logo

Caddy has a built-in exporter of metrics for prometheus, so all that is needed is enabling it, scraping it with prometheus, and importing a dashboard.

  • Edit Caddyfile to enable metrics.

    Caddyfile
    {
        servers {
            metrics
        }
    
        admin 0.0.0.0:2019
    }
    
    
    a.{$MY_DOMAIN} {
        reverse_proxy whoami:80
    }
    
  • Edit the compose file to publish port 2019.
    Likely not necessary if Caddy and Prometheus are on the same docker network, but it's nice to check if the metrics export works at <docker-host-ip>:2019/metrics

    docker-compose.yml
    services:
    
      caddy:
        image: caddy
        container_name: caddy
        hostname: caddy
        restart: unless-stopped
        env_file: .env
        ports:
          - "80:80"
          - "443:443"
          - "443:443/udp"
          - "2019:2019"
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile
          - ./caddy_config:/data
          - ./caddy_data:/config
    
    networks:
      default:
        name: $DOCKER_MY_NETWORK
        external: true
    
  • Edit prometheus.yml to add caddy scraping point

    prometheus.yml
    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    
    scrape_configs:
      - job_name: 'caddy'
        static_configs:
          - targets: ['caddy:2019']
    
  • In grafana import caddy dashboard

But these metrics are about performance and load put on Caddy, which in a selfhosted environment will likely be minimal and not very interesting.
To get more intriguing info - who, when, from where connects to what service - monitoring of the access logs is needed.



Caddy - Logs - Loki

logs_dash

Loki itself just stores the logs. To get them to Loki a Promtail container is used that has access to caddy's logs. Its job is to scrape them regularly, maybe process them in some way, and then push them to Loki.
Once there, a basic grafana dashboard can be made.

The setup

  • Have Grafana, Loki, Caddy working

  • Edit the Caddy compose file, bind mounting /var/log/caddy.
    Also add a Promtail container to the compose, with the same logs bind mount, along with a bind mount of its config file.
    Promtail will scrape the logs it now has access to and push them to Loki.

    docker-compose.yml
    services:
    
      caddy:
        image: caddy
        container_name: caddy
        hostname: caddy
        restart: unless-stopped
        env_file: .env
        ports:
          - "80:80"
          - "443:443"
          - "443:443/udp"
          - "2019:2019"
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile
          - ./caddy_data:/data
          - ./caddy_config:/config
          - ./caddy_logs:/var/log/caddy
    
      # LOG AGENT PUSHING LOGS TO LOKI
      promtail:
        image: grafana/promtail
        container_name: caddy-promtail
        hostname: caddy-promtail
        restart: unless-stopped
        volumes:
          - ./promtail-config.yml:/etc/promtail-config.yml
          - ./caddy_logs:/var/log/caddy:ro
        command:
          - '-config.file=/etc/promtail-config.yml'
    
    networks:
      default:
        name: $DOCKER_MY_NETWORK
        external: true
    
    promtail-config.yml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    
    scrape_configs:
      - job_name: caddy_access_log
        static_configs:
          - targets:
              - localhost
            labels:
              job: caddy_access_log
              host: example.com
              agent: caddy-promtail
              __path__: /var/log/caddy/*.log
    
  • Promtail scrapes the logs one line at a time and is able to do neat things with each line before sending it - add labels, ignore some lines, only send some values,...
    Pipelines are used for this. Below is an example of extracting just a single value - an IP address - and using it in a template that gets sent to Loki, and nothing else. Here's some more to read on this.

    promtail-config.yml customizing fields
    clients:
      - url: http://loki:3100/loki/api/v1/push
    
    scrape_configs:
      - job_name: caddy_access_log
    
        static_configs:
          - targets:
              - localhost
            labels:
              job: caddy_access_log
              host: example.com
              agent: caddy-promtail
              __path__: /var/log/caddy/*.log
    
        pipeline_stages:
          - json:
              expressions:
                request_remote_ip: request.remote_ip
          - template:
              source: output  # creates empty output variable
              template: '{"remote_ip": {{.request_remote_ip}}}'
          - output:
              source: output
    
    
  • Edit the Caddyfile to enable access logs. Unfortunately this can't be enabled globally, so the easiest way seems to be to create a logging snippet called log_common and copy paste the import line into every site block.

    Caddyfile
    (log_common) {
      log {
        output file /var/log/caddy/caddy_access.log
      }
    }
    
    ntfy.example.com {
      import log_common
      reverse_proxy ntfy:80
    }
    
    mealie.{$MY_DOMAIN} {
      import log_common
      reverse_proxy mealie:80
    }
    
  • At this point logs should be visible and explorable in grafana.
    Explore > {job="caddy_access_log"} |= "" | json
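
With the json parser in place, queries can also filter on the parsed fields. A hedged example that shows only requests that ended in an error status - it assumes the status field name from caddy's default json access log:

{job="caddy_access_log"} |= "" | json | status >= 400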

Geoip

geoip_info

Promtail recently got a geoip stage. One can feed it an IP address and an mmdb geoip database, and it adds geoip labels to the log entry.

The official documentation.

  • Register a free account on maxmind.com.

  • Download one of the mmdb format databases

    • GeoLite2 City - 70MB full geoip info - city, postal code, time zone, latitude/longitude,..
    • GeoLite2 Country - 6MB, just country and continent
  • Bind mount whichever database in to promtail container.

    docker-compose.yml
    services:
    
      caddy:
        image: caddy
        container_name: caddy
        hostname: caddy
        restart: unless-stopped
        env_file: .env
        ports:
          - "80:80"
          - "443:443"
          - "443:443/udp"
          - "2019:2019"
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile
          - ./caddy_data:/data
          - ./caddy_config:/config
          - ./caddy_logs:/var/log/caddy
    
      # LOG AGENT PUSHING LOGS TO LOKI
      promtail:
        image: grafana/promtail
        container_name: caddy-promtail
        hostname: caddy-promtail
        restart: unless-stopped
        volumes:
          - ./promtail-config.yml:/etc/promtail-config.yml
          - ./caddy_logs:/var/log/caddy:ro
          - ./GeoLite2-City.mmdb:/etc/GeoLite2-City.mmdb:ro
        command:
          - '-config.file=/etc/promtail-config.yml'
    
    networks:
      default:
        name: $DOCKER_MY_NETWORK
        external: true
    
  • In the promtail config, a json stage is added where the IP address is loaded into a variable called remote_ip, which is then used in the geoip stage. If all is set correctly, the geoip labels are automatically added to the log entry.

    geoip promtail-config.yml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    
    scrape_configs:
      - job_name: caddy_access_log
    
        static_configs:
          - targets:
              - localhost
            labels:
              job: caddy_access_log
              host: example.com
              agent: caddy-promtail
              __path__: /var/log/caddy/*.log
    
        pipeline_stages:
          - json:
              expressions:
                remote_ip: request.remote_ip
    
          - geoip:
              db: "/etc/GeoLite2-City.mmdb"
              source: remote_ip
              db_type: "city"
    

Can be tested with Opera's built-in VPN, or with some online site tester.
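
The new labels behave like any other Loki labels and can be used in queries. A hedged example showing hits per country over the selected time range - it assumes the geoip_country_name label name that the geoip stage produces:

sum by (geoip_country_name) (count_over_time({job="caddy_access_log"} [$__range]))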

Dashboard

panel1

  • new panel, will be time series graph showing Subdomains hits timeline

    • Graph type = Time series
    • Data source = Loki
    • switch from builder to code
      sum(count_over_time({job="caddy_access_log"} |= "" | json [1m])) by (request_host)
    • Query options > Min interval = 1m
    • Transform > Rename by regex
      • Match = \{request_host="(.*)"\}
      • Replace = $1
    • Title = "Subdomains hits timeline"
    • Transparent
    • Tooltip mode = All
    • Tooltip values sort order = Descending
    • Legend Placement = Right
    • Value = Total
    • Graph style = Bars
    • Fill opacity = 50

panel2

  • Add another panel, will be a pie chart, showing subdomains divide

    • Graph type = Pie chart
    • Data source = Loki
    • switch from builder to code
      sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_host)
    • Query options > Min interval = 1m
    • Transform > Rename by regex
      • Match = \{request_host="(.*)"\}
      • Replace = $1
    • Title = "Subdomains divide"
    • Transparent
    • Legend Placement = Right
    • Value = Last

panel3

  • Add another panel, will be a Geomap, showing location of machine accessing Caddy

    • Graph type = Geomap
    • Data source = Loki
    • switch from builder to code
      {job="caddy_access_log"} |= "" | json
    • Query options > Min interval = 1m
    • Transform > Extract fields
      • Source = labels
      • Format = JSON
        1. Field = geoip_location_latitude; Alias = latitude
        2. Field = geoip_location_longitude; Alias = longitude
    • Title = "Geomap"
    • Transparent
    • Map view > View > Drag and zoom around > Use current map setting
  • Add another panel, will be a pie chart, showing IPs that hit the most

    • Graph type = Pie chart
    • Data source = Loki
    • switch from builder to code
      sum(count_over_time({job="caddy_access_log"} |= "" | json [$__range])) by (request_remote_ip)
    • Query options > Min interval = 1m
    • Transform > Rename by regex
      • Match = \{request_remote_ip="(.*)"\}
      • Replace = $1
    • Title = "IPs by number of requests"
    • Transparent
    • Legend Placement = Right
    • Value = Last or Total
  • Add another panel, this will be actual log view

    • Graph type - Logs
    • Data source - Loki
    • Switch from builder to code
    • query - {job="caddy_access_log"} |= "" | json
    • Title - empty
    • Deduplication - Exact or Signature
    • Save

panel3

Update

Manual image update:

  • docker-compose pull
  • docker-compose up -d
  • docker image prune

Backup and restore

Backup

Using borg, which makes a daily snapshot of the entire directory.
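
A minimal sketch of what such a daily snapshot could look like - it assumes a borg repository already initialized at /mnt/backup/borg_repo, and both paths are placeholders:

borg create --stats --compression zstd \
    /mnt/backup/borg_repo::"monitoring_{now:%Y-%m-%d}" \
    ~/docker/monitoring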

Restore

  • down the containers docker-compose down
  • delete the entire monitoring directory
  • from the backup copy back the monitoring directory
  • start the containers docker-compose up -d