Complete monitoring with Prometheus, Grafana and Loki: metrics, logs and Docker containers

Author: Rogelio Guerra Riverón
Building my own web infrastructure from scratch. Here I document each step: servers, networks, containers and everything that comes along.

A server without monitoring is a blind server. You don’t know when the disk fills up, which container is consuming too much RAM, or how many 404 responses your website is returning. This article documents how I configured the complete stack: Prometheus + Node Exporter + Grafana + Loki + Promtail.

The architecture

[Home server]
  ├── node-exporter            → system metrics (CPU, RAM, disk, network)
  ├── docker-stats-collector   → container metrics (textfile collector)
  ├── prometheus               → collects and stores metrics
  ├── loki                     → aggregates and stores logs
  ├── promtail                 → ships Nginx and syslog logs to Loki
  └── grafana                  → dashboards for all of the above

All services run in Docker, coordinated by the same docker-compose.yml.

System metrics: Node Exporter

Node Exporter exposes hardware and OS metrics. The trick: it has to run with network_mode: host to see the server’s actual network interfaces. If it runs on a Docker bridge network, it only sees the container’s own eth0 interface.

  node-exporter:
    image: prom/node-exporter:v1.8.2
    network_mode: host
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      - ./textfile-collector:/textfile:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - --web.listen-address=127.0.0.1:9100
      - --collector.textfile.directory=/textfile

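Node Exporter listens on 127.0.0.1:9100, and Prometheus scrapes it at 172.17.0.1:9100 (the host’s IP as seen from the Docker network). Note that a listener bound only to 127.0.0.1 does not accept connections addressed to 172.17.0.1; if the target shows as down in Prometheus, bind --web.listen-address to the bridge address (172.17.0.1:9100) instead.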
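A quick way to confirm the scrape path is to request the metrics endpoint from a container's point of view. The sketch below is an assumption, not part of the original setup: exporter_reachable is a hypothetical helper, and it assumes a wget that accepts -q, -O and -T (BusyBox and GNU wget both do).

```shell
#!/bin/sh
# exporter_reachable: hypothetical helper, not from the original article.
# Succeeds if the given URL serves at least one sample whose name starts
# with "node_", i.e. Node Exporter answered on that address.
exporter_reachable() {
    url="${1:-http://172.17.0.1:9100/metrics}"
    # -q: quiet, -O -: write the body to stdout, -T 5: five-second timeout.
    # The pipeline's status is grep's, so a failed or empty download
    # makes the function return non-zero.
    wget -q -O - -T 5 "$url" 2>/dev/null | grep -q '^node_'
}
```

The check only tells you something when it runs where Prometheus runs. For example, docker compose exec prometheus wget -q -O - -T 5 http://172.17.0.1:9100/metrics issues the same request from inside the Prometheus container, assuming its image ships BusyBox wget.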

Container metrics: docker stats + textfile collector

The problem with cAdvisor is that it doesn’t work with Docker 29 and the overlayfs storage driver on cgroup v2: it fails with “failed to identify read-write layer ID”.

The solution: a lightweight container that runs docker stats every 30 seconds and writes the result in Prometheus text format to a file that Node Exporter’s textfile collector reads.

#!/bin/sh
# docker_stats.sh (run with sh inside the collector container)
# Writes container metrics in Prometheus text format for the textfile collector.
OUTFILE="/textfile/docker_stats.prom"
TMPFILE="${OUTFILE}.tmp"

{
    echo "# HELP docker_container_cpu_percent CPU usage percentage per container"
    echo "# TYPE docker_container_cpu_percent gauge"
    # ... more definitions ...

    # One line per running container: name|cpu|mem|net
    docker stats --no-stream --format \
      '{{.Name}}|{{.CPUPerc}}|{{.MemUsage}}|{{.NetIO}}' 2>/dev/null | \
    while IFS='|' read -r name cpu mem net; do
        # Strip the percent sign and normalize a decimal comma to a dot
        cpu_val=$(echo "$cpu" | tr -d '%' | tr ',' '.')
        # ... unit conversion (sets mem_used_bytes) ...
        echo "docker_container_cpu_percent{name=\"${name}\"} ${cpu_val}"
        echo "docker_container_memory_bytes{name=\"${name}\"} ${mem_used_bytes}"
        echo "docker_container_running{name=\"${name}\"} 1"
    done

    # Stopped containers
    docker ps -a --filter "status=exited" --format '{{.Names}}' 2>/dev/null | \
    while read -r name; do
        echo "docker_container_running{name=\"${name}\"} 0"
    done

} > "$TMPFILE" && mv "$TMPFILE" "$OUTFILE"

Writing to a temporary file and then renaming it (tmp → final) prevents Node Exporter from reading a partially written file, since mv within the same directory is an atomic rename.
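The script above elides its unit conversion, so the original logic is not shown. As a minimal sketch of what that step has to do, the hypothetical helpers below turn a docker stats size such as "12.5MiB" into bytes. The names to_bytes and mem_used_bytes_of are assumptions made for illustration, and the sketch uses awk, which may differ from what the original script does.

```shell
#!/bin/sh
# to_bytes: hypothetical helper, not from the original script.
# Converts one size token as printed by `docker stats` (e.g. "12.5MiB",
# "1.2kB") into a whole number of bytes. Fails on an unknown unit.
to_bytes() {
    printf '%s\n' "$1" | LC_ALL=C awk '
    {
        gsub(/,/, ".")                          # tolerate a decimal comma
        num = $0;  sub(/[^0-9.].*$/, "", num)   # leading numeric part
        unit = $0; sub(/^[0-9.]+/, "", unit)    # trailing unit suffix
        mult["B"] = 1
        mult["kB"] = 1000;   mult["KB"] = 1000
        mult["MB"] = 1000000
        mult["GB"] = 1000000000
        mult["KiB"] = 1024
        mult["MiB"] = 1048576
        mult["GiB"] = 1073741824
        if (!(unit in mult)) exit 1             # unknown unit: report failure
        printf "%.0f\n", num * mult[unit]
    }'
}

# MemUsage looks like "12.5MiB / 1.944GiB": keep only the part before " / ".
mem_used_bytes_of() {
    to_bytes "${1%% / *}"
}
```

Inside the while loop, the elided step could then be written as mem_used_bytes=$(mem_used_bytes_of "$mem"). Memory values use binary units (KiB, MiB, GiB) while network I/O uses decimal ones (kB, MB, GB), which is why the table lists both.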

  docker-stats-collector:
    image: docker:27-cli
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./textfile-collector:/textfile
      - ./docker_stats.sh:/docker_stats.sh:ro
    entrypoint: sh -c "apk add --no-cache bc > /dev/null 2>&1; while true; do sh /docker_stats.sh; sleep 30; done"
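Because this loop runs unattended, a line with a missing value (for example an empty CPU reading) can end up in the file, and Node Exporter's textfile collector then reports a parse error through node_textfile_scrape_error instead of exposing the metrics. A minimal sketch of a pre-publish check follows. prom_file_ok is a hypothetical helper, not part of the original setup, and it accepts only a deliberately narrow subset of the exposition format (plain decimal values, label values without a closing brace).

```shell
#!/bin/sh
# prom_file_ok: hypothetical helper, not from the original article.
# Succeeds only if every non-comment, non-blank line of the file looks like
#   metric_name{label="value",...} <number>
prom_file_ok() {
    LC_ALL=C awk '
        /^#/ || /^[ \t]*$/ { next }     # skip HELP/TYPE comments and blanks
        !/^[a-zA-Z_:][a-zA-Z0-9_:]*(\{[^}]*\})? -?[0-9]+(\.[0-9]+)?$/ {
            printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { exit bad + 0 }            # 0 when every line passed
    ' "$1"
}
```

One way to use it is to gate the rename in docker_stats.sh, so that a bad run leaves the previous good file in place: } > "$TMPFILE" && prom_file_ok "$TMPFILE" && mv "$TMPFILE" "$OUTFILE".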

Prometheus: collect and retain

  prometheus:
    image: prom/prometheus:v2.51.2
    networks:
      - monitoring
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus-data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle

Scraping configuration:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: [localhost:9090]

  - job_name: node
    static_configs:
      - targets: [172.17.0.1:9100]
    relabel_configs:
      - target_label: host
        replacement: servidor-casa

172.17.0.1 is the host’s address on Docker’s default bridge network, which makes the host reachable from containers. Data is retained for 30 days (--storage.tsdb.retention.time=30d).

Logs: Loki + Promtail

Loki indexes only the labels attached to each log stream, not the full log content. Promtail collects the logs and ships them with labels such as job, host and filename.

  promtail:
    image: grafana/promtail:3.3.2
    user: root
    networks:
      - monitoring
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml:ro
      - ./promtail-data:/tmp/promtail
      - ~/infra/web/logs:/logs/nginx:ro
      - /var/log:/logs/host:ro

It needs to run as root to read /var/log.

Grafana: dashboards

Grafana connects to Prometheus and Loki as data sources. The most useful dashboards:

System (Node Exporter):

  • Total CPU and per-core
  • RAM used / free / cache
  • Disk: usage per partition, IOPS, throughput
  • Network: inbound/outbound traffic per interface

Containers (docker stats):

  • CPU % per container
  • RAM per container vs limit
  • State (running/stopped)
  • Network traffic per container

Logs (Loki):

  • Nginx logs in real-time
  • Requests by status code (200, 301, 404, 500)
  • Top IPs with most requests
  • Top most-accessed routes

Issue: [$__range] in Loki instant queries

When using “stat” or “piechart” panels with Loki, the [$__range] variable doesn’t resolve and Grafana returns “empty duration string”. The solution is to use a fixed duration:

# BAD (in stat/piechart panels):
sum by(status) (count_over_time({job="nginx"} | pattern ... [$__range]))

# GOOD:
sum by(status) (count_over_time({job="nginx"} | pattern ... [24h]))

“Time series” panels do support [$__interval] correctly.

Stack security

  • Prometheus and Loki have no external access — only on the internal monitoring network
  • Grafana is the only access point, protected with Traefik and Let’s Encrypt
  • GF_AUTH_ANONYMOUS_ENABLED=false and GF_USERS_ALLOW_SIGN_UP=false in Grafana
  • Node Exporter listens only on 127.0.0.1, not exposed on all interfaces

Result

With this stack you have complete visibility of the server: which processes consume resources, which containers fail, which requests your website receives and which errors it generates. Everything is available in dashboards at monitor.serviciosrogeliowar.com.

