Container Monitoring¶
Monitor Docker containers with Prometheus for metrics, Grafana for dashboards, and Loki for logs.
Overview¶
This stack provides:
- Prometheus - Metrics collection and alerting
- Grafana - Visualization and dashboards
- Node Exporter - Host system metrics
- cAdvisor - Container metrics
- Loki - Log aggregation
- Alertmanager - Alert routing
Quick Start¶
Basic Monitoring Stack¶
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    depends_on:
      - prometheus
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration¶
Create prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
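Once Prometheus is running, its HTTP API reports whether each scrape target listed in prometheus.yml is reachable. The sketch below filters the JSON returned by GET /api/v1/targets for targets that are not up. The helper name unhealthy_targets and the sample payload are illustrative assumptions, not part of Prometheus; the function works on an already-decoded dictionary so it runs without a live server.

```python
def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every active target that is not up.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t["labels"].get("job"), t["labels"].get("instance"), t.get("lastError", ""))
        for t in targets
        if t.get("health") != "up"
    ]


# Sample payload shaped like a /api/v1/targets response (values are illustrative).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "cadvisor", "instance": "cadvisor:8080"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

for job, instance, error in unhealthy_targets(sample):
    print(f"{job} ({instance}) is down: {error}")
```

Against a live stack, the payload could be fetched with urllib.request.urlopen("http://localhost:9090/api/v1/targets") and decoded with json.load before being passed to the helper.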
Node Exporter¶
Collect host system metrics.
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
Available Metrics¶
| Metric | Description |
|---|---|
| node_cpu_seconds_total | CPU usage |
| node_memory_MemTotal_bytes | Total memory |
| node_memory_MemAvailable_bytes | Available memory |
| node_disk_io_time_seconds_total | Disk I/O |
| node_filesystem_size_bytes | Filesystem size |
| node_network_receive_bytes_total | Network received |
cAdvisor¶
Collect container-level resource metrics.
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped
Container Metrics¶
| Metric | Description |
|---|---|
| container_cpu_usage_seconds_total | Container CPU |
| container_memory_usage_bytes | Container memory |
| container_network_receive_bytes_total | Network in |
| container_network_transmit_bytes_total | Network out |
| container_fs_usage_bytes | Filesystem usage |
Grafana Dashboards¶
Pre-configured Dashboards¶
Add dashboard provisioning. Create grafana/provisioning/dashboards/dashboard.yml:
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    options:
      path: /var/lib/grafana/dashboards
Datasource Provisioning¶
Create grafana/provisioning/datasources/datasource.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
Docker Compose with Provisioning¶
services:
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
Recommended Dashboard IDs¶
Import from grafana.com:
| Dashboard | ID | Purpose |
|---|---|---|
| Node Exporter Full | 1860 | Host metrics |
| Docker Container | 893 | Container stats |
| cAdvisor | 14282 | Container details |
| Prometheus Stats | 2 | Prometheus metrics |
Loki for Logs¶
Loki Setup¶
services:
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - ./loki-config.yml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped
Loki Configuration¶
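Create loki-config.yml. The keys below follow the Loki 2.x configuration layout; some of them (for example enforce_metric_name and shared_store) were removed in Loki 3.x, so pin a 2.x image tag or adapt the config if the latest image rejects it: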
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
Promtail Configuration¶
Create promtail-config.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: containerlogs
          __path__: /var/lib/docker/containers/*/*log

    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            attrs:
      - json:
          expressions:
            tag:
          source: attrs
      - regex:
          expression: (?P<container_name>(?:[a-zA-Z0-9][a-zA-Z0-9_.-]+))
          source: tag
      - labels:
          stream:
          container_name:
      - output:
          source: output
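To make the pipeline stages concrete, the sketch below applies the same steps to a single Docker json-file log line in Python: parse the JSON, read the tag from attrs, derive the stream and container_name labels, and keep only the log message. It is an illustration, not Promtail's implementation. The process_line helper and the sample line are assumptions, and the sample assumes the container was started with a tag log option so that attrs.tag is present.

```python
import json
import re

# Same pattern as the regex stage; Python uses the same named-group syntax.
TAG_PATTERN = re.compile(r"(?P<container_name>(?:[a-zA-Z0-9][a-zA-Z0-9_.-]+))")


def process_line(raw_line):
    """Mimic the pipeline: parse JSON, read the tag, derive labels, keep the message."""
    entry = json.loads(raw_line)
    # In a json-file log line, `attrs` is a nested JSON object (assumed here).
    tag = entry.get("attrs", {}).get("tag", "")
    labels = {"stream": entry.get("stream", "")}
    match = TAG_PATTERN.search(tag)
    if match:
        labels["container_name"] = match.group("container_name")
    return labels, entry.get("log", "")


# Illustrative json-file log line; the raw string keeps \n as a JSON escape.
sample_line = (
    r'{"log":"GET /health 200\n","stream":"stdout",'
    r'"attrs":{"tag":"myapp"},"time":"2024-01-01T00:00:00Z"}'
)

labels, message = process_line(sample_line)
print(labels, repr(message))
```

If a container is started without a tag log option, attrs is absent and the container_name label is simply not set, which is worth checking when a LogQL selector on container_name returns nothing.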
Alertmanager¶
Setup¶
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
Alertmanager Configuration¶
Create alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    # Add notification configuration

  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/xxx/xxx/xxx'
        channel: '#alerts'

  - name: 'email'
    email_configs:
      - to: 'alerts@example.com'
        from: 'prometheus@example.com'
        smarthost: 'smtp.example.com:587'
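The group_by: ['alertname'] setting means that alerts sharing the same alertname are batched into a single notification. The sketch below shows only that bucketing step in Python; it is a simplification rather than Alertmanager's implementation, it ignores timing (group_wait, group_interval, repeat_interval), and the group_alerts helper and sample alerts are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets by the values of the `group_by` labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Illustrative alerts, each represented by its label set.
alerts = [
    {"alertname": "HighCPUUsage", "name": "web"},
    {"alertname": "HighCPUUsage", "name": "worker"},
    {"alertname": "HighDiskUsage", "instance": "node-exporter:9100"},
]

for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Grouping by ['alertname', 'name'] instead would split the two HighCPUUsage alerts into separate notifications, one per container.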
Alert Rules¶
Create alert-rules.yml:
groups:
  - name: containers
    rules:
      - alert: ContainerDown
        # Fires per container when cAdvisor has not seen it for over a minute.
        # absent() would only fire when no container series exist at all and
        # would not carry the name label used in the summary.
        expr: time() - container_last_seen{name=~".+"} > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} is down"

      - alert: HighCPUUsage
        # More than 0.8 CPU cores used on average over 5 minutes.
        expr: sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.name }}"

      - alert: HighMemoryUsage
        # The "> 0" filter skips containers without a memory limit, whose
        # limit is reported as 0 and would otherwise divide to +Inf.
        expr: container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.name }}"

  - name: host
    rules:
      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage above 85%"

      - alert: HighLoad
        expr: node_load1 > 4
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load"
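Each rule's for: field delays firing: an alert is pending from the first evaluation where its expression is true, and it fires only once the expression has stayed true for the whole duration. The Python sketch below models that state machine for a single series. It is a simplified illustration of the behaviour, not Prometheus code, and the alert_states helper and sample values are assumptions.

```python
def alert_states(values, threshold, for_seconds, interval_seconds):
    """Return the alert state after each evaluation of `value > threshold`."""
    states = []
    active_since = None  # time at which the condition first became true
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # condition cleared, so the timer resets
            states.append("inactive")
    return states


# Memory ratio sampled every 15s against the 0.9 threshold, with `for`
# shortened to 30s to keep the example small.
ratios = [0.50, 0.95, 0.96, 0.97, 0.40]
print(alert_states(ratios, threshold=0.9, for_seconds=30, interval_seconds=15))
```

A short spike that clears before the for: duration elapses never leaves the pending state, which is why the rules above tolerate brief bursts.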
Update Prometheus to use alert rules:
# prometheus.yml
rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
Complete Monitoring Stack¶
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert-rules.yml:/etc/prometheus/alert-rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      # Matches the path configured in the dashboard provider above
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
    depends_on:
      - prometheus
      - loki
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    privileged: true
    restart: unless-stopped
    networks:
      - monitoring

  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - ./loki-config.yml:/etc/loki/local-config.yaml
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/config.yml
    command: -config.file=/etc/promtail/config.yml
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    restart: unless-stopped
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  alertmanager_data:
Useful Queries¶
Prometheus (PromQL)¶
# CPU usage per container
sum(rate(container_cpu_usage_seconds_total{name!=""}[5m])) by (name)
# Memory usage per container
container_memory_usage_bytes{name!=""} / 1024 / 1024
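# Container restarts in the last hour (counted as start-time changes,
# since cAdvisor does not export a dedicated restart counter)
changes(container_start_time_seconds{name!=""}[1h])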
# Network traffic
rate(container_network_receive_bytes_total[5m])
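Several of these queries wrap a counter in rate(), which turns an ever-increasing total into a per-second average over the window. The Python sketch below shows the core arithmetic on raw samples. It is a simplified model: it handles counter resets the way PromQL does, but it skips the extrapolation to window boundaries that PromQL applies, and the simple_rate helper and sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter across (timestamp, value) samples.

    A value lower than its predecessor is treated as a counter reset, so
    the counter is assumed to have restarted from zero. Unlike PromQL, no
    extrapolation to the window boundaries is applied.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# container_cpu_usage_seconds_total sampled every 15s (illustrative values):
# 6 CPU-seconds consumed over 30 seconds of wall-clock time.
cpu_samples = [(0, 10.0), (15, 13.0), (30, 16.0)]
print(simple_rate(cpu_samples))
```

For a CPU counter measured in seconds, the result reads as a fraction of one core, which is how the 0.8 threshold in the HighCPUUsage rule should be interpreted.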
Loki (LogQL)¶
# Container logs
{container_name="myapp"}
# Error logs
{container_name="myapp"} |= "error"
# JSON parsed
{container_name="myapp"} | json | level="error"
# Rate of errors
sum(rate({container_name="myapp"} |= "error" [5m]))
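The same LogQL expressions can be sent to Loki's HTTP API outside Grafana. The sketch below only builds the request URL for the /loki/api/v1/query_range endpoint, so it runs without a live Loki instance. The loki_query_range_url helper, the localhost base URL, and the timestamps are assumptions for illustration.

```python
from urllib.parse import urlencode


def loki_query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a GET URL for Loki's /loki/api/v1/query_range endpoint."""
    params = urlencode({
        "query": logql,
        "start": start_ns,  # nanosecond Unix epoch
        "end": end_ns,
        "limit": limit,
    })
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


url = loki_query_range_url(
    "http://localhost:3100",
    '{container_name="myapp"} |= "error"',
    1_700_000_000_000_000_000,
    1_700_000_300_000_000_000,
)
print(url)
```

The resulting URL can be fetched with curl or urllib.request.urlopen once Loki is reachable on port 3100; urlencode takes care of escaping the braces, quotes, and pipe characters in the selector.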
See Also¶
- Docker Compose - Compose reference
- Operations Monitoring - System monitoring
- journald - System logs