Monitoring¶

Hardware health monitoring and alerting for proactive maintenance.

Hardware Monitoring¶

lm-sensors¶

Install and configure hardware sensors:

sudo apt install -y lm-sensors
sudo sensors-detect --auto

View current readings:

sensors

Example output:

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +45.0°C

nvme-pci-0100
Adapter: PCI adapter
Composite:    +38.9°C

amdgpu-pci-0600
Adapter: PCI adapter
edge:         +42.0°C

Temperature Thresholds¶

Component	Normal	Warning	Critical
CPU (Tctl)	< 70 °C	70-85 °C	> 85 °C
GPU (edge)	< 75 °C	75-90 °C	> 90 °C
NVMe	< 50 °C	50-70 °C	> 70 °C
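The thresholds above can be applied mechanically. The following is a minimal sketch, not part of the original setup: `classify_temp` and `sensor_temp` are hypothetical helper names, and the parser assumes the `label: +NN.N°C` layout shown in the example output (it drops the sign, so it is not suitable for sub-zero readings).

```shell
# Classify a temperature reading against the thresholds in the table above.
# Usage: classify_temp <cpu|gpu|nvme> <degrees C>
classify_temp() {
    local component="$1" temp="${2%.*}"   # drop any fractional part: 45.0 -> 45
    local warn crit
    case "$component" in
        cpu)  warn=70; crit=85 ;;
        gpu)  warn=75; crit=90 ;;
        nvme) warn=50; crit=70 ;;
        *)    echo "unknown component: $component" >&2; return 2 ;;
    esac
    if   [ "$temp" -gt "$crit" ]; then echo "critical"
    elif [ "$temp" -ge "$warn" ]; then echo "warning"
    else echo "normal"
    fi
}

# Print the first reading for a label (Tctl, edge, Composite) from `sensors`
# text on stdin, keeping only digits and the decimal point.
sensor_temp() {
    awk -v label="$1:" '$1 == label { gsub(/[^0-9.]/, "", $2); print $2; exit }'
}
```

With both functions loaded, `classify_temp cpu "$(sensors | sensor_temp Tctl)"` prints `normal`, `warning`, or `critical` for the current CPU reading.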

Continuous Monitoring¶

Watch temperatures in real-time:

watch -n 2 sensors

Disk Health¶

smartmontools¶

Install SMART monitoring tools:

sudo apt install -y smartmontools

NVMe Health¶

# Overall health
sudo smartctl -H /dev/nvme0n1

# Detailed info
sudo smartctl -a /dev/nvme0n1

Key metrics to watch:

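Percentage Used: Wear estimate; 0% is new and 100% means the rated endurance is consumed (the value can exceed 100%)
Available Spare: Remaining spare capacity as a percentage; compare it with Available Spare Threshold
Temperature: Operating temperature
Media and Data Integrity Errors: Should be 0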
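These fields can be checked in a script by parsing the text report. The sketch below assumes the field labels that `smartctl -a` prints for NVMe devices (the media error counter appears as `Media and Data Integrity Errors`). The function name `nvme_check` and the 80% wear limit are illustrative assumptions, not values from this guide.

```shell
# Summarise key NVMe health fields from `smartctl -a` text on stdin.
# The 80% wear limit is an illustrative assumption.
nvme_check() {
    awk -F': *' '
        /^Percentage Used:/                 { used  = $2 + 0 }
        /^Available Spare:/                 { spare = $2 + 0; seen = 1 }
        /^Available Spare Threshold:/       { limit = $2 + 0 }
        /^Media and Data Integrity Errors:/ { errs  = $2 + 0 }
        END {
            if (!seen) { print "unknown: no NVMe health fields found"; exit }
            status = "ok"
            if (used >= 80)     { print "WARNING: wear at " used "%"; status = "warn" }
            if (spare <= limit) { print "CRITICAL: spare " spare "% at or below threshold " limit "%"; status = "crit" }
            if (errs > 0)       { print "CRITICAL: " errs " media errors"; status = "crit" }
            if (status == "ok") print "ok: used=" used "% spare=" spare "% media_errors=" errs
        }'
}
```

Usage: `sudo smartctl -a /dev/nvme0n1 | nvme_check`.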

SATA Drive Health¶

# Health summary
sudo smartctl -H /dev/sda

# Full report with error log
sudo smartctl -a /dev/sda

Key attributes:

Attribute	Good	Warning
Reallocated_Sector_Ct	0	Any increase
Current_Pending_Sector	0	> 0
Offline_Uncorrectable	0	> 0
UDMA_CRC_Error_Count	0	Increasing (cable issue)
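The table can be turned into a quick check by reading the raw value of each attribute. This is a sketch under two assumptions: the attribute table comes from `smartctl -A` with the raw value in the tenth column, and a non-zero raw value stands in for "any increase" (tracking real increases would need the previous value stored somewhere). The name `ata_attr_check` is hypothetical.

```shell
# Flag the attributes from the table above when their raw value is non-zero.
# Reads `smartctl -A` text on stdin; returns non-zero if anything is flagged.
ata_attr_check() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|UDMA_CRC_Error_Count)$/ && $10 + 0 > 0 {
            print "WARNING: " $2 " raw value " $10
            bad = 1
        }
        END { exit bad }'
}
```

Usage: `sudo smartctl -A /dev/sda | ata_attr_check || echo "attention needed"`.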

Enable SMART Daemon¶

sudo systemctl enable --now smartd

Configure /etc/smartd.conf for email alerts:

DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,55 -m root
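For reference, the same directive is shown below with each option annotated. The descriptions summarise the smartd.conf manual page; the directive itself is unchanged.

```
# -a             monitor all SMART properties (health status, attributes, error and self-test logs)
# -o on          enable automatic offline data collection
# -S on          enable attribute autosave
# -n standby,q   skip the check while a disk is spun down, without logging a message
# -s (...)       short self-test daily at 02:00, long self-test every Saturday at 03:00
# -W 4,45,55     log temperature changes of 4 degrees or more, log at 45 C, warn and mail at 55 C
# -m root        mail warnings to root
DEVICESCAN -a -o on -S on -n standby,q -s (S/../.././02|L/../../6/03) -W 4,45,55 -m root
```

Restart the daemon after editing the file (`sudo systemctl restart smartd`). Adding `-M test` to the directive temporarily should make smartd send a test message at startup, which is a convenient way to confirm mail delivery.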

ZFS Scrub Scheduling¶

Scrubs read every block, verify its checksum, and repair silent corruption when the pool has redundancy to rebuild from:

# Check last scrub status
zpool status tank

# Manual scrub
sudo zpool scrub tank

# Check scrub progress
zpool status | grep scan
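The `scan:` line can also be reduced to a one-word result for scripting. This sketch assumes the wording OpenZFS uses for that line (`scrub repaired ... with N errors` and `scrub in progress`); `scrub_result` is a hypothetical helper name.

```shell
# Report the outcome of the last scrub from `zpool status` text on stdin.
scrub_result() {
    awk '
        /scan: scrub in progress/ { print "in progress"; found = 1; exit }
        /scan: scrub repaired/ {
            # The error count is the word that follows "with".
            for (i = 1; i < NF; i++) if ($i == "with") errors = $(i + 1) + 0
            print (errors == 0 ? "clean" : "errors: " errors)
            found = 1; exit
        }
        END { if (!found) print "no scrub recorded" }'
}
```

Usage: `zpool status tank | scrub_result`.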

Schedule monthly scrubs via systemd timer:

# /etc/systemd/system/zfs-scrub.timer
[Unit]
Description=Monthly ZFS scrub

[Timer]
OnCalendar=monthly
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/zfs-scrub.service
[Unit]
Description=ZFS scrub

[Service]
Type=oneshot
ExecStart=/sbin/zpool scrub tank

Enable the timer:

sudo systemctl enable --now zfs-scrub.timer

Resource Monitoring¶

Memory and Swap¶

# Current usage
free -h

# Detailed breakdown
head -20 /proc/meminfo

Monitor for memory pressure:

# Memory available (not just free)
awk '/MemAvailable/ {print $2/1024 " MB"}' /proc/meminfo

# Swap usage (should be minimal)
swapon --show
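An absolute figure in MB is easier to judge relative to total memory. The sketch below computes MemAvailable as a percentage of MemTotal; `mem_available_pct` is a hypothetical helper name, and what counts as "low" is left to the reader.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo-style text on stdin.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

Usage: `mem_available_pct < /proc/meminfo`.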

CPU Utilization¶

# Quick overview
top -bn1 | head -5

# Per-core usage (mpstat is part of the sysstat package)
mpstat -P ALL 1 1

# Average load
uptime

Load average guidelines for an 8-core system:

Load	Status
< 8	Normal
8-16	High
> 16	Overloaded
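The table generalises to any core count: below 1x the cores is normal, up to 2x is high, and anything above is overloaded. A minimal sketch of that rule follows; `classify_load` is a hypothetical helper name, and the 1x and 2x multipliers are simply the 8-core figures above divided by eight.

```shell
# Classify a load average against a core count.
# Usage: classify_load <load average> <core count>
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if      (load < cores)      print "normal"
        else if (load <= 2 * cores) print "high"
        else                        print "overloaded"
    }'
}
```

Usage: `classify_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"` classifies the 1-minute load average of the current machine.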

Disk Capacity¶

# Filesystem usage
df -h

# ZFS pool capacity
zpool list

# Dataset breakdown
zfs list -o name,used,avail,refer

Capacity thresholds:

Usage	Action
< 70%	Normal
70-80%	Plan cleanup
80-90%	Cleanup required
> 90%	Critical

ZFS Performance

ZFS performance degrades significantly above 80% capacity. Keep pools below 80% full.
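The capacity thresholds can be applied to every pool at once. This sketch reads the tab-separated output of `zpool list -H -o name,capacity`; `pool_capacity_report` is a hypothetical helper name. The table leaves exactly 80% ambiguous, so the sketch assumes it falls under "cleanup required", in line with the advice to stay below 80%.

```shell
# Map each pool's capacity to the actions in the table above.
# Reads `zpool list -H -o name,capacity` lines (tab-separated) on stdin.
pool_capacity_report() {
    awk -F'\t' '{
        cap = $2 + 0                      # "82%" -> 82
        if      (cap > 90)  action = "critical"
        else if (cap >= 80) action = "cleanup required"
        else if (cap >= 70) action = "plan cleanup"
        else                action = "normal"
        printf "%s\t%d%%\t%s\n", $1, cap, action
    }'
}
```

Usage: `zpool list -H -o name,capacity | pool_capacity_report`.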

cgroup Resource View¶

# Interactive cgroup resource monitor
systemd-cgtop

# Docker container resources
docker stats --no-stream

Alerting¶

Health Check Script¶

Create a monitoring script:

#!/bin/bash
# /usr/local/bin/health-check.sh

set -euo pipefail

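# mktemp avoids a predictable file name in /tmp; the trap removes it on exit
ALERT_FILE="$(mktemp /tmp/health-alerts.XXXXXX)"
trap 'rm -f "$ALERT_FILE"' EXIT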

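# Check CPU temperature (first Tctl reading; empty when no sensor is found)
CPU_TEMP=$(sensors 2>/dev/null | awk '/Tctl/ {print int($2); exit}' || true)
if [ -z "$CPU_TEMP" ]; then
    echo "WARNING: CPU temperature unavailable" >> "$ALERT_FILE"
elif [ "$CPU_TEMP" -gt 85 ]; then
    echo "CRITICAL: CPU temperature ${CPU_TEMP}C" >> "$ALERT_FILE"
elif [ "$CPU_TEMP" -ge 70 ]; then
    echo "WARNING: CPU temperature ${CPU_TEMP}C" >> "$ALERT_FILE"
fi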

# Check ZFS pool health
POOL_HEALTH=$(zpool status -x)
if [ "$POOL_HEALTH" != "all pools are healthy" ]; then
    echo "CRITICAL: ZFS pool issue detected" >> "$ALERT_FILE"
    zpool status >> "$ALERT_FILE"
fi

# Check disk space
ZFS_CAP=$(zpool list -H -o capacity tank | tr -d '%')
if [ "$ZFS_CAP" -gt 90 ]; then
    echo "CRITICAL: ZFS pool at ${ZFS_CAP}%" >> "$ALERT_FILE"
elif [ "$ZFS_CAP" -gt 80 ]; then
    echo "WARNING: ZFS pool at ${ZFS_CAP}%" >> "$ALERT_FILE"
fi

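# Check SMART health
for disk in /dev/nvme?n1 /dev/sd?; do
    [ -b "$disk" ] || continue
    # smartctl's exit status is a bitmask, so judge health from its output
    SMART_OUT=$(sudo smartctl -H "$disk" || true)
    if ! grep -Eq 'PASSED|OK' <<< "$SMART_OUT"; then
        echo "CRITICAL: SMART failure on $disk" >> "$ALERT_FILE"
    fi
done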

# Check memory
MEM_AVAIL=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
if [ "$MEM_AVAIL" -lt 1024 ]; then
    echo "WARNING: Low memory (${MEM_AVAIL}MB available)" >> "$ALERT_FILE"
fi

# Output results
if [ -s "$ALERT_FILE" ]; then
    cat "$ALERT_FILE"
    exit 1
fi

echo "All systems healthy"
exit 0

Make executable:

sudo chmod +x /usr/local/bin/health-check.sh

Systemd Timer for Health Checks¶

# /etc/systemd/system/health-check.timer
[Unit]
Description=Hourly health check

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target

# /etc/systemd/system/health-check.service
[Unit]
Description=System health check

[Service]
Type=oneshot
ExecStart=/usr/local/bin/health-check.sh
StandardOutput=journal
StandardError=journal

Enable:

sudo systemctl enable --now health-check.timer

Notification Options¶

Email Alerts¶

Install msmtp to relay mail, plus bsd-mailx, which provides the mail command used to send alerts:

sudo apt install -y msmtp msmtp-mta bsd-mailx

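Configure /etc/msmtprc. The file holds a password, so if only root sends mail, restrict it with sudo chmod 600 /etc/msmtprc: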

defaults
auth           on
tls            on
tls_trust_file /etc/ssl/certs/ca-certificates.crt

account        default
host           smtp.example.com
port           587
from           server@example.com
user           smtp-user
password       smtp-password

Modify the health check to send email on failure. Place this before the final "Output results" block, because that block exits the script:

if [ -s "$ALERT_FILE" ]; then
    mail -s "Server Alert: $(hostname)" admin@example.com < "$ALERT_FILE"
fi

Webhook Alerts¶

For services like ntfy, Discord, or Slack:

# ntfy.sh example
if [ -s "$ALERT_FILE" ]; then
    curl -d "$(cat "$ALERT_FILE")" https://ntfy.sh/your-topic
fi

# Discord webhook (requires jq, which escapes quotes and newlines in the JSON)
if [ -s "$ALERT_FILE" ]; then
    jq -Rsc '{content: .}' "$ALERT_FILE" |
        curl -H "Content-Type: application/json" -d @- \
             https://discord.com/api/webhooks/xxx/yyy
fi

View Health Check Logs¶

# Recent checks
journalctl -u health-check.service --since "1 hour ago"

# Failed checks only
journalctl -u health-check.service -p err

Quick Reference¶

Task	Command
View temperatures	`sensors`
Check disk health	`sudo smartctl -H /dev/nvme0n1`
ZFS pool status	`zpool status`
Memory usage	`free -h`
Disk capacity	`zpool list`
Run health check	`/usr/local/bin/health-check.sh`
Check timer status	`systemctl list-timers`