Recovery: scenario router
A short list of "something broke; what do I do?" scenarios, with pointers to the canonical procedure for each. Most of the actual content lives in the Backup & Recovery, Rebuild Checklist, ZFS Operations, and ZFS Troubleshooting pages; this page is the way in.
Pick your scenario
| Scenario | First action | Canonical procedure |
|---|---|---|
| Accidentally deleted a file or directory | Find the most recent snapshot that still contains it, copy it back | ZFS Snapshots -> Reading from a snapshot |
| Bad service upgrade (Nextcloud, Authentik, etc.) | Stop service, roll back its dataset, restart | Docker Integration -> Snapshot before risky operations |
| Bad VM-side change (e.g. Windows Update broke the guest) | Stop VM, roll back its dataset, start | VM Storage -> When the VM corrupts its filesystem |
| Container won't start | docker compose down && docker compose up -d --force-recreate; data is on ZFS, not in the container | Docker Integration |
| ZFS pool reports DEGRADED / disk error | Read zpool status -v, plan disk replacement | ZFS Operations -> Disk replacement |
| ZFS pool won't import after reboot | Manual zpool import, check zfs-import services | ZFS Troubleshooting -> I rebooted and the pool isn't there |
| Host OS broken; ZFS pool fine | Reinstall Ubuntu, re-import pool, restore services | Rebuild Checklist |
| Disk failure on the no-redundancy pool | Replace disk, restore from off-host backup | Backup & Recovery -> Off-site target + Rebuild Checklist |
| Pool metadata corruption (FAULTED) | Try read-only import; rewind with -F; restore from backup if needed | ZFS Troubleshooting -> Pool is FAULTED / UNAVAIL |
| Lost ZFS encryption passphrase | No recovery is possible | ZFS Encryption -> Lost passphrase |
| Database (Postgres / MariaDB) corruption | Roll back DB dataset, or restore from SQL dump | See Database from snapshot or SQL dump below |
| Lost SSH access (locked out) | Console-rescue path or single-user GRUB boot | See Locked out of SSH below |
| Tailscale stopped working | Re-auth, check ACLs | Tailscale Troubleshooting |
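For the first row of the table, a deleted file can usually be copied straight out of the dataset's hidden .zfs/snapshot directory, with no rollback needed. The sketch below lists every snapshot copy of a path so you can pick one to restore. The helper name, the /mnt/tank/data mountpoint, and the file path in the comments are hypothetical, not taken from the linked pages.

```shell
#!/bin/sh
# find_in_snapshots MOUNTPOINT RELPATH
# Prints every snapshot copy of RELPATH found under MOUNTPOINT/.zfs/snapshot,
# one path per line. The glob sorts by snapshot name, so the order is
# chronological only if your snapshot names embed a sortable date.
find_in_snapshots() {
    mountpoint=$1
    relpath=$2
    for snap in "$mountpoint"/.zfs/snapshot/*; do
        # Skip snapshots that do not contain the path
        # (taken before it existed, or after it was deleted).
        if [ -e "$snap/$relpath" ]; then
            printf '%s\n' "$snap/$relpath"
        fi
    done
}

# Example with hypothetical paths (SNAPNAME is whichever copy you choose):
#   find_in_snapshots /mnt/tank/data docs/report.odt
#   cp -a /mnt/tank/data/.zfs/snapshot/SNAPNAME/docs/report.odt /mnt/tank/data/docs/
```

Snapshots are read-only, so copying out of them cannot damage the snapshot itself.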
Database from snapshot or SQL dump
Two paths depending on what's available.
From ZFS snapshot (fastest)
# Stop the container so it isn't writing during rollback
cd /path/to/compose/dir
docker compose stop postgres
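# Roll back the DB dataset. A plain rollback only accepts the newest snapshot;
# for an older one add -r, which also destroys every snapshot taken after it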
sudo zfs rollback tank/db/postgres@before-upgrade-2026-05-17
# Start again
docker compose start postgres
docker compose logs --tail=50 postgres
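The rollback is point-in-time consistent: the dataset returns to exactly the state of the transaction group (txg) in which the snapshot was taken. If Postgres was running when the snapshot was taken, that state is equivalent to a crash at that instant, and Postgres replays its WAL on the next start. This assumes the data directory and the WAL sit in the same dataset, or were snapshotted together.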
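Before rolling back, it is worth checking whether any snapshots are newer than the target, because zfs rollback refuses to skip over them unless -r is given, and -r destroys them. A minimal sketch, reusing the dataset and snapshot names from the example above; the helper name is hypothetical.

```shell
#!/bin/sh
# snapshots_newer_than DATASET SNAPNAME
# Prints the snapshots of DATASET created after DATASET@SNAPNAME, oldest first.
# Empty output with exit status 0 means a plain "zfs rollback" is enough;
# any names printed are the snapshots that "zfs rollback -r" would destroy.
# Exits non-zero if the target snapshot is not found at all.
snapshots_newer_than() {
    dataset=$1
    target="$dataset@$2"
    # -H: no header; -d 1: only this dataset's own snapshots;
    # -s creation: sort by creation time, oldest first.
    zfs list -H -t snapshot -o name -s creation -d 1 "$dataset" |
        awk -v target="$target" '
            seen { print }               # lines after the target snapshot
            $0 == target { seen = 1 }    # start printing from the next line
            END { if (!seen) exit 1 }    # target snapshot never appeared
        '
}

# Example:
#   snapshots_newer_than tank/db/postgres before-upgrade-2026-05-17
```

If the list is non-empty, decide whether losing those snapshots is acceptable before adding -r.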
From SQL dump (slower, more portable)
If you took a pg_dump / mysqldump before the bad change:
# Postgres
docker exec -i postgres-container psql -U user database < /mnt/tank/backups/db-2026-05-17.sql
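# MariaDB / MySQL
# With stdin redirected there is no TTY for the -p prompt, so pass the password explicitly.
# MARIADB_ROOT_PASSWORD is the official image's variable (an assumption; adjust if yours differs).
docker exec -i mariadb-container sh -c 'exec mysql -u root -p"$MARIADB_ROOT_PASSWORD" database' < /mnt/tank/backups/db-2026-05-17.sql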
For Postgres custom-format dumps (pg_dump -Fc), use pg_restore instead of psql; psql only reads plain SQL dumps.
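A minimal sketch of a custom-format restore, reusing the placeholder container, user, and database names from the psql example above. The wrapper function and the .dump filename are hypothetical. pg_restore reads the archive from standard input when no file argument is given, so the dump can stay on the host.

```shell
#!/bin/sh
# restore_pg_custom CONTAINER USER DATABASE DUMPFILE
# Streams a custom-format dump (made with pg_dump -Fc) from the host into
# pg_restore running inside the container.
restore_pg_custom() {
    container=$1
    user=$2
    database=$3
    dumpfile=$4
    if [ ! -r "$dumpfile" ]; then
        echo "cannot read dump: $dumpfile" >&2
        return 1
    fi
    # --clean --if-exists drops existing objects before recreating them;
    # leave both out when restoring into an empty database.
    docker exec -i "$container" \
        pg_restore -U "$user" -d "$database" --clean --if-exists \
        < "$dumpfile"
}

# Example with a hypothetical dump file:
#   restore_pg_custom postgres-container user database /mnt/tank/backups/db-2026-05-17.dump
```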
Locked out of SSH
If ssh user@host stops working after a config change:
- Try from a different source IP / Tailscale. Lockouts (e.g. fail2ban, pam_faillock) are usually per-source IP.
- Reach the box on its single HDMI output + keyboard. Use a recovery shell to undo the change:
# Reset the PAM faillock counter for root (repeat with --user NAME for other accounts)
sudo faillock --user root --reset
# Or roll back ssh config
sudo cp /etc/ssh/sshd_config.bak /etc/ssh/sshd_config
sudo systemctl restart ssh
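- If console login is also broken, reboot and edit the boot entry at the GRUB menu: append single or init=/bin/bash to the kernel line, drop to a root shell, and fix the config. With init=/bin/bash the root filesystem typically starts read-only, so remount it read-write (mount -o remount,rw /) before editing.
- As a last resort, boot from an Ubuntu live USB, mount the root filesystem, and edit /etc/ssh/sshd_config directly.

See PAM -> Recovery if root gets locked out for the faillock specifics.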
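Whichever path gets you a root shell, it helps to validate the config before restarting the daemon, so a second typo does not lock you out again. A sketch, assuming the sshd_config.bak backup from the example above exists; the function name is hypothetical. sshd -t only checks the configuration and host keys and does not start a server.

```shell
#!/bin/sh
# restore_sshd_config BACKUP [TARGET]
# Run as root. Validates BACKUP with "sshd -t" first, and only if it passes
# copies it over TARGET (default /etc/ssh/sshd_config) and restarts ssh.
restore_sshd_config() {
    backup=$1
    target=${2:-/etc/ssh/sshd_config}
    # -t: test mode; -f: check this file instead of the default config path.
    if ! sshd -t -f "$backup"; then
        echo "refusing to install $backup: sshd -t reported errors" >&2
        return 1
    fi
    cp "$backup" "$target" && systemctl restart ssh
}

# Example:
#   restore_sshd_config /etc/ssh/sshd_config.bak
```

Keep the console session open until a fresh SSH login has succeeded.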
Recovery drill schedule
Pick a cadence and stick to it. Skipping drills is how you find out your backups didn't actually work the day you really need them.
- Monthly: restore a single file from the most recent snapshot. Five minutes.
- Quarterly: clone a dataset to a new mountpoint and start the service against it — verifies the snapshot is actually usable, not just present.
- Quarterly: test the off-site backup by listing it and restoring one file.
- Annually: walk through the Rebuild Checklist on spare hardware or a VirtualBox lab (see ZFS VirtualBox Lab). Time how long it takes; that's your real RTO.
A drill that doesn't end in "I successfully read the data" doesn't count.
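The quarterly clone drill can be sketched as a small function that clones a snapshot, reads one file end to end, and cleans up. This is a reduced version that stops short of starting the service against the clone. The function name, the tank/drill/postgres clone name, the /mnt/drill mountpoint, and the choice of PG_VERSION as the check file are all hypothetical.

```shell
#!/bin/sh
# drill_clone SNAPSHOT CLONE MOUNTPOINT CHECKFILE
# Run as root. Clones SNAPSHOT to the new dataset CLONE mounted at MOUNTPOINT,
# reads CHECKFILE (relative to MOUNTPOINT) completely, then destroys the clone.
# Returns 0 only if the file could be read.
drill_clone() {
    snapshot=$1
    clone=$2
    mountpoint=$3
    checkfile=$4
    zfs clone -o mountpoint="$mountpoint" "$snapshot" "$clone" || return 1
    # Reading the whole file makes ZFS verify the checksum of every block read.
    if cat "$mountpoint/$checkfile" > /dev/null; then
        result=0
        echo "drill OK: read $checkfile from $snapshot"
    else
        result=1
        echo "drill FAILED: could not read $checkfile from $snapshot" >&2
    fi
    # The clone is disposable; the snapshot it came from is left untouched.
    zfs destroy "$clone"
    return "$result"
}

# Example with hypothetical names:
#   drill_clone tank/db/postgres@before-upgrade-2026-05-17 tank/drill/postgres /mnt/drill PG_VERSION
```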