ZFS Troubleshooting¶
When things go wrong. Symptom-first, ordered by frequency.
"I rebooted and the pool isn't there"¶
Most common cause: the zfs-import-cache or zfs-mount service didn't run, or the cache file is stale.
# Is the pool importable?
sudo zpool import # list available pools (without importing)
# Import manually
sudo zpool import tank
# If it complains about device paths
sudo zpool import -d /dev/disk/by-id tank
Once imported, verify systemd unit health for next boot:
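One way to check this is to ask systemd for the state of each boot-time ZFS unit. The sketch below assumes the unit names shipped by the OpenZFS packages on systemd distributions; `zfs_unit_health` is an illustrative helper name, not part of ZFS.

```shell
# Print the enabled/active state of the boot-time ZFS units.
# Unit names are an assumption based on the stock OpenZFS systemd units.
zfs_unit_health() {
    for unit in zfs-import-cache.service zfs-import-scan.service \
                zfs-mount.service zfs.target; do
        enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
        active=$(systemctl is-active "$unit" 2>/dev/null)
        printf '%-28s enabled=%s active=%s\n' \
            "$unit" "${enabled:-unknown}" "${active:-unknown}"
    done
}

# Usage: zfs_unit_health
```

A unit that is `disabled` or `failed` here is the first thing to look at with `journalctl -u <unit>`.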
If zfs-import-cache.service is failing because the cache file is missing/stale, regenerate it:
sudo zpool export tank
sudo zpool import tank # re-imports and writes /etc/zfs/zpool.cache
sudo systemctl restart zfs-import-cache.service
If you specifically don't want a cache file (e.g. on systems where disks are swapped in and out, or device paths are unstable), switch to scan-based import:
The scan service scans all attached devices at boot instead of relying on the cache. Slower at boot but more robust.
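A minimal sketch of that switch, assuming the stock OpenZFS unit names and a pool called `tank` in the usage line. `switch_to_scan_import` is a hypothetical helper name. Note that the scan unit is typically conditioned on `/etc/zfs/zpool.cache` being absent or empty, so check that file afterwards if other pools still use it.

```shell
# Stop recording the pool in the cache file and let the scan unit
# discover it at boot instead.
switch_to_scan_import() {
    pool="$1"
    # cachefile=none removes this pool from the cache file.
    sudo zpool set cachefile=none "$pool"
    sudo systemctl disable zfs-import-cache.service
    sudo systemctl enable zfs-import-scan.service
}

# Usage: switch_to_scan_import tank
```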
"zpool import says pool I/O is currently suspended"¶
Usually means a leaf vdev disappeared mid-operation. Status:
If the device is missing, fix the underlying issue (cable, slot, etc.). If it's there but ZFS doesn't recognise it:
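Once the device is visible to the operating system again, clearing the pool's error state is the usual way to resume I/O. The helper below is a sketch with a hypothetical name (`resume_pool`); inspect `zpool status -v` before running it.

```shell
# Clear the error state of a pool whose device has come back,
# then show the resulting status.
resume_pool() {
    pool="$1"
    # On a suspended pool, clearing errors also asks ZFS to resume I/O.
    sudo zpool clear "$pool" || return 1
    zpool status -v "$pool"
}

# Usage: resume_pool tank
```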
"zpool import fails with: a vdev is missing"¶
sudo zpool import
pool: tank
id: 1234567890123456789
state: UNAVAIL
status: One or more devices are missing from the system.
Diagnose:
- A drive is unplugged, dead, or named differently after a kernel update.
- The pool was built with kernel device names and they reordered.
Try by-id discovery:
If the pool was originally created with /dev/sdb-style names, the cache file may have stale paths. Force re-search:
sudo zpool import -d /dev/disk/by-id tank
# After successful import:
sudo zpool export tank
sudo zpool import -d /dev/disk/by-id tank # this time writes the new cache
"Pool is DEGRADED"¶
A vdev lost redundancy but reads/writes still work:
Identify the bad device and follow Operations -> Disk replacement.
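To pick out the affected devices quickly, the config table of `zpool status` can be filtered for anything that is not `ONLINE`. This is a rough sketch: `unhealthy_vdevs` is a hypothetical helper, and it assumes the usual `NAME STATE READ WRITE CKSUM` table layout. Hot spares (state `AVAIL`) will also show up in its output.

```shell
# Read `zpool status` text on stdin and print "name state" for every
# row of the config table whose state is not ONLINE.
unhealthy_vdevs() {
    awk '
        $1 == "NAME" && $2 == "STATE"          { in_config = 1; next }
        in_config && NF == 0                   { in_config = 0 }
        in_config && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Usage: zpool status tank | unhealthy_vdevs
```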
For a no-redundancy pool (single-disk vdev), there's no DEGRADED state — there's only ONLINE or UNAVAIL. UNAVAIL = the pool is offline; data is unreachable until the disk comes back or is replaced (in which case you restore from backup).
"Pool is FAULTED / UNAVAIL"¶
The pool can't be opened. Several flavours:
Too many missing devices¶
state: FAULTED
status: One or more devices could not be used because the label is missing
or invalid. There are insufficient replicas for the pool to continue
functioning.
For a redundant pool, more leaves are missing than the topology can tolerate. Find them.
For a no-redundancy pool, any leaf missing is fatal.
Corrupted metadata¶
Try rewinding to an earlier transaction group:
-F rolls back the last few txgs. -X, used together with -F, allows rewinding further and discarding more state. Both discard everything written after the rewind point. Import read-only first for forensics:
This recovers when something corrupted the pool's recent metadata but earlier txgs were OK (rare; usually power-loss + flaky hardware).
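The rewind steps can be sketched as a small wrapper that goes from least to most invasive. `rewind_import` and its mode names are hypothetical; the flags are those of `zpool import` (`-n` with `-F` only reports whether a rewind would work). `-X` is deliberately left out of the sketch; add it by hand only when `-F` alone fails.

```shell
# Try a rewind import in one of three modes:
#   check    - dry run, nothing is imported
#   readonly - rewind and import read-only, for inspection
#   commit   - rewind and import read-write (discards newer txgs)
rewind_import() {
    pool="$1"
    mode="${2:-check}"
    case "$mode" in
        check)    sudo zpool import -F -n "$pool" ;;
        readonly) sudo zpool import -o readonly=on -F "$pool" ;;
        commit)   sudo zpool import -F "$pool" ;;
        *) echo "usage: rewind_import <pool> [check|readonly|commit]" >&2
           return 2 ;;
    esac
}

# Usage: rewind_import tank check
```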
Hostid mismatch¶
The pool was last imported on a host with a different hostid (/etc/hostid) and was not cleanly exported. Force the import:
This is normal after restoring a backup, moving disks between machines, or rebuilding the host with a different hostid.
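A forced import in this situation might look like the sketch below. `force_import` is a hypothetical helper; it prints the local hostid first so it can be compared with the one that `zpool import` typically reports for the last host. Only force the import when you are sure no other machine still has the pool open.

```shell
# Print this machine's hostid, then force-import the pool.
force_import() {
    pool="$1"
    echo "local hostid: $(hostid)"
    sudo zpool import -f "$pool"
}

# Usage: force_import tank
```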
"I can't destroy a snapshot — dataset is busy"¶
Something holds it open. The two common causes:
A clone¶
Destroy the clone first, or promote it (which moves the dependency).
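To find the clones that depend on a snapshot, the read-only `clones` property can be queried. The helper name `snapshot_clones` is an assumption for illustration, as is the handling of `-` as the placeholder for "no clones".

```shell
# List the clones that depend on a snapshot, one per line.
snapshot_clones() {
    snapshot="$1"
    zfs get -H -o value clones "$snapshot" |
        tr ',' '\n' |
        grep -v -e '^-$' -e '^$'
}

# Usage:
#   snapshot_clones <pool>/<dataset>@<snapshot>
#   sudo zfs promote <clone>      # or: sudo zfs destroy <clone>
```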
A hold¶
zfs holds <pool>/<dataset>@<snapshot>
# tag creation
# sentinel Sun May 17 14:23:11 2026
sudo zfs release sentinel <pool>/<dataset>@<snapshot>
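Releasing a hold never touches data; it only removes the protection that blocks zfs destroy. Still, find out what set the hold if you can, because backup and replication tools often hold snapshots they still need.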
"zpool status shows checksum errors"¶
ZFS detected bit-flips during a scrub or normal read.
The zpool status -v output lists affected file paths if data was permanently lost (no redundancy to repair from), or just shows error counts if ZFS healed them.
Actions:
- Note the affected disk(s). The leaf vdev with non-zero CKSUM counts is the culprit. Often it's the device whose SMART data also shows issues.
- Check SMART for the device:
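A SMART check could be run as in the sketch below, which assumes smartmontools is installed. `smart_check` is a hypothetical wrapper, and the device path in the usage line is a placeholder.

```shell
# Print the overall SMART health verdict (-H) and the vendor
# attribute table (-A) for one device.
smart_check() {
    device="$1"
    sudo smartctl -H -A "$device"
}

# Usage: smart_check /dev/disk/by-id/<device>
```

Reallocated or pending sector counts that keep growing usually point at the disk itself rather than at cabling or the controller.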
"zfs send" or syncoid errors out partway¶
The send stream broke. The receiver may have a partial state.
# On the receiver, find the resume token
zfs get receive_resume_token backup/foo
# On the sender, resume the same stream
sudo zfs send -t <token> | ssh backup-host 'sudo zfs receive -s backup/foo'
This requires that -s was passed to the original zfs receive (syncoid does this by default). Without -s, the destination doesn't know how to resume, and you'd have to start over.
If a resume isn't possible:
# Receiver: remove the partial dataset
sudo zfs destroy backup/foo
# Sender: re-send fresh (-s on the receiver keeps this transfer resumable)
sudo zfs send -R tank/foo@latest | ssh backup-host 'sudo zfs receive -s backup/foo'
"zfs receive: destination already exists"¶
You can:
- Use -F to force-receive, which destroys any state on the destination newer than the incoming snapshot's ancestor.
- Pick a different destination.
- Manually sudo zfs destroy -r backup/foo first (irreversible).
-F is appropriate when the destination is purely a replica and the source is authoritative.
"zfs unmount" fails with "umount: target is busy"¶
A process holds a file open on the dataset:
Stop the offending process, or force the unmount:
-f is brutal; pending writes may be lost. Prefer fixing the underlying process. Common culprits are Docker containers (docker stop ...), shells (cd out of the directory), and forgotten tail -f sessions.
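Finding the process can be sketched as follows, assuming `fuser` from psmisc is available. `who_holds` is a hypothetical helper that looks up the dataset's mountpoint and lists every process using that mount.

```shell
# Show the processes that keep a dataset's mountpoint busy.
who_holds() {
    dataset="$1"
    mountpoint=$(zfs get -H -o value mountpoint "$dataset") || return 1
    # -m: everything on this mount, -v: include user, PID and command
    sudo fuser -vm "$mountpoint"
}

# Usage:
#   who_holds tank/foo
#   sudo zfs unmount -f tank/foo   # last resort, see the warning above
```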
"zfs mount" fails with "filesystem already mounted"¶
ZFS thinks the dataset is mounted but findmnt shows it isn't, or vice versa. Recover state:
If a directory exists at the mountpoint with non-ZFS contents, ZFS refuses to mount over it. Investigate that directory before forcing.
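A quick way to see the disagreement is to print ZFS's view and the kernel's view side by side. The sketch below uses the `mounted` and `mountpoint` properties and `findmnt`; `mount_state` is a hypothetical helper name.

```shell
# Compare what ZFS believes with what the kernel mount table says.
mount_state() {
    dataset="$1"
    zfs_view=$(zfs get -H -o value mounted "$dataset")
    mountpoint=$(zfs get -H -o value mountpoint "$dataset")
    if findmnt --noheadings --source "$dataset" >/dev/null 2>&1; then
        kernel_view=yes
    else
        kernel_view=no
    fi
    echo "zfs=$zfs_view kernel=$kernel_view mountpoint=$mountpoint"
}

# Usage: mount_state tank/foo
```

If the two views differ, `sudo zfs unmount tank/foo` followed by `sudo zfs mount tank/foo` is a reasonable first attempt at getting them back in sync.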
"Pool is full but df says I have space"¶
ZFS reserves the last few percent of pool capacity for metadata. zfs list is honest about what you can use; df reports the underlying filesystem-like number and can mislead.
Compare with snapshot usage:
Old snapshots that uniquely hold large amounts of data are the usual culprits. Destroy the oldest first.
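Both views can be pulled with `zfs list`. The sketch below wraps them in a hypothetical `space_report` helper: the first command breaks usage down per dataset (`-o space`), the second lists the snapshots that hold the most unique data.

```shell
# Show where the space went: per-dataset breakdown, then the twenty
# snapshots with the largest "used" value.
space_report() {
    pool="$1"
    zfs list -o space -r "$pool"
    zfs list -t snapshot -o name,used,referenced -S used -r "$pool" | head -n 20
}

# Usage: space_report tank
```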
"I deleted important data; can I restore from a snapshot?"¶
If you have a snapshot from before the deletion:
# Find it
zfs list -t snapshot -o name,creation tank/<dataset>
# Browse it
ls /mnt/tank/<dataset>/.zfs/snapshot/<snap-name>/
# Copy a specific file back
cp /mnt/tank/<dataset>/.zfs/snapshot/<snap>/path/to/file /mnt/tank/<dataset>/path/to/file
If you need to roll the entire dataset back:
This is irreversible — newer snapshots and changes since <snap> are gone.
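Before rolling back, it helps to list exactly which snapshots would be destroyed. The sketch below prints every snapshot of the dataset that is newer than the target; `snapshots_lost_by_rollback` is a hypothetical helper, and the dataset and snapshot names in the usage lines are placeholders.

```shell
# Print the snapshots that `zfs rollback -r` to the given target
# would destroy, i.e. everything created after it.
snapshots_lost_by_rollback() {
    target="$1"                 # e.g. tank/data@monday
    dataset="${target%@*}"
    zfs list -H -t snapshot -o name -s creation -d 1 "$dataset" |
        awk -v t="$target" 'found { print } $0 == t { found = 1 }'
}

# Usage:
#   snapshots_lost_by_rollback tank/<dataset>@<snap>
#   sudo zfs rollback -r tank/<dataset>@<snap>
```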
"send is mind-numbingly slow"¶
Profile:
sudo zfs send -nv tank/foo@s2 # confirm size
sudo zfs send tank/foo@s2 | pv > /dev/null # measure source rate
Common causes / fixes:
- CPU on the SSH cipher: switch to chacha20-poly1305@openssh.com (ssh -c chacha20-poly1305@openssh.com ...).
- Compression already negotiated to gzip on SSH: turn it off (-o Compression=no); ZFS compression already minimised what's on the wire.
- No -c flag: add -c to zfs send to keep compressed blocks compressed in transit.
- No -L flag: large datasets with recordsize=1M benefit from -L.
- Receiving side sync writes: ensure the receive target dataset isn't sync=always.
- Network bottleneck: iperf3 between hosts to confirm raw throughput.
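Putting the send-side and SSH-side fixes together gives a pipeline along these lines. It is a sketch, not a tuned recipe: `fast_send` is a hypothetical helper, `pv` is assumed to be installed, and the host and dataset names in the usage line are the ones used elsewhere on this page.

```shell
# Send one snapshot with compressed (-c) and large (-L) blocks, show
# throughput with pv, and use a cheap SSH cipher without SSH compression.
fast_send() {
    snapshot="$1"
    host="$2"
    dest="$3"
    sudo zfs send -c -L "$snapshot" |
        pv |
        ssh -c chacha20-poly1305@openssh.com -o Compression=no "$host" \
            "sudo zfs receive -s $dest"
}

# Usage: fast_send tank/foo@s2 backup-host backup/foo
```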
"syncoid says it can't find a common snapshot"¶
The source and destination have no shared ancestor. Either:
- Use --no-sync-snap to retry against an existing snapshot rather than auto-creating one.
- Run with -r (--recursive) from the parent dataset.
- Recreate the destination from scratch and full-send.
"ARC is huge / system is sluggish"¶
The default ARC cap is 50% of RAM — 64 GB on a 128 GB box. Cap it:
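# Persist a 16 GiB cap across reboots (this overwrites any existing /etc/modprobe.d/zfs.conf)
echo 'options zfs zfs_arc_max=17179869184' | sudo tee /etc/modprobe.d/zfs.conf
# Apply the cap to the running system
echo 17179869184 | sudo tee /sys/module/zfs/parameters/zfs_arc_max
# Rebuild the initramfs so the option is also seen at early boot
sudo update-initramfs -u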
See Pool Creation -> ARC and Tuning -> ARC.
"I can't load encryption keys after a reboot"¶
Causes:
- The key file is missing (keylocation=file:///... and the file isn't there).
- You're typing the wrong passphrase. (There is no recovery; see Encryption -> Lost passphrase.)
- A zfs change-key was performed; the old passphrase no longer works.
Check keylocation:
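The relevant properties can be read in one go, and a key can be loaded from an interactive prompt even when `keylocation` points at a missing file. Both helpers below are hypothetical names wrapping standard `zfs` subcommands; `tank/secret` in the usage lines is a placeholder dataset.

```shell
# Show where ZFS expects the key, in what format, and whether it is loaded.
key_report() {
    zfs get -H -o name,property,value keylocation,keyformat,keystatus "$1"
}

# Load the key from a prompt, overriding keylocation for this one call,
# then mount the dataset.
load_key_from_prompt() {
    sudo zfs load-key -L prompt "$1" && sudo zfs mount "$1"
}

# Usage:
#   key_report tank/secret
#   load_key_from_prompt tank/secret
```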
"DKMS rebuild failed after kernel upgrade"¶
Symptom: zfs.ko failed to build, modprobe fails, pool can't import.
# Diagnose
sudo dkms status
# Rebuild by reinstalling the package
sudo apt-get install --reinstall zfs-dkms
# Or build manually
sudo dkms remove zfs/<version> --all
sudo dkms install zfs/<version>
# Confirm the module loads
sudo modprobe zfs
If the kernel upgrade was the issue, look in /var/lib/dkms/zfs/<version>/build/make.log for the actual failure (often it's a missing kernel header package).
"zfs commands hang"¶
A frozen ZFS thread (deadlock, hardware fault) can hang admin commands. Check:
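One generic check is to look for processes stuck in uninterruptible sleep (state `D`), which is where hung ZFS commands usually end up. The sketch assumes the procps `ps`; `blocked_procs` is a hypothetical helper name.

```shell
# List processes in uninterruptible sleep, with the kernel function
# they are waiting in (WCHAN). The header line is kept for readability.
blocked_procs() {
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
}

# Usage:
#   blocked_procs
#   sudo dmesg | grep -i -E 'zfs|blocked for more than'
```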
For a hardware-induced lock-up: there's not much you can do live. Reboot, watch the imports, replace the bad disk if dmesg blames a specific device.
If zpool hangs while no I/O is happening, sometimes it's a hung systemd-zfs unit. Restart:
"I broke /etc/zfs/zpool.cache"¶
The cache file holds the list of pools to auto-import. If it's corrupted, the pool won't be imported automatically at boot, but you can still import manually:
If the cache file is completely gone, zpool import (without arguments) scans for available pools — slower but works.
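A manual import followed by a cache rewrite could look like this sketch. `rebuild_cachefile` is a hypothetical helper; it relies on the fact that setting the `cachefile` property makes ZFS write the pool's current configuration to that file.

```shell
# Import the pool if needed, then rewrite the default cache file.
rebuild_cachefile() {
    pool="$1"
    # Skip the import when the pool is already imported.
    zpool list -H -o name "$pool" >/dev/null 2>&1 ||
        sudo zpool import "$pool" || return 1
    sudo zpool set cachefile=/etc/zfs/zpool.cache "$pool"
}

# Usage: rebuild_cachefile tank
```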
When to nuke and pave¶
There are scenarios where the right answer is "destroy the pool and restore from backup":
- Massive metadata corruption that -F / -X can't recover.
- Multiple disk failures beyond the topology's redundancy.
- An accidental zpool destroy that zpool import -D can no longer undo (for example, because the disks have since been reused).
This is why the Backup & Recovery strategy includes an off-host replica. If everything goes wrong, you reinstall the host, create a fresh pool, and zfs receive from the backup target.
The Rebuild Checklist treats this as the worst-case scenario.
When to ask for help¶
Beyond the OpenZFS man pages and the OpenZFS GitHub issues, the most helpful resources are:
- OpenZFS docs: canonical reference.
- OpenZFS GitHub Discussions: active community.
- reddit r/zfs: practical Q&A.
- man 8 zfs and man 8 zpool: well-written and detailed.
For data-recovery scenarios beyond "rollback to a snapshot", paid recovery services exist. Local backups are cheaper.