ZFS backups in Proxmox – Part 2

A while ago I wrote about trying out pve-zsync for backing up some Proxmox VE guests. I kept using regular Proxmox backups for the other machines, though: they are a robust way to get recoverable machine backups, but not a very elegant one. For example, every backup is a full backup; there’s no logic for managing incremental or differential backups. The last straw was a bug in the Proxmox web interface where these native full backups kept landing on my SSD-backed disk pool, which was stupid for two reasons: a) it gave me no on-site protection from disk failure, which after user error is the most likely reason to need a backup, and b) it used up valuable space on my most expensive pool. Needless to say, I scrapped that backup solution (and pve-zsync) completely.

My new solution is based entirely on Jim Salter’s excellent tools sanoid and syncoid. Sanoid now gives me hourly ZFS snapshots of all of my virtual machines and containers and of my base system, with timely purging of old snapshots. On my production server, syncoid makes sure these snapshots are replicated to my backup pool, and on my off-site server, syncoid pulls snapshots from the backup pool on the production server into its own backup pool. This means I have a better, cleaner, faster and, most importantly, working backup solution with considerably less clutter than before: a config file for sanoid and a few cron jobs to trigger syncoid in the right way.
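
To give an idea of what that looks like, here is a minimal sketch of the kind of sanoid policy and syncoid cron jobs I mean; the dataset layout, retention numbers, schedules and hostname below are placeholders rather than my actual configuration:

# /etc/sanoid/sanoid.conf – snapshot the VM datasets and apply a retention template
[datapool]
	use_template = production
	recursive = yes

[template_production]
	frequently = 0
	hourly = 36
	daily = 30
	monthly = 3
	yearly = 0
	autosnap = yes
	autoprune = yes

# Cron job on the production server: replicate snapshots to the local backup pool
15 * * * * root syncoid --recursive datapool backuppool/replica

# Cron job on the off-site server: pull from the production server's backup pool
45 * * * * root syncoid --recursive root@prod.example.com:backuppool/replica backuppool/replica

Sanoid itself is triggered by its packaged systemd timer (or an equivalent cron entry running sanoid --cron), so the only scheduling left to me is deciding when the two syncoid runs fire.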

ZFS backups in Proxmox

I’ve been experimenting with using ZFS snapshots for on- and off-site backups of my Proxmox virtualization environment. For now I’m leaning towards using pve-zsync for backing up my bigger but non-critical machines, and then using syncoid to achieve incremental pull backups off-site. After the initial seed – which I perform over a LAN link – only block-level changes need to be transferred, which a regular home connection with symmetric 100 Mbps should be more than capable of handling.
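
The pull itself is just a syncoid invocation on the off-site machine. As a rough sketch, assuming SSH access as root to the production host and that pve-zsync keeps its copies under backuppool/zsync (the hostname and target dataset here are made up):

# First run performs the full seed (done over the LAN); later runs only send incremental changes.
# --no-sync-snap makes syncoid rely on the snapshots pve-zsync already created instead of taking its own.
syncoid --recursive --no-sync-snap root@pve.example.lan:backuppool/zsync backuppool/offsite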

One limitation in pve-zsync I stumbled upon is that it will trip itself up if a VM has multiple disks stored on different ZFS pools. One of my machines was configured to have its EFI volume and root filesystem on SSD storage, while the bulk data drive was stored on a mechanical disk. This didn’t work at all, with an error message that wasn’t exactly crystal clear:

# pve-zsync create -source 105 -dest backuppool/zsync -name timemachinedailysync -maxsnap 14
Job --source 105 --name timemachinedailysync got an ERROR!!!
ERROR Message:
COMMAND:
	zfs send -- datapool/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01 | zfs recv -F -- backuppool/zsync/vm-105-disk-0
GET ERROR:
	cannot receive new filesystem stream: destination has snapshots (eg. backuppool/zsync/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01)
must destroy them to overwrite it

Of course, removing the snapshots in question didn’t help at all – but moving all of the machine’s disk images to a single ZFS pool solved the issue immediately.
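
For reference, the move itself can be done from the web interface or on the command line with qm; in this sketch the disk slot (scsi1) and the target storage ID (datapool) are assumptions rather than my actual configuration:

# Move VM 105's bulk data drive onto the same storage as its other disks and discard the old copy
qm move_disk 105 scsi1 datapool --delete 1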

The other problem is that while this program is VM-aware when backing up, it only performs ZFS snapshots on the actual dataset(s) backing the drive(s) of a VM or container – it doesn’t by itself back up the machine configuration. This means a potentially excellent recovery point objective (RPO), but the recovery time objective (RTO) suffers as a result: a critical service won’t get back online until someone creates an appropriate machine and connects the backed-up drives.
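
One stopgap I can think of is to copy the plain-text machine definitions Proxmox keeps under /etc/pve alongside the disk backups, for example from cron; the schedule and the destination path on the backup pool are placeholders:

# /etc/cron.d/pve-config-backup – copy VM and container definitions to the backup pool
0 3 * * * root rsync -a /etc/pve/qemu-server/ /backuppool/configs/qemu-server/
5 3 * * * root rsync -a /etc/pve/lxc/ /backuppool/configs/lxc/

That at least leaves the qemu-server and lxc config files next to the replicated disks, so recreating a machine becomes a matter of copying a file back into /etc/pve rather than rebuilding the configuration by hand.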

I will be experimenting with variations of the tools available to me, to see if I can simplify the restore process somewhat.