ZFS backups in Proxmox

I’ve been experimenting with using ZFS snapshots for on- and off-site backups of my Proxmox virtualization environment. For now I’m leaning towards using pve-zsync for backing up my bigger but non-critical machines, and then using syncoid to achieve incremental pull backups off-site. After the initial seed – which I perform over a LAN link – only block-level changes need to be transferred, which a regular home connection at a symmetric 100 Mbps should be more than capable of handling.
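To put a rough number on that last claim, here is a minimal sketch that estimates how long a stream of a given size takes to transfer. The function name transfer_seconds is made up for this illustration, the host and pool names in the comments are hypothetical, and the calculation ignores SSH and protocol overhead.

```shell
# Estimate transfer time for a stream of a given size.
# Usage: transfer_seconds BYTES MBIT_PER_SECOND
transfer_seconds() {
    # bits to send divided by bits per second, rounded down
    echo $(( $1 * 8 / ($2 * 1000000) ))
}

# The byte count could come from a dry run such as:
#   zfs send -nvP -i pool/ds@older pool/ds@newer
# A pull from the off-site box might then look like this
# (hypothetical host and pool names):
#   syncoid --recursive root@pve.example.org:backuppool/zsync offsitepool/zsync

# Example: a 10 GiB incremental over a 100 Mbit/s uplink
# transfer_seconds 10737418240 100
```

With the numbers in the last comment the estimate is 858 seconds, a little over 14 minutes.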

One limitation in pve-zsync I stumbled upon is that it will trip itself up if a VM has multiple disks stored on different ZFS pools. One of my machines was configured to have its EFI volume and root filesystem on SSD storage, while the bulk data drive was stored on a mechanical disk. This didn’t work at all, with an error message that wasn’t exactly crystal clear:

# pve-zsync create --source 105 --dest backuppool/zsync --name timemachinedailysync --maxsnap 14
Job --source 105 --name timemachinedailysync got an ERROR!!!
ERROR Message:
COMMAND:
	zfs send -- datapool/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01 | zfs recv -F -- backuppool/zsync/vm-105-disk-0
GET ERROR:
	cannot receive new filesystem stream: destination has snapshots (eg. backuppool/zsync/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01)
must destroy them to overwrite it

Of course removing the snapshots in question didn’t help at all – but moving all disk images belonging to the machine to a single ZFS pool solved the issue immediately.
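A quick way to spot this situation before creating a job is to list the storages a machine's disks live on. The following is only a sketch: vm_disk_storages is a hypothetical helper, it only looks at the common disk keys of a QEMU config, and it prints Proxmox storage IDs, which still have to be mapped to ZFS pools via /etc/pve/storage.cfg.

```shell
# Print the distinct storage IDs used by the disks in a VM config
# read from stdin (CD-ROM drives are ignored).
vm_disk_storages() {
    grep -v 'media=cdrom' |
        sed -nE 's/^(scsi|sata|virtio|ide|efidisk)[0-9]+: ([^:,]+):.*/\2/p' |
        sort -u
}

# Example on a Proxmox host:
# vm_disk_storages < /etc/pve/qemu-server/105.conf
```

More than one line of output means the disks are spread over several storages, which is worth checking before pointing pve-zsync at the machine.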

The other problem is that while this program is VM-aware when backing up, it only takes ZFS snapshots of the actual dataset(s) backing the drive(s) of a VM or container – it doesn’t by itself back up the machine configuration. This means a potentially excellent recovery point objective (RPO), but the recovery time objective (RTO) will suffer as a result: a critical service won’t get back online until someone creates an appropriate machine and connects the backed-up drives.
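One possible stopgap is to copy the machine configuration alongside the disk backups. The sketch below assumes the usual Proxmox layout with configs under /etc/pve/qemu-server and /etc/pve/lxc; the function name and the destination naming scheme are invented for this example.

```shell
# Copy the config of a VM or container to a backup directory.
# Usage: backup_vm_config VMID DEST_DIR [PVE_DIR]
# Returns non-zero if no config for VMID was found.
backup_vm_config() (
    vmid=$1 dest=$2 pve_dir=${3:-/etc/pve}
    found=1
    mkdir -p "$dest" || exit 1
    for kind in qemu-server lxc; do
        conf="$pve_dir/$kind/$vmid.conf"
        if [ -f "$conf" ]; then
            cp "$conf" "$dest/$kind-$vmid.conf" && found=0
        fi
    done
    exit "$found"
)

# Example (hypothetical destination path):
# backup_vm_config 105 /backuppool/configs
```

With the config at hand, restoring becomes a matter of putting the file back and pointing its disk entries at the received datasets, rather than recreating the machine from memory.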

I will be experimenting with variations of the tools available to me, to see if I can simplify the restore process somewhat.

Moving Proxmox /boot to USB stick

Some short notes I made along the way to benefit the future me.

Background

On my new server, Proxmox was unable to boot directly to a ZFS file system on a drive connected via the HBA controller. UPDATE (2020-01-27): The Supermicro X10SRH-CLN4F motherboard boots just fine from a root-on-ZFS disk in UEFI mode from the built-in SAS HBA. The only required change is the last step in the description below: adding a delay before attempting to mount ZFS volumes at boot time.

There is a potential drawback to installing Proxmox in root-on-ZFS mode in a UEFI system: The drive gets partitioned, so ZFS doesn’t get uninhibited access to the entire block device. This may or may not make a difference for performance, but on SSD storage I haven’t really seen any cause for concern in my real-world use case. An alternative would be to install the underlying operating system to a separate physical drive.

Also note that the workaround below works on a single vFAT volume. Since FAT doesn’t support symlinks, kernel or initramfs updates in Proxmox/Debian will require some manual work, which most sane people would likely wish to avoid.

I’m leaving the rest of my article intact for posterity:


My workaround was to place /boot – not the system – on a USB stick connected directly to the motherboard.

Process

After installation, reboot with the Proxmox installation medium, but select Install Proxmox VE (Debug mode).

When the first shell appears, press Ctrl+D to have the system load the necessary drivers.

Check the name of the USB drive.

lsblk

Partition it.

cfdisk /dev/sdb

Clear the disk, create an EFI System partition and write the changes. Then create a FAT file system on the new partition

mkfs.vfat /dev/sdb1

Prepare to chroot into the installed Proxmox instance

mkdir /media/rescue
zpool import -fR /media/rescue rpool
mount -o bind /dev /media/rescue/dev
mount -o bind /sys /media/rescue/sys
mount -o bind /proc /media/rescue/proc
chroot /media/rescue

Make room for the new /boot

mv /boot /boot.bak

Edit /etc/fstab and add the following:

/dev/sdb1 /boot vfat defaults 0 0
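Since device names for USB sticks can change between boots, referring to the partition by UUID may be more robust than /dev/sdb1. A small sketch, with boot_fstab_line being a name chosen for this example:

```shell
# Print an fstab entry that mounts a vFAT partition on /boot by UUID.
# Usage: boot_fstab_line UUID
boot_fstab_line() {
    printf 'UUID=%s /boot vfat defaults 0 0\n' "$1"
}

# On the real system the UUID would come from blkid, e.g.:
# boot_fstab_line "$(blkid -s UUID -o value /dev/sdb1)" >> /etc/fstab
```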

Make the stick bootable

mkdir /boot
mount -a
cp -rL /boot.bak/. /boot/
grub-install --efi-directory=/boot/efi /dev/sdb
update-grub
grub-mkconfig -o /boot/grub/grub.cfg

Exit the chroot, export the ZFS pool (zpool export rpool) and reboot

In my specific case the boot got stuck in a shell with the ZFS pool not mountable. Importing the pool by hand from that shell worked:

/sbin/zpool import -Nf rpool

Exit the shell to continue the boot process. Then edit /etc/default/zfs and add a delay before the system attempts to mount the root file system.

ZFS_INITRD_PRE_MOUNTROOT_SLEEP=15

Then apply the new configuration:

update-initramfs -u
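If the change needs to be repeatable, for instance after a reinstall, the edit can be scripted. This is a sketch only: set_default is a hypothetical helper that replaces an existing (possibly commented-out) assignment or appends a new one, and it assumes the key contains no characters that are special to sed.

```shell
# Set KEY=VALUE in a shell-style defaults file.
# Usage: set_default KEY VALUE [FILE]
set_default() (
    key=$1 value=$2 file=${3:-/etc/default/zfs}
    if grep -q "^#*[[:space:]]*$key=" "$file"; then
        # Rewrite the existing line; write back through cat so the
        # original file keeps its owner and permissions.
        sed "s|^#*[[:space:]]*$key=.*|$key=$value|" "$file" > "$file.new" &&
            cat "$file.new" > "$file" &&
            rm -f "$file.new"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
)

# Example, followed by rebuilding the initramfs:
# set_default ZFS_INITRD_PRE_MOUNTROOT_SLEEP 15 && update-initramfs -u
```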

Replacing ZFS system drives in Proxmox

Running Proxmox in a root-on-zfs configuration in a RAID10 pool results in an interesting artifact: We need a boot volume from which to start our system and initialize the elements required to recognize a ZFS pool. In effect, the first mirror pair in our disk set will have (at least) two partitions: a regular filesystem on the first partition and a second partition to participate in the ZFS pool.

To see how it all works together, I tried failing a drive and replacing it with a different one.

Happy-case

Had the drives had identical sector sizes, the operation would have been simple. In this case, sdb is the good mirror volume and sda is the new, empty drive. We want to copy the working partition table from the good drive to the new one, and then randomize the GUIDs of the new drive and its partitions to avoid catastrophic confusion on the part of ZFS:

# sgdisk /dev/sdb -R /dev/sda
# sgdisk -G /dev/sda

After that, we should be able to use gdisk to view the partition table, to identify which partition does what, and simply copy the contents of the non-ZFS partitions from the good mirror to the new drive:

# gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): p
Disk /dev/sda: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 8-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34            2047   1007.0 KiB  EF02  
   2            2048      5860516749   2.7 TiB     BF01  zfs
   9      5860516750      5860533134   8.0 MiB     BF07  

Command (? for help): q
# dd if=/dev/sdb1 of=/dev/sda1
# dd if=/dev/sdb9 of=/dev/sda9

Then we would add the new disk to our ZFS pool and have it resilvered:

# zpool replace rpool /dev/sda2

To view the resilvering process:

# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep  1 18:48:13 2018
	2.43T scanned out of 2.55T at 170M/s, 0h13m to go
	1.22T resilvered, 94.99% done
config:

	NAME             STATE     READ WRITE CKSUM
	rpool            DEGRADED     0     0     0
	  mirror-0       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      old        UNAVAIL      0    63     0  corrupted data
	      sda2       ONLINE       0     0     0  (resilvering)
	    sdb2         ONLINE       0     0     0
	  mirror-1       ONLINE       0     0     0
	    sdc          ONLINE       0     0     0
	    sdd          ONLINE       0     0     0
	logs
	  sde1           ONLINE       0     0     0
	cache
	  sde2           ONLINE       0     0     0

errors: No known data errors

The process is time-consuming on large drives, but since ZFS understands both the underlying disk layout and the filesystem on top of it, resilvering will only occur on blocks that are in use, which may save us a lot of time, depending on how full our filesystem is.
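For keeping an eye on a long resilver from a script, the progress figure can be pulled out of the status output. A small sketch, assuming the "resilvered, NN.NN% done" wording shown above; resilver_progress is a name invented here.

```shell
# Print the "percent done" figure from `zpool status` output on stdin.
# Prints nothing if no resilver is in progress.
resilver_progress() {
    sed -nE 's/.*resilvered, ([0-9.]+)% done.*/\1/p'
}

# Example:
# zpool status rpool | resilver_progress
```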

When resilvering is done, we’ll just make sure there’s something to boot from on the new drive:

# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.

Real life intervenes

Unfortunately for me, the new drive I tried had the modern 4 KB sector size (“Advanced Format / 4Kn”), while my old drives were stuck with the older 512 B standard. This led to the interesting side effect that my new drive was too small to fit volumes according to the healthy mirror drive’s partition table:

# sgdisk /dev/sdb -R /dev/sda
Caution! Secondary header was placed beyond the disk's limits! Moving the header, but other problems may occur!

In the end, what I ended up doing was to use gdisk to create a new partition table with volume sizes for partitions 1 and 9 as similar as possible to those of the healthy mirror (but not smaller!), entirely skipping the steps involving the sgdisk utility. The rest of the steps were identical.
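The arithmetic behind the mismatch is worth spelling out. GPT entries store start and end positions in units of the logical sector size, so sector numbers copied from a 512-byte disk mean something eight times larger on a 4096-byte disk. The sketch below converts sector numbers between the two; the function name is made up, and it assumes a new drive of the same nominal capacity as the old one.

```shell
# Convert a sector number from 512-byte units to 4096-byte units.
# Fails (non-zero status) if the offset is not 4 KiB aligned.
lba_512_to_4k() {
    if [ $(( $1 % 8 )) -ne 0 ]; then
        echo "sector $1 is not 4 KiB aligned" >&2
        return 1
    fi
    echo $(( $1 / 8 ))
}

# The logical sector size of a drive can be checked with:
#   blockdev --getss /dev/sda
```

The old drive's 5860533168 sectors of 512 bytes correspond to only 732566646 sectors of 4096 bytes, far fewer than the end sectors in the copied table. The start of partition 2 (sector 2048) translates cleanly to sector 256, while the start of partition 1 (sector 34) is not 4 KiB aligned at all, which fits with having to lay out the partitions by hand.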

The next problem I encountered was a bit worse: Even though ZFS in the Proxmox VE installation managed 4Kn drives just fine, there was simply no way to get the HP MicroServer Gen7 host to boot from one, so back to the old 3 TB WD RED I went.

Conclusion

Running root-on-zfs in a striped mirrors (“RAID10”) configuration complicates the replacement of any of the drives in the first mirror pair slightly compared to a setup where the ZFS pool is used for storage only.

Fortunately the difference is minimal, and except for the truly dangerous syntax and unclear documentation of the sgdisk command, replacing a boot disk really boils down to four steps:

  1. Make sure the relevant partitions exist.
  2. Copy non-ZFS-data from the healthy drive to the new one.
  3. Resilver the ZFS volume.
  4. Install GRUB.

In a pure data disk, the only thing we have to think about is step 3.
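The four steps can be sketched as a dry-run helper that only prints the commands for review instead of running them. It is an illustration under assumptions rather than a tested procedure: replace_boot_disk_plan is a hypothetical name, the partition numbers 1, 2 and 9 are taken from the layout shown earlier, and the simple suffix scheme only fits /dev/sdX-style device names.

```shell
# Print (not run) the commands for replacing a boot disk in the first mirror.
# Usage: replace_boot_disk_plan GOOD_DISK NEW_DISK [POOL]
replace_boot_disk_plan() (
    good=$1 new=$2 pool=${3:-rpool}
    if [ -z "$good" ] || [ -z "$new" ] || [ "$good" = "$new" ]; then
        echo "need two different disks: GOOD_DISK NEW_DISK" >&2
        exit 1
    fi
    # 1. Partitions: copy the table from the good disk, then assign new GUIDs
    echo "sgdisk --replicate=$new $good"
    echo "sgdisk --randomize-guids $new"
    # 2. Non-ZFS data
    echo "dd if=${good}1 of=${new}1"
    echo "dd if=${good}9 of=${new}9"
    # 3. Resilver
    echo "zpool replace $pool ${new}2"
    # 4. Boot loader
    echo "grub-install $new"
)

# Example:
# replace_boot_disk_plan /dev/sdb /dev/sda
```

Spelling out sgdisk's long options and refusing identical source and target arguments takes some of the edge off the easy-to-confuse short form.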

On the other hand, running hardware components that are too new in old servers doesn’t always work as intended. Note to the future me: Any meaningful expansion of disk space will require newer server hardware than the N54L-based MicroServer.