Replacing ZFS system drives in Proxmox

Running Proxmox in a root-on-ZFS configuration on a RAID10 pool results in an interesting artifact: we need a boot volume from which to start the system and load the pieces required to recognize a ZFS pool. In effect, each drive in the first mirror pair in our disk set will have (at least) two partitions: a small boot loader partition for GRUB, and a second partition that participates in the ZFS pool.

To see how it all works together, I tried failing a drive and replacing it with a different one.

The happy case

Had the drives had identical sector sizes, the operation would have been simple. In this case, sdb is the good mirror volume and sda is the new, empty drive. We want to copy the working partition table from the good drive to the new one, and then randomize the GUIDs on the new drive to avoid catastrophic confusion on the part of ZFS:

# sgdisk /dev/sdb -R /dev/sda
# sgdisk -G /dev/sda
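
The direction of the copy is easy to get backwards: the bare device argument is the source, and the drive named by -R is the destination that gets overwritten. The long options make this marginally clearer (the same two operations as above):

# sgdisk --replicate=/dev/sda /dev/sdb   # copy sdb's table onto sda
# sgdisk --randomize-guids /dev/sda      # fresh disk and partition GUIDs for sda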

After that, we should be able to use gdisk to view the partition table, identify which partition does what, and simply copy the contents of the non-ZFS partitions from the good mirror to the new drive:

# gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): p
Disk /dev/sda: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 8-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34            2047   1007.0 KiB  EF02  
   2            2048      5860516749   2.7 TiB     BF01  zfs
   9      5860516750      5860533134   8.0 MiB     BF07  

Command (? for help): q
# dd if=/dev/sdb1 of=/dev/sda1
# dd if=/dev/sdb9 of=/dev/sda9
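
dd defaults to 512-byte blocks and stays silent while it works; a larger block size and a progress report make the copy faster and friendlier (status=progress assumes GNU coreutils dd):

# dd if=/dev/sdb1 of=/dev/sda1 bs=1M status=progress
# dd if=/dev/sdb9 of=/dev/sda9 bs=1M status=progress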

Then we would add the new disk to our ZFS pool and have it resilvered:

# zpool replace rpool /dev/sda2
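
With a single device argument, zpool replace tells ZFS that the disk at that location has been physically swapped and should be rebuilt from the surviving mirror. Had the pool been built with stable /dev/disk/by-id paths rather than sdX names (generally a good idea, since sdX names can shift between boots), the command would look the same; the device name below is made up for illustration:

# zpool replace rpool /dev/disk/by-id/ata-EXAMPLE_MODEL_SERIAL-part2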

To view the resilvering process:

# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep  1 18:48:13 2018
	2.43T scanned out of 2.55T at 170M/s, 0h13m to go
	1.22T resilvered, 94.99% done
config:

	NAME             STATE     READ WRITE CKSUM
	rpool            DEGRADED     0     0     0
	  mirror-0       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      old        UNAVAIL      0    63     0  corrupted data
	      sda2       ONLINE       0     0     0  (resilvering)
	    sdb2         ONLINE       0     0     0
	  mirror-1       ONLINE       0     0     0
	    sdc          ONLINE       0     0     0
	    sdd          ONLINE       0     0     0
	logs
	  sde1           ONLINE       0     0     0
	cache
	  sde2           ONLINE       0     0     0

errors: No known data errors
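
The status output is a snapshot; to keep an eye on the resilver without retyping the command, wrapping it in watch works nicely:

# watch -n 60 zpool status -v rpool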

The process is time-consuming on large drives, but since ZFS understands both the underlying disk layout and the filesystem on top of it, resilvering only touches blocks that are actually in use. Depending on how full the filesystem is, this can save us a lot of time.
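
This also explains the numbers in the status output above: the roughly 2.55T being scanned corresponds to the pool's allocated data, not the combined raw capacity of the disks. zpool list shows that figure in its ALLOC column:

# zpool list rpool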

When resilvering is done, we’ll just make sure there’s something to boot from on the new drive:

# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.
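
As a sanity check that the boot loader environment can actually read the root dataset, grub-probe (shipped with GRUB) reports the filesystem it detects for a given path; on a working root-on-ZFS system it should answer zfs:

# grub-probe /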

Real life intervenes

Unfortunately for me, the new drive I tried had the modern 4 KiB sector size (“Advanced Format” / 4Kn), while my old drives were stuck with the older 512 B standard. This led to the interesting side effect that my new drive was too small to fit the volumes described by the healthy mirror drive's partition table:

# sgdisk /dev/sdb -R /dev/sda
Caution! Secondary header was placed beyond the disk's limits! Moving the header, but other problems may occur!
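
A quick back-of-the-envelope calculation shows the problem: GPT stores partition boundaries as absolute sector numbers, and the 5860533168 sectors of the 512 B drive correspond to only an eighth as many 4 KiB sectors, so an end sector like 5860533134 points far beyond the end of a 4Kn drive of the same capacity:

# echo $((5860533168 / 8))
732566646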

What I ended up doing was using gdisk to create a new partition table by hand, with sizes for partitions 1 and 9 as similar as possible to those on the healthy mirror (but not smaller!), skipping the sgdisk steps entirely. The rest of the steps were identical.
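
For reference, roughly the same layout could probably be built non-interactively with sgdisk, using relative sizes so the sector arithmetic adapts to the 4 KiB sectors; this is an untested sketch with type codes taken from the gdisk listing above:

# sgdisk -n 1:0:+1M -t 1:EF02 /dev/sda            # small boot loader partition
# sgdisk -n 2:0:-8M -t 2:BF01 -c 2:zfs /dev/sda   # ZFS partition, leaving 8 MiB at the end
# sgdisk -n 9:0:0 -t 9:BF07 /dev/sda              # reserved partition in the remaining space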

The next problem I encountered was a bit worse: Even though ZFS in the Proxmox VE installation managed 4Kn drives just fine, there was simply no way to get the HP MicroServer Gen7 host to boot from one, so back to the old 3 TB WD RED I went.

Conclusion

Running root-on-ZFS in a striped-mirrors (“RAID10”) configuration slightly complicates replacing any of the drives in the first mirror pair, compared to a setup where the ZFS pool is used for storage only.

Fortunately the difference is minimal, and except for the truly dangerous syntax and unclear documentation of the sgdisk command, replacing a boot disk really boils down to four steps:

  1. Make sure the relevant partitions exist.
  2. Copy non-ZFS-data from the healthy drive to the new one.
  3. Resilver the ZFS volume.
  4. Install GRUB.

For a pure data disk, the only step we have to think about is step 3.
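
With matching sector sizes and the device names used above, the whole procedure fits in a handful of commands:

# sgdisk /dev/sdb -R /dev/sda && sgdisk -G /dev/sda   # 1. partition table
# dd if=/dev/sdb1 of=/dev/sda1 bs=1M status=progress  # 2. non-ZFS data
# dd if=/dev/sdb9 of=/dev/sda9 bs=1M status=progress
# zpool replace rpool /dev/sda2                       # 3. resilver
# grub-install /dev/sda                               # 4. boot loader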

On the other hand, running hardware components that are too new in old servers doesn’t always work as intended. Note to future me: any meaningful expansion of disk space will require newer server hardware than the N54L-based MicroServer.