ZFS backups in Proxmox – Part 2

A while ago I wrote about trying out pve-zsync for backing up some Proxmox VE entities. I kept using regular Proxmox backups for the other machines, though: they are a robust way to get recoverable machine backups, but not a very elegant one. For example, all backups are full; there’s no logic for managing incremental or differential backups. The last straw was a bug in the Proxmox web interface where these native full backups kept landing on my SSD-backed disk pool, which is stupid for two reasons: a) it gave me no on-site protection from disk failures, which after user error are the most likely reason to need a backup, and b) it used up valuable space on my most expensive pool. Needless to say, I scrapped that backup solution (and pve-zsync) completely.

My new solution is based entirely on Jim Salter’s excellent tools sanoid and syncoid. Sanoid now gives me hourly ZFS snapshots of all of my virtual machines and containers and of my base system, with timely purging of old snapshots. On my production server, syncoid makes sure these snapshots are cloned to my backup pool, and on my off-site server, syncoid fetches snapshots from the backup pool on the production server to its own backup pool. This means I have a better, cleaner, faster and most importantly working backup solution with considerably less clutter than before: A config file for sanoid and a few cron jobs to trigger syncoid in the right way.
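
For reference, here is a minimal sketch of what that looks like. The dataset names, retention numbers and paths are illustrative rather than copied from my actual setup:

# /etc/sanoid/sanoid.conf: snapshot everything under the VM data pool
[rpool/data]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes

# Cron entries (/etc/cron.d style): replicate to the local backup pool on the
# production server, and pull from the production server on the off-site box
0 * * * *  root  /usr/sbin/syncoid --recursive rpool/data backuppool/backups
30 * * * * root  /usr/sbin/syncoid --recursive root@prod.example.com:backuppool/backups backuppool/offsite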

Troubleshooting vSphere update woes

It’s 2020 and I still occasionally stumble on products that can’t handle international characters.

I’ve been running my update rounds on our vSphere environment, but one host simply refused to perform its update compliance check.

To troubleshoot, I enabled the ssh service and remoted in to the host, looking for errors in /var/log/vua.log. Sure enough, I found an interesting error message:

--> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 33643: ordinal not in range(128)

The byte 0xc3 smells a lot like a UTF-8-encoded Swedish or Norwegian character (å, ä and ö all start with it), so I grep’d the output of esxcfg-info until I found the culprit:

esxcfg-info | grep å
               |----Name............................................Virtual Lab Tången vSAN
                     |----Portset Name..............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                     |----Virtual Switch............................Virtual Lab Tången vSAN
                  |----Name.........................................Virtual Lab Tången vSAN
                        |----Portset Name...........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
                        |----Virtual Switch.........................Virtual Lab Tången vSAN
            |----World Command Line.................................grep å

A vLab I created for a couple of my Veeam SureBackup jobs had a Nordic character in its name, and that was enough to block updates. After removing all traces of the virtual lab and the Standard Switch it had created on the host, the same command showed no characters outside the limited ASCII set, and updating the host went as smoothly as it usually does.

Lesson learned: Client-side issues with localization may have mostly been solved for a decade or two, but server-side there are still reasons – not good ones, but reasons – to stick to plain English descriptors for everything.

Enabling the booking of Teams meetings in Outlook on Mac

This issue had me scratching my head for a while: With the latest version of Microsoft Office and Microsoft Teams installed on my Mac running Catalina, I couldn’t enable the booking of Teams meetings from Outlook.

The solution turned out to be to remove the regular Office programs and replace them with Office 365. The official instructions for how to do that said to log on to https://www.office.com or to https://aka.ms/office-install. Well, tough luck: There was no way to find a download link there.

Instead the correct way seems to be to download Microsoft 365 from the App Store. There was no obvious way to connect the Office suite to my work account, so I started Outlook and tried adding an account. This triggered a dialog offering to activate a trial or connect to an existing subscription, with the perhaps ill-chosen options Activate and Cancel. It turns out that if you press Activate, you get to choose whether you actually want to start the trial or activate Microsoft 365 with an existing account.

While the gods of good UX and the Law of Least Astonishment cry alone in a cave, I now do have a button to schedule a Teams meeting in Outlook. If only I could get the Calendar and Datadog apps installed in Teams, my life would be complete…

Oh, and speaking of great user experience: Incoming calls in Teams on the Mac do not quite steal focus – thanks for that, at least – but they hog cmd+shift+D so that attempting to send a mail from Mail.app will decline the incoming call. That’s not a great design choice, Microsoft. Now why would anybody want to use Mail.app instead of Outlook? Simple: Snappiness and good search. I can accept jumping through some hoops for things I rarely do, if my day-to-day tasks aren’t nerfed by software that feels slow and bloated.

Trusting Palo Alto GlobalProtect to use a macOS machine certificate

On a managed Mac with a machine certificate, when the certificate is renewed, Palo Alto GlobalProtect will prompt for administrative credentials before connecting. This is because the executable isn’t allowed to directly read from the System keychain.

There’s a nice explanation and fix described on Palo Alto’s site, but in case that one goes missing, here’s the workaround:


  1. Open the Keychain Access application and locate the machine certificate issued to the Mac OS X client in the System keychain.
  2. Right-click the private key associated with the certificate, click Get Info, and go to the Access Control tab.
  3. Click ‘+’ to add an application to the allow list.
  4. Press Shift + Cmd + G to open Go to Folder.
  5. Enter ‘/Applications/GlobalProtect.app/Contents/Resources’ and click Go.
  6. Find PanGPS, click it, and press Add.
  7. Save the changes to the private key.


Creating a working Ubuntu 20.04 VMware Image

A while back I was a bit frustrated at Ubuntu for their defaulting to Cloud-Init in the server edition of Ubuntu 18.04. Well I’m right there again, but now with Ubuntu 20.04.

First of all, Cloud-Init is back, and it’s no more useful to me now than it was last time. My process for working around it is based on the tips in VMware’s KB54986:

# Remove cloud-init and its leftover configuration
sudo apt purge cloud-init && sudo apt autoremove
sudo rm -rf /etc/cloud
# Comment out the "D /tmp" line so tmpfiles.d no longer clears /tmp at boot
sudo sed -i -e 's&D /tmp&#D /tmp&g' /usr/lib/tmpfiles.d/tmp.conf

As I read the KB, the last piece is making sure open-vm-tools starts after dbus, by adding the following under the [Unit] section of the open-vm-tools service unit:

[Unit]
(...)
After=dbus.service

Ubuntu 20.04 also retains the idiotic habit of not presenting the NIC’s MAC address as the client identifier in DHCP requests, which necessitates a change to /etc/netplan/00-installer-config.yaml:

network:
  ethernets:
    ens192:
      dhcp4: true
      dhcp-identifier: mac
  version: 2
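
After saving the file, the change can be applied without a reboot:

sudo netplan apply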

The final piece of the puzzle is to upgrade vCenter to at least version 6.7U3g, since that’s the first one that supports guest customization of Ubuntu 20.04 machines.

Another case of “Who watches the watchers”

This text was updated to reflect the current status of the story on 2020-05-27.

Not a good look for Trend Micro: Security researcher Bill Demirkapi took apart their Rootkit Buster software and described his findings in a long, technical article.

The main findings:

  1. The program installs a driver which is designed to subvert Microsoft’s quality control process.
  2. It contains security holes which a piece of malware could potentially piggy-back off of to establish control of a victim’s computer.
  3. Parts of the software are unnecessarily bloated, needlessly adding to the performance overhead many people associate with anti-malware software.

Point 3 may look trivial, but power users already complain about the performance impact of anti-malware suites in day-to-day computer usage. Getting this kind of confirmation that parts of these programs aren’t built to the highest possible standard to keep such impact as small as possible is not going to increase goodwill among those who want their computers to perform optimally and have the added security that third-party anti-malware suites promise.

But the most damning points are, of course, the first and second ones. It’s not acceptable for a security suite to contain insecure code. It’s outright disrespectful to everyone involved – Trend’s customers and Microsoft as authors of the operating system on which their product runs – to have a badly written and/or misbehaving piece of software actively try to behave better when it’s aware of being scrutinized. This is exactly the kind of behavior from which I’d expect Rootkit Buster to protect its users.

I hope we hear more about how this turns out in the future: Trend Micro has an opportunity to make something good out of this but their initial hurried reaction could have been better.

Update: Microsoft has effectively killed off the driver in question. Trend Micro still claims that they weren’t trying to circumvent Microsoft’s QA process, which resurfaces the question of how they could accidentally write code that actively checks whether it is being tested and misbehaves only if it isn’t.

Deploying VMware virtual machines using Ansible

I’ve been experimenting with deploying entire environments using Ansible. As usual I had to clear a couple of small hurdles and stumble into a couple of pitfalls before I was comfortable with the solution, so I’m documenting the process here.

I’m thinking of creating a separate post describing my general Ansible workflow in more detail for anybody who wants to know, but this post will cover how I’ve set up management of my vSphere environment from Ansible.

Boring prerequisites

First of all, we should set up a user with the necessary rights in the vCenter. The Ansible crew has a good list of the requirements, reiterated here:

Datastore.AllocateSpace on the destination datastore or datastore folder

Network.Assign on the network to which the virtual machine will be assigned

Resource.AssignVMToPool on the destination host, cluster, or resource pool

VirtualMachine.Config.AddNewDisk on the datacenter or virtual machine folder

VirtualMachine.Config.AddRemoveDevice on the datacenter or virtual machine folder

VirtualMachine.Interact.PowerOn on the datacenter or virtual machine folder

VirtualMachine.Inventory.CreateFromExisting on the datacenter or virtual machine folder

VirtualMachine.Provisioning.Clone on the virtual machine you are cloning

VirtualMachine.Provisioning.Customize on the virtual machine or virtual machine folder if you are customizing the guest operating system

VirtualMachine.Provisioning.DeployTemplate on the template you are using

VirtualMachine.Provisioning.ReadCustSpecs on the root vCenter Server if you are customizing the guest operating system

I also added the VirtualMachine.Config.CPUCount, VirtualMachine.Config.Memory, VirtualMachine.Config.EditDevice, and VirtualMachine.Interact.DeviceConnection rights while I was at it.

These rights were added to a VMware role, which I then assigned to my domain user MYDOMAIN\ansible on the entire vCenter server, propagating to children.

Unfortunately this wasn’t enough to actually deploy VMs from templates: The ansible user needs to be allowed to write to VM folders or Ansible will barf with a permission-related error message. I solved this by creating the VM folder MyProject/WebServers and giving the MYDOMAIN\ansible user Administrator rights in this specific folder.

For Ansible – or rather Python – to communicate with my vCenter server, I had to ensure the necessary modules were installed. I use pip to keep my Ansible-related Python packages reasonably current, so I installed them like this:

pip3 install requests PyVmomi
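
A quick sanity check that the modules are importable by the same Python interpreter Ansible uses (assuming python3 here):

python3 -c "import requests, pyVmomi; print('modules OK')"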

Setting up the Ansible environment

The following two lines set up the skeleton directory structure I like to use:

mkdir -p myproject/{roles,inventories/test/{group_vars,host_vars/localhost}} && cd myproject
ansible-galaxy init roles/vm-deployment --offline
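
For reference, ansible-galaxy init lays down a standard role skeleton. Trimmed to the parts that matter for this post, the resulting tree looks roughly like this:

myproject/
├── inventories/
│   └── test/
│       ├── group_vars/
│       └── host_vars/
│           └── localhost/
└── roles/
    └── vm-deployment/
        ├── defaults/
        ├── handlers/
        ├── tasks/
        │   └── main.yml
        └── vars/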

To clarify: The test subdirectory name has to do with the environment’s purpose, as in Dev, Test, Staging, Prod, rather than this being an experimental environment.

Inventories

A basic inventory file for Ansible may look like this:

---
all:
  children:
    webservers:
      hosts:
        websrvtest1:
        websrvtest2:
        websrvtestn:

The all group may contain an arbitrary number of hosts and child groups, which in turn may contain an arbitrary number of their own hosts or children. It’s also possible to put group and host variables straight into the inventory file, but I prefer to keep them separated. Note how every host and group line ends with a colon (:). That’s on purpose, and things break if it’s missing.

Variables

Variables are key to reusable playbooks. Let’s set some up for this task:

vcenter: "vcenter.mydomain.tld"
vc_user: ansible
vc_pass: "{{ vault_vc_pass }}"
vc_datacenter: MyDatacenter
vc_cluster: VSANclstr
vm_template: w2019coretmpl
vm_folder: /MyProject/Test/WebServers
vm_network: vxw-dvs-161618-virtualwire-14-sid-5013-MyProject-Test
vm_datastore: vsanDatastore
vm_customization_spec: Win_Domain_member_DHCP
deploylist:
- cpmwebsrvtest1
- cpmwebsrvtest2
- cpmwebsrvtestn

Vaults

Note the "{{ vault_vc_pass }}" variable: I’m telling Ansible to look up the variable contents from another variable. In this case it’s a hint to me that the contents are encrypted in an Ansible vault. This way I don’t have to worry too much about someone getting hold of my private git repo: if they do, I figure I have some time to change my secrets. I’m storing the vault in the same directories as my variable files, and a vault is initiated like this:

ansible-vault create inventories/test/host_vars/localhost/vault

I generate and store the vault passphrases in a password manager to simplify collaboration with my teams.

The vault file follows the same form as the vars one, but is encrypted on disk:

vault_vc_pass: password
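
Editing or inspecting the secret later is done with the corresponding ansible-vault subcommands:

ansible-vault edit inventories/test/host_vars/localhost/vault
ansible-vault view inventories/test/host_vars/localhost/vault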

Ansible tasks

The next step is to create the task that actually performs the magic here. In this case there’s a single task that’s looped over however many machines (item) are in my deploylist. There’s a lot more that can be customized with the vmware_guest Ansible module, but in this case my needs are simple: my vCenter customization specification does most of the job.

One thing to look for is the wait_for_customization parameter. This makes sure that Ansible doesn’t proceed to the next task until VMware has finished customizing the VM – in my case renaming the computer and joining it to a domain.

---
- name: Clone template
  vmware_guest:
    validate_certs: False
    hostname: "{{ vcenter }}"
    username: "{{ vc_user }}"
    password: "{{ vc_pass }}"
    datacenter: "{{ vc_datacenter }}"
    cluster: "{{ vc_cluster }}"
    folder: "{{ vm_folder }}"
    template: "{{ vm_template }}" 
    name: "{{ item }}"
    hardware:
      memory_mb: 6144
      num_cpus: 2
      num_cpu_cores_per_socket: 2
    networks:
    - name: "{{ vm_network }}"
    customization_spec: "{{ vm_customization_spec }}"
    wait_for_customization: yes
  with_items: "{{ deploylist }}"

Next we tell the role to include our task file. This is slightly overkill for a role with just one actual task, but it’s nice to build a habit of keeping things tidy.

---
- include: deploy-vm.yml
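
To be explicit about where these two files live, given the ansible-galaxy skeleton from earlier and my own file naming:

roles/vm-deployment/tasks/main.yml        # the include above
roles/vm-deployment/tasks/deploy-vm.yml   # the Clone template task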

Getting it all to run

Finally it’s time to create a master playbook to trigger the role (and potentially others):

---
- hosts: localhost 
  any_errors_fatal: true

  roles:
  - vm-deployment

To execute it all, we’ll use the ansible-playbook command:

ansible-playbook deploy-webserver.yml -i inventories/test --ask-vault-pass

After responding with the appropriate vault passphrase, Ansible goes to work, and in a couple of minutes a brand new virtual machine is ready to take on new roles.
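
For unattended runs, the interactive prompt can be replaced with a vault password file; the file name below is just a placeholder, and the file itself obviously needs to be protected at least as well as the vault:

ansible-playbook deploy-webserver.yml -i inventories/test --vault-password-file ~/.vault_pass.txt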

ZFS backups in Proxmox

I’ve been experimenting with using ZFS snapshots for on- and off-site backups of my Proxmox virtualization environment. For now I’m leaning towards using pve-zsync for backing up my bigger but non-critical machines, and then using syncoid to achieve incremental pull backups off-site. After the initial seed – which I perform over a LAN link – only block-level changes need to be transferred, which a regular home connection with a symmetric 100 Mbps link should be more than capable of handling.
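
As a sketch, the off-site pull boils down to one syncoid invocation per dataset; the host and dataset names below are made up. The --no-sync-snap flag keeps syncoid from creating its own snapshots and makes it replicate the ones pve-zsync already took:

syncoid --no-sync-snap root@prod.example.com:backuppool/zsync/vm-105-disk-0 backuppool/offsite/vm-105-disk-0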

One limitation in pve-zsync I stumbled upon is that it will trip itself up if a VM has multiple disks stored on different ZFS pools. One of my machines was configured to have its EFI volume and root filesystem on SSD storage, while the bulk data drive was stored on a mechanical disk. This didn’t work at all, with an error message that wasn’t exactly crystal clear:

# pve-zsync create -source 105 -dest backuppool/zsync -name timemachinedailysync -maxsnap 14
Job --source 105 --name timemachinedailysync got an ERROR!!!
ERROR Message:
COMMAND:
	zfs send -- datapool/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01 | zfs recv -F -- backuppool/zsync/vm-105-disk-0
GET ERROR:
	cannot receive new filesystem stream: destination has snapshots (eg. backuppool/zsync/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01)
must destroy them to overwrite it

Of course removing the snapshots in question didn’t help at all – but moving all disk images belonging to the machine to a single ZFS pool solved the issue immediately.

The other problem is that while this program is VM-aware when backing up, it only performs ZFS snapshots of the actual dataset(s) backing the drive(s) of a VM or container – it doesn’t by itself back up the machine configuration. This means a potentially excellent recovery point objective (RPO), but the recovery time objective (RTO) will suffer as a result: a critical service won’t get back online until someone creates an appropriate machine and connects the backed-up drives.
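
To make the RTO point concrete, a manual restore of a single disk would look roughly like this (VM ID, snapshot name and pool names made up), after which the VM definition itself still has to be recreated by hand:

# Send the backed-up disk image back to the production pool
zfs send backuppool/zsync/vm-105-disk-0@rep_timemachinedailysync_2020-04-05_11:32:01 \
  | zfs recv -F datapool/vm-105-disk-0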

I will be experimenting with variations of the tools available to me, to see if I can simplify the restore process somewhat.

Moving Proxmox /boot to USB stick

Some short notes I made along the way to benefit the future me.

Background

On my new server, Proxmox was unable to boot directly to a ZFS file system on a drive connected via the HBA controller. UPDATE (2020-01-27): The SuperMicro X10SRH-CLN4F motherboard boots just fine from a root-on-ZFS disk in UEFI mode from the built-in SAS HBA. The only required change is the last step in the description below: adding a delay before attempting to mount ZFS volumes at boot time.

There is a potential drawback to installing Proxmox in root-on-ZFS mode in a UEFI system: The drive gets partitioned, so ZFS doesn’t get uninhibited access to the entire block storage. This may or may not make a difference for performance, but in terms of speed on an SSD solution, I haven’t really seen any cause for concern for my real-world use case. An alternative would be to install the underlying operating system to a separate physical drive.

Also note that the workaround below puts /boot on a single vFAT volume. Since FAT doesn’t support symlinks, kernel or initramfs updates in Proxmox/Debian will require some manual work, which most sane people would likely wish to avoid.

I’m leaving the rest of my article intact for posterity:


My workaround was to place /boot – not the system – on a USB stick connected directly to the motherboard.

Process

After installation, reboot with the Proxmox installation medium, but select Install Proxmox VE (Debug mode).

When the first shell appears, press Ctrl+D to have the system load the necessary drivers.

Check the name of the USB drive.

lsblk

Partition it.

cfdisk /dev/sdb

Clear the disk, create an EFI System partition and write the changes. Then create a FAT filesystem on the new partition:

mkfs.vfat /dev/sdb1

Prepare to chroot into the installed Proxmox instance

mkdir /media/rescue
zpool import -fR /media/rescue rpool
mount -o bind /dev /media/rescue/dev
mount -o bind /sys /media/rescue/sys
mount -o bind /proc /media/rescue/proc
chroot /media/rescue

Make room for the new /boot

mv /boot /boot.bak

Edit /etc/fstab and add the following:

/dev/sdb1 /boot vfat defaults 0 0

Make the stick bootable

mount -a
grub-install --efi-directory=/boot/efi /dev/sdb
update-grub
grub-mkconfig -o /boot/grub/grub.cfg

Exit the chroot, export the ZFS pool (zpool export rpool) and reboot.

In my specific case I had a problem where I got stuck in a shell with the ZFS pool not mountable.

/sbin/zpool import -Nf rpool

Exit to continue the boot process. Then edit /etc/default/zfs and add a delay before attempting to mount the root file system:

ZFS_INITRD_PRE_MOUNTROOT_SLEEP=15

Then apply the new configuration:

update-initramfs -u

Head: Meet Wall.

I spent way more time than I’m comfortable disclosing troubleshooting an issue with an AD-joined Oracle Linux server that wouldn’t accept ssh logons from domain users.

We use the recommended sssd and realmd to ensure AD membership. Everything looked good, and I could log on using an account that’s a member of the Domain Admins group, and so I released the machine to our developers for further work.

Only they couldn’t log on.

After spending most of the morning looking through my logs and config files, and detaching and re-attaching the server to the domain after tweaking various settings, I suddenly saw the light.

Note to my future self:

Windows still runs NetBIOS under the hood, and NetBIOS computer names are capped at 15 characters. Any longer machine name on a domain-joined computer will cause trouble!

Naturally, after setting a more Windows-like hostname and re-joining the domain, everything worked as I expected.
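
A cheap guard against repeating the mistake is checking the hostname length before joining the domain; the name below is just an example:

hostnamectl set-hostname websrv-prod-01   # 14 characters, within the NetBIOS limit
hostname -s | awk '{ print length }'      # anything above 15 spells trouble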