Deploying VMware virtual machines using Ansible

I’ve been experimenting with deploying entire environments using Ansible. As usual I had to pass a couple of small thresholds and stumble into a couple of pitfalls before I was comfortable with the solution, so I’m documenting the process here.

I’m thinking of creating a separate post describing my general Ansible workflow in more detail for anybody who wants to know, but this post will cover how I’ve set up management of my vSphere environment from Ansible.

Boring prerequisites

First of all, we should set up a user with the necessary rights in the vCenter. The Ansible crew has a good list of the requirements, reiterated here:

  • Datastore.AllocateSpace on the destination datastore or datastore folder
  • Network.Assign on the network to which the virtual machine will be assigned
  • Resource.AssignVMToPool on the destination host, cluster, or resource pool
  • VirtualMachine.Config.AddNewDisk on the datacenter or virtual machine folder
  • VirtualMachine.Config.AddRemoveDevice on the datacenter or virtual machine folder
  • VirtualMachine.Interact.PowerOn on the datacenter or virtual machine folder
  • VirtualMachine.Inventory.CreateFromExisting on the datacenter or virtual machine folder
  • VirtualMachine.Provisioning.Clone on the virtual machine you are cloning
  • VirtualMachine.Provisioning.Customize on the virtual machine or virtual machine folder if you are customizing the guest operating system
  • VirtualMachine.Provisioning.DeployTemplate on the template you are using
  • VirtualMachine.Provisioning.ReadCustSpecs on the root vCenter Server if you are customizing the guest operating system

I also added the VirtualMachine.Config.CPUCount, VirtualMachine.Config.Memory, VirtualMachine.Config.EditDevice, and VirtualMachine.Interact.DeviceConnection rights while I was at it.

These rights were added to a VMware Role. I then assigned this role to my domain user MYDOMAIN\ansible for the entire vCenter server with children.

Unfortunately this wasn’t enough to actually deploy VMs from templates: The ansible user needs to be allowed to write to VM folders or Ansible will barf with a permission-related error message. I solved this by creating the VM folder MyProject/WebServers and giving the MYDOMAIN\ansible user Administrator rights in this specific folder.

For Ansible – or rather Python – to communicate with my vCenter server, I had to ensure the necessary modules were installed. I use pip to keep a recent version of Ansible and its dependencies, and so I issued the relevant command:

pip3 install requests PyVmomi

Setting up the Ansible environment

The following two lines set up the skeleton directory structure I like to use:

mkdir -p myproject/{roles,inventories/test/{group_vars,host_vars/localhost}} && cd myproject
ansible-galaxy init roles/vm-deployment --offline
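
For reference, the resulting layout looks roughly like this (the role subdirectories are whatever your ansible-galaxy version generates; this is what current versions create):

myproject/
├── inventories/
│   └── test/
│       ├── group_vars/
│       └── host_vars/
│           └── localhost/
└── roles/
    └── vm-deployment/
        ├── README.md
        ├── defaults/
        ├── files/
        ├── handlers/
        ├── meta/
        ├── tasks/
        ├── templates/
        ├── tests/
        └── vars/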

To clarify: The test subdirectory name has to do with the environment’s purpose, as in Dev, Test, Staging, Prod, rather than this being an experimental environment.

Inventories

A basic inventory file for Ansible may look like this:

---
all:
  children:
    webservers:
      hosts:
        websrvtest1:
        websrvtest2:
        websrvtestn:

The all group may contain an arbitrary number of hosts and child groups, which in turn may contain an arbitrary number of their own hosts or children. It’s also possible to put group and host variables straight into the inventory file, but I prefer to keep them separated. Note how every line ends with a colon (:). That’s intentional; the inventory won’t load correctly if the colons are missing.

Variables

Variables are key to reusable playbooks. Let’s set some up for this task:

vcenter: "vcenter.mydomain.tld"
vc_user: ansible
vc_pass: "{{ vault_vc_pass }}"
vc_datacenter: MyDatacenter
vc_cluster: VSANclstr
vm_template: w2019coretmpl
vm_folder: /MyProject/Test/WebServers
vm_network: vxw-dvs-161618-virtualwire-14-sid-5013-MyProject-Test
vm_datastore: vsanDatastore
vm_customization_spec: Win_Domain_member_DHCP
deploylist:
- cpmwebsrvtest1
- cpmwebsrvtest2
- cpmwebsrvtestn

Vaults

Note the "{{ vault_vc_pass }}" variable: I’m telling Ansible to look up the variable contents from some other variable. In this case it’s a hint to me that the contents are encrypted in an ansible vault. This way I don’t have to worry a lot that someone would get a hold of my private git repo: If they do I figure I have some time to change my secrets. I’m storing the vault in the same directories where I store my variable files, and a vault is intiated like this:

ansible-vault create inventories/test/host_vars/localhost/vault

I generate and store the vault passphrases in a password manager to simplify collaboration with my teams.

The vault file follows the same form as the vars one, but is encrypted on disk:

vault_vc_pass: password
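
For reference, the encrypted file as stored on disk contains nothing but a vault header and hex-encoded ciphertext, roughly like this (ciphertext replaced with a placeholder):

$ANSIBLE_VAULT;1.1;AES256
<hex-encoded ciphertext>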

Ansible tasks

The next step is to create the task file that actually performs the magic here. In this case there’s a single task that loops over however many machines (item) are in my deploylist. There’s a lot more that can be customized with the vmware_guest Ansible module, but in this case my needs are simple: my vCenter customization specification does most of the job.

One thing to look for is the wait_for_customization parameter. This makes sure that Ansible doesn’t proceed to the next task until VMware has finished customizing the VM – in my case renaming the computer and joining it to a domain.

---
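# roles/vm-deployment/tasks/deploy-vm.yml (path follows the role skeleton created above)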
- name: Clone template
  vmware_guest:
    validate_certs: False
    hostname: "{{ vcenter }}"
    username: "{{ vc_user }}"
    password: "{{ vc_pass }}"
    datacenter: "{{ vc_datacenter }}"
    cluster: "{{ vc_cluster }}"
    folder: "{{ vm_folder }}"
    template: "{{ vm_template }}" 
    name: "{{ item }}"
    hardware:
      memory_mb: 6144
      num_cpus: 2
      num_cpu_cores_per_socket: 2
    networks:
    - name: "{{ vm_network }}"
    customization_spec: "{{ vm_customization_spec }}"
    wait_for_customization: yes
  with_items: "{{ deploylist }}"

Next we tell the role’s main task file to include our new task file. This is slightly overkill for a role with just one actual task, but it’s nice to build a habit of keeping things tidy.

---
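# roles/vm-deployment/tasks/main.yml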
- include: deploy-vm.yml

Getting it all to run

Finally it’s time to create a master playbook to trigger the role (and potentially others):

---
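# deploy-webserver.yml (the playbook name used in the ansible-playbook command below)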
- hosts: localhost 
  any_errors_fatal: true

  roles:
  - vm-deployment

To execute it all, we’ll use the ansible-playbook command:

ansible-playbook deploy-webserver.yml -i inventories/test --ask-vault-pass

After responding with the appropriate vault passphrase, Ansible goes to work, and in a couple of minutes a brand new virtual machine is ready to take on new roles.

Managing Windows servers with Ansible

Although I get to play with the fun stuff at work to a large degree, much of our environment still consists of Windows servers, and that will not change for a long time. As I’ve mentioned in earlier posts, I try to script my way around individual Windows servers using PowerShell whenever it makes sense, but when a set of changes needs to be performed across groups of servers – especially if it’s something recurring – my tool of choice really is Ansible.

The Ansible management server (which has to be running a Unix-like system) needs to be able to communicate securely with the Windows hosts. WinRM, which is the framework used under the hood, allows for a number of protocols for user authentication and transfer of commands. I personally like to have my communications TLS secured, and so I’ve opted for using CredSSP which defaults to an HTTPS-based communications channel.

A huge gotcha: I tried running the tasks below from an Ubuntu 16.04 LTS server, and there was nothing I could do to get the Python 2.7-dependent Ansible version to correctly verify a TLS certificate from our internal CA. When I switched to running Ansible through Python 3, the exact same config worked flawlessly. The original code has been updated to reflect this state of things.

Enable CredSSP WinRM communications in Windows

Our production domain has a local Certificate Authority, which simplifies some operations. All domain members request their computer certificates from this CA, and the resulting certs have subject lines matching their hostname. The following PowerShell script will allow us to utilize the existing certificates to secure WinRM communications, along with enabling the necessary listener and firewall rules.

$hostname=hostname
# Get the thumbprint of the latest valid machine certificate
$cert=Get-ChildItem -Path cert:\LocalMachine\My -Recurse | ? { ($_.Subject -match $hostname) -and ($_.NotAfter -gt (Get-Date)) } | sort { $_.NotAfter } | select -last 1
# Enable Windows Remote Management over CredSSP
Enable-WSManCredSSP -Role Server -Force
# Set up an HTTPS listener with the machine certificate’s thumbprint
New-Item -Path WSMan:\LocalHost\Listener -Transport HTTPS -Address * -CertificateThumbPrint $cert.Thumbprint -Force
# Allow WinRM HTTPS traffic through the firewall
New-NetFirewallRule -DisplayName 'Windows Remote Management (HTTPS-In)' -Name 'Windows Remote Management (HTTPS-In)' -Direction Inbound -Protocol TCP -LocalPort 5986 -RemoteAddress LocalSubnet

Depending on your desired security level you may want to change the RemoteAddress property of the firewall rule to only allow management traffic from a single host or similar. It is a bad idea to allow remote management from untrusted networks!
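
To sanity-check the HTTPS listener afterwards, a quick test from another Windows machine on the allowed subnet could look like this (the hostname is just an example):

Test-WSMan -ComputerName windowsserver.domain.tld -UseSSL

A certificate or listener problem shows up as an error here rather than later in an Ansible run.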

Enable CredSSP WinRM communications from Ansible

To enable Ansible to use CredSSP on an Ubuntu server, we’ll install a couple of packages:

sudo apt install libssl-dev
pip3 install pyOpenSSL
pip3 install pywinrm[credssp]

We then need to ensure that the Ansible server trusts the certificates of any Windows servers:

sudo chown root our-ca.crt
sudo chmod 744 our-ca.crt
sudo mv our-ca.crt /usr/local/share/ca-certificates/
sudo update-ca-certificates

And finally we’ll tell Ansible how to connect to our Windows servers – including where to find the CA-file – by adding the following to the group_vars for the server group:

ansible_user: "username@domain.tld"
ansible_password: "YourExcellentPasswordHere"
ansible_connection: winrm
ansible_port: 5986
ansible_winrm_transport: credssp
ansible_winrm_ca_trust_path: /etc/ssl/certs

Naturally, if we’re storing credentials in a file, it should be protected as an Ansible vault.
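
Assuming the connection variables above live in a group_vars file for the Windows group, encrypting that file in place could look like this (the path is hypothetical and depends on your inventory layout):

ansible-vault encrypt inventories/test/group_vars/windows/vault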

Finally we can try our config out. Note, as mentioned at the beginning of this article, that I had to resort to running Ansible through Python 3 to correctly validate my CA cert. It’s time to get with the times, folks. 🙂

python3 $(which ansible) windowsserver.domain.tld --ask-vault-pass -m win_ping
Vault password: 
windowsserver.domain.tld | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

To ensure that playbooks targeting Windows servers run using Python3, add the following to the Windows server group_vars:

ansible_python_interpreter: /usr/bin/python3  

Happy server management!

File system rights on mounted drives in Windows

As I repeatedly state, the same object-oriented design that makes PowerShell potentially powerful for complex tasks also makes it require ridiculous verbosity on our part to accomplish simple ones. Today’s post is a perfect example.

Consider a volume mounted to an NTFS mountpoint in a directory. Since this is an obvious afterthought in the file system design, setting access rights on the mountpoint directory won’t do you any good if you expect these rights to propagate down through the mounted file system. While the reason may be obvious once you think about the limitations in the design, it certainly breaks the principle of least astonishment. The correct way to set permissions on such a volume is to configure the proper ACL on the partition object itself.

In the legacy Computer Management MMC-based interface, this was simply a matter of right-clicking in the Disk Management module to change the drive properties, and then setting the correct values in the Security tab. In PowerShell, however, this isn’t a simple command, but a script with three main components:

  • Populate an ACL object with the partition object’s current security settings
  • Modify the properties of the ACL object
  • Commit the contents of the ACL object back into the partition object

Here’s how it’s done:

First we need to find the volume identifier. For this we can use get-partition | fl, optionally narrowed down with a where (?) query if we know additional details about the volume. What we’re looking for is something like the following example in the DiskPath property:

\\?\Volume{f0e7b028-8f53-42fa-952b-dc3e01c161d8}
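
If we already know the drive letter, a where filter gets us down to a single partition; for example, assuming the volume is currently mounted as G:

Get-Partition | ? { $_.DriveLetter -eq 'G' } | fl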

Armed with that we can now fill an object with the ACL for our volume:

$acl = [io.directory]::GetAccessControl("\\?\Volume{f0e7b028-8f53-42fa-952b-dc3e01c161d8}\")

We then create a new access control entry (ACE):

$newace = New-Object -TypeName System.Security.AccessControl.FileSystemAccessRule -ArgumentList "DOMAIN\testuser", "ReadAndExecute, Traverse",
 "ContainerInherit, ObjectInherit", "None", "Allow"

We have to supply the arguments in exactly this order because of how the constructor for the access control entry object is defined. There’s really no way of discovering this from within the interactive scripting environment; you just have to have a bunch of patience and read dry documentation, or learn from code snippets found through searching the web.

The next step is to load our new ACE into the ACL object:

$acl.SetAccessRule($newace)

What if we want to remove rights – for example the usually present Everyone entry? In that case we need to find every ACE referencing that user or group in our ACL, and remove it:

$acl.access | ?{$_.IdentityReference.Value -eq "Everyone"} | ForEach-Object { $acl.RemoveAccessRule($_)}

If we’ve done this job interactively, we can take a final look at our ACL to confirm it still looks sane by running $acl | fl.

Finally we’ll commit the ACL into the file system again:

[io.directory]::SetAccessControl("\\?\Volume{f0e7b028-8f53-42fa-952b-dc3e01c161d8}\",$acl)

And there we go: We’ve basically had to write an entire little program to get it done, and the poor inventors of the KISS principle and of the principle of least astonishment are slowly rotating like rotisserie chickens in their graves, but we’ve managed to set permissions on a mounted NTFS volume through PowerShell.

NTFS mount points via PowerShell

As I mentioned in an earlier post, it’s sometimes useful to mount an additional drive in a directory on an existing drive, Unix-style, rather than presenting it with its own traditional Windows-style drive letter.

Here’s how we do it in PowerShell:

If the volume is already mounted to a drive letter, we need to find the disk number and partition number of the letter:

Get-Partition | select DriveLetter, DiskNumber, PartitionNumber | ft

DriveLetter DiskNumber PartitionNumber
----------- ---------- ---------------
                     0               1
          C          0               2
                     1               1
          E          1               2
                     2               1
          F          2               2
                     3               1
          G          3               2

In this example, we see that volume G corresponds to DiskNumber 3, PartitionNumber 2.

Let’s say we want to mount that disk under E:\SharedFiles\Mountpoint. First we need to make sure the directory exists.
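
A one-liner to create it (using the example path; -Force simply suppresses the error if the directory is already there):

New-Item -ItemType Directory -Path 'E:\SharedFiles\Mountpoint' -Force

With the mount point directory in place, we’ll run the following two commands to add the new access path and remove the old drive letter: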

Add-PartitionAccessPath -DiskNumber 3 -PartitionNumber 2 -AccessPath 'E:\SharedFiles\Mountpoint\'
Remove-PartitionAccessPath -DiskNumber 3 -PartitionNumber 2 -AccessPath 'G:\'

Summary

As usual, PowerShell is kind of “wordy”, but we do get things done.

PowerShell for Unix nerds

(This post was inspired by a question on ServerFault)

Windows has had an increasingly useful scripting language since 2006 in PowerShell. Since Microsoft apparently fell in love with backend developers a while back, they’ve even ported the core of it to GNU/Linux and macOS. This is actually a big deal for us who prefer our workstations to run Unix but have Windows servers to manage on a regular basis.

Coming from a background in Unix shell scripting, how do we approach the PowerShell mindset? Theoretically it’s simple to say that Unix shells are string-based while PowerShell is object oriented, but what does that mean in practice? Let me try to present a concrete example to illustrate the difference in philosophy between the two worlds.

We will parse some system logs on an Ubuntu server and on a Windows server respectively to get a feel for each system.

Task 1, Ubuntu

The first task we shall accomplish is to find events that recur between 04:00 and 04:30 every morning.

In Ubuntu, logs are regular text files. Each line clearly consists of predefined fields delimited by space characters. Each line starts with a timestamp with the date followed by the time in hh:mm:ss format. We can find anything that happens during the hour “04” of any day in our retention period with a naïve grep for ” 04:”:

zgrep " 04:" /var/log/syslog*

(Note that I use zgrep to also analyze the archived, rotated log files.)

On a busy server, this particular search results in twice as much data to sift through as we originally wanted. Let’s complement our commands with some simple regular expressions to filter the results:

zgrep " 04:[0-2][0-9]:[0-5][0-9]" /var/log/syslog*

Mission accomplished: We’re seeing all system log events between 04:00:00 and 04:29:59 for each day stored in our log retention period. To clarify the command, each bracket represents one position in our search string and defines the valid characters for this specific position.

Bonus knowledge:
[0-9] can be substituted with \d, which translates into “any digit”, as long as your grep supports Perl-compatible regular expressions (grep -P). I used the longer form here for clarity.

Task 2, Ubuntu

Now let’s identify the process that triggered each event. We’ll look at a line from the output of the last command to get a feeling for how to parse it:

/var/log/syslog.7.gz:Jan 23 04:17:36 lbmail1 haproxy[12916]: xx.xxx.xx.xxx:39922 [23/Jan/2019:04:08:36.405] ft_rest_tls~

This can be translated into a general form:

<filename>:<MMM DD hh:mm:ss> <hostname> <procname[procID]>: <message>

Let’s say we want to filter the output from the previous command and only see the process information and message. Since everything is a string, we’ll pipe grep to a string manipulation command. This particular job looks like a good use case for GNU cut. With this command we need to define a delimiter, which we know is a space character, and then we need to count spaces in our log format to see that the process information is field number 5 and the message starts at field 6. The message part of each line, of course, may contain spaces, so once we reach that field we’ll want to show the entire rest of the line. The required command looks like this:

zgrep " 04:[0-2][0-9]:[0-5][0-9]" /var/log/syslog* | cut -d ' ' -f 5,6-

Now let’s do the same in Windows:

Task 1, Windows

Again our task is to find events between 04:00 and 04:30 on any day. As opposed to our Ubuntu server, Windows treats each line in our log as an object, and each field as a property of that object. This means that we will get no results at best and unpredictable results at worst if we treat our log as a searchable mass of text.
Two examples that won’t work:

Wrong answer 1

get-EventLog -LogName System -After 04:00 -Before 04:30

This looks nice, but it implicitly only gives us log events between the given times this day.

Wrong answer 2

get-EventLog -LogName System | Select-String -Pattern "04:[0-2][0-9]:[0-5][0-9]"

Windows can use regular expressions just fine in this context, so that’s not a problem. What’s wrong here is that we’re searching the actual object instance for the pattern; not the contents of the object’s properties.

Right answer

If we remember that Powershell works with objects rather than plain text, the conclusion is that we should be able to query for properties within each line object. Enter the “where” or “?” command:

Get-EventLog -LogName System | ?{$_.TimeGenerated -match "04:[0-2][0-9]:[0-5][0-9]"}

What did we do here? The first few characters after the pipe can be read as “For each line check whether this line’s property “Time Generated” matches…“.

One of the things we “just have to know” to understand what happened here, is that the column name “Time” in the output of the Get-EventLog command doesn’t represent the actual name of the property. Looking at the output of get-eventlog | fl shows us that there’s one property called TimeWritten, and one property called TimeGenerated. We’re naturally looking for the latter one.

This was it for the first task. Now let’s see how we pick up the process and message information in PowerShell.

Task 2, Windows

By looking at the headers from the previous command, we see that we’re probably interested in the Source and Message columns. Let’s try to extract those:

Get-EventLog -LogName System | ?{$_.TimeGenerated -match "04:[0-2][0-9]:[0-5][0-9]"} | ft Source, Message

The only addition here is that we pipe the matching events to the Format-Table cmdlet (ft) and tell it to show only the contents of the Source and Message properties of each passed object.

Summary

PowerShell is different from traditional Unix shells, and by trying to accomplish a specific task in both we’ve gained some understanding in how they differ:

  • When piping commands together in Unix, we’re sending one command’s string output to be parsed by the next command.
  • When piping cmdlets together in PowerShell, we’re instead sending entire objects with properties and all to the next cmdlet.

Anyone who has tried object oriented programming understands how the latter is potentially powerful, just as anyone who has “gotten” Unix understands how the former is potentially powerful. I would argue that it’s easier for a non-developer to learn Unix than to learn PowerShell, that Unix allows for a more concise syntax than PowerShell, and that Unix shells execute commands faster than PowerShell in many common cases. However I’m glad that there’s actually a useful, first-party scripting language available in Windows.

Getting things done in PowerShell is mainly a matter of adjusting your mindset to work with objects and their properties (whose values may, but need not, be strings) rather than with strings directly.

Workaround for broken connection management in Exchange

For legacy reasons (don’t even ask…) we still have an old NLB-based Exchange 2010 mail server farm, with a CASArray consisting of two servers, in front of a DAG cluster at work.

The interesting thing, of course, is that when one of the CAS’s fails, Outlook clients don’t automatically start using the other CAS as you’d expect in a sane system. And which Outlook clients stopped working seemed to be somewhat arbitrary.

A couple of minutes with my preferred search engine gave me the tools to show what’s wrong:

Get-Mailboxdatabase | ft Identity, RpcClientAccessServer

Identity RpcClientAccessServer
-------- ---------------------
Mailbox DB05 CAS1.tld
Mailbox DB03 CAS2.tld
...

The above example output shows that each database has a preferred CAS, and explains the apparent arbitrariness of clients refusing to connect to the remaining CAS.

The funny thing is that even after an hour and a half, well after NLB Manager had stopped presenting the second CAS in its GUI, Exchange still hadn’t understood that one of the members of the CASArray was down. The workaround is to manually tell each mailbox database to use the healthy CAS:

Set-MailboxDatabase "Mailbox DB03" -RPCClientAccessServer CAS1.tld

Get-Mailboxdatabase | ft Identity, RpcClientAccessServer


Identity RpcClientAccessServer
-------- ---------------------
Mailbox DB05 CAS1.tld
Mailbox DB03 CAS1.tld
...
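
With more than a handful of databases, a one-liner along these lines (using the server names from this example) should repoint everything still aimed at the failed CAS in one go:

Get-MailboxDatabase | ? { $_.RpcClientAccessServer -eq "CAS2.tld" } | Set-MailboxDatabase -RpcClientAccessServer CAS1.tld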

Fortunately it looks as though modern Exchange solutions with real load balancers in front of them don’t experience this issue.

Monitoring mounted Windows volumes using Zabbix

Sometimes it’s nice to mount a separate disk volume inside a directory structure. For a concrete example: At work we have a legacy system that writes copious amounts of data to subfolders of a network share. While vSphere allows for pretty large vdisks, after you pass 8 TB or so, they become cumbersome to manage. By mounting smaller disks directly in this directory structure, each disk can be kept to a manageable size. 

First the bad news: the built-in filesystem discovery rules for the Zabbix Windows agent can only automatically enumerate legacy drive letters, so we get to know the status of the root file system, but not of the respective mounted volumes.

The good news, however, is that it’s a piece of cake to make Zabbix understand what you mean if you manually create data collection items for these subdirectories.

The key syntax in Zabbix 3 looks like this:

vfs.fs.size[G:/topdir/subdir,pfree]

The only thing to remember is that we’re sending forward slashes in our query to the server agent even though we’re running Windows.
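
If zabbix_get is installed on the Zabbix server, an item key like this can be tested against the agent before it’s added in the frontend (the hostname is just an example):

zabbix_get -s winserver.domain.tld -k "vfs.fs.size[G:/topdir/subdir,pfree]"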

Configuring Lenovo SR650 nodes for running vSphere

As usual nowadays, Lenovo SR650 servers come with energy saving presets that may seem ”green”, but which kill virtualization performance.

The regular way to get them running the way they should is to enter the UEFI setup at boot, go to UEFI Settings -> System Settings -> Operating Modes and choose ”Maximum Performance”. Unfortunately, on these servers, this removes the ability to set VMware EVC: the Enhanced vMotion Compatibility functionality that allows for live migration of virtual servers between hosts of different generations, for example when introducing a new cluster into a datacenter.

It turns out that what’s missing is one specific setting: ”MONITOR/MWAIT” must be set to ”Enabled”. It should be possible to first choose the ”Maximum Performance” scheme, then switch to the ”Custom” scheme and only change this single setting in Operating modes. In addition, we should also go to System Settings -> Devices and I/O Ports, and modify PCI 64-bit Resource Allocation to read ”Disabled”.  For reference, the complete checklist is available from Lenovo:

Processors.CStates=Disable
Processors.C1EnhancedMode=Disable
Processors.EnergyEfficientTurbo=Disable
Processors.MONITORMWAIT=Enable
Power.PowerPerformanceBias=Platform Controlled
Power.PlatformControlledType=Maximum Performance
DevicesandIOPorts.PCI64BitResourceAllocation=Disable
DevicesandIOPorts.MMConfigBase=3GB

After making these changes, we should be able to both run our workload at maximum performance and enable EVC to migrate workloads between server clusters utilizing CPUs from different generations.

It’s so fluffy!

(Or: Backblaze B2 cloud backups from a Proxmox Virtual Environment)

Backups are one of those things that have a tendency to become unexpectedly expensive – at least through the eyes of a non-techie: Not only do you need enough space to store several generations of data, but you want at least twice that, since you want to protect your information not only from accidental deletion or corruption, but also from the kind of accidents that can render both the production data and the backup unreadable. Ultimately, you’ll also want to spend the resources to automate as much of the process as possible, because anything that requires manual work will be forgotten at some point, and by some perverse law of the Universe, that’s when it would have been needed.

In this post I’ll describe how I’ve solved it for full VM/container backups in my lab/home environment. It’s trivial to adapt the information from this post to apply to regular file system backups. Since I’m using a cloud service to store my backups, I’m applying a zero trust policy to them at the cost of increased storage (and network) requirements, but my primary dataset is small enough that this doesn’t really worry me.

Backblaze currently offers 10 GB of B2 object storage for free. This doesn’t sound like a lot today, but it will comfortably fit several compressed and encrypted copies of my reverse proxy, and my mail and web servers. That’s Linux containers for you.

First of all, we’ll need an account at Backblaze. Save your Master Application Key in your password manager! We’ll need it soon. Then we’ll want to create a Storage Bucket. In my case I gave it the wonderfully inventive name “pvebackup”.

Next, we shall install a program called rclone on our Proxmox server. The version in the apt repository as I write this seems to have a bug vis-à-vis B2 that will require us to use the Master Application Key rather than a more limited Application Key specifically for this bucket. Since we’re encrypting our cloud data anyway, I feel pretty OK with this compromise for home use.

EDIT 2018-10-30: Downloading the current deb package of rclone directly from the project site did solve this bug. In other words it’s possible and preferable to create a separate Application Key with access only to the backup bucket, at least if the B2 account will be used for other storage too.

# apt install rclone

Now we’ll configure the program:

# rclone config --config /etc/rclone.conf
Config file "/etc/rclone.conf" not found - using defaults
No remotes found - make a new one
n) New remote
s) Set configuration password
q) Quit config

Type n to create a new remote configuration. Name it b2, and select the appropriate number for Backblaze B2 storage from the list: In my case it was number 3.

The Account ID can be viewed in the Backblaze portal, and the Application Key is the master key we saved in our password manager earlier. Leave the endpoint blank and save your settings. Then we’ll just secure the file:

# chown root. /etc/rclone.conf && chmod 600 /etc/rclone.conf
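
For reference, the finished /etc/rclone.conf should end up looking roughly like this (identifiers redacted):

[b2]
type = b2
account = <account id>
key = <application key>
endpoint =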

We’ll want to encrypt our backups before sending them to an online location. For this we’ll use gpg, for which the default settings should be enough. The command to generate a key is gpg --gen-key, and I created a key in the name of “proxmox” with the mail address I’m using for notification mails from my PVE instance. Don’t forget to store the passphrase in your password manager, or your backups will be utterly worthless.
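
A quick sanity check that the key exists in root’s keyring (the name has to match what the hook script below uses with gpg -e -r):

# gpg --list-keys proxmox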

Next, we’ll shamelessly steal and modify a script to be used for hooking into the Proxmox VE backup process (I took it from this github repository and repurposed it for my needs).

Edit 2018-10-30: I added the --b2-hard-delete option to the job-end phase of deleting old backups, since the regular delete command just hides files in the B2 storage, adding to the cumulative storage used.

#!/usr/bin/perl -w
# VZdump hook script for offsite backups to Backblaze B2 storage
use strict;

print "HOOK: " . join (' ', @ARGV) . "\n";

my $phase = shift;

if ($phase eq 'job-start' ||
        $phase eq 'job-end'  ||
        $phase eq 'job-abort') {

        my $dumpdir = $ENV{DUMPDIR};

        my $storeid = $ENV{STOREID};

        print "HOOK-ENV: dumpdir=$dumpdir;storeid=$storeid\n";

        if ($phase eq 'job-end') {
                        # Delete backups older than 8 days
                        system ("/usr/bin/rclone delete -vv --b2-hard-delete --config /etc/rclone.conf --min-age 8d b2:pvebackup") == 0 ||
                                die "Deleting old backups failed";
        }
} elsif ($phase eq 'backup-start' ||
        $phase eq 'backup-end' ||
        $phase eq 'backup-abort' ||
        $phase eq 'log-end' ||
        $phase eq 'pre-stop' ||
        $phase eq 'pre-restart' ||
        $phase eq 'post-restart') {
        my $mode = shift; # stop/suspend/snapshot
        my $vmid = shift;
        my $vmtype = $ENV{VMTYPE}; # lxc/qemu
        my $dumpdir = $ENV{DUMPDIR};
        my $storeid = $ENV{STOREID};
        my $hostname = $ENV{HOSTNAME};
        # tarfile is only available in phase 'backup-end'
        my $tarfile = $ENV{TARFILE};
        my $gpgfile = $tarfile . ".gpg";
        # logfile is only available in phase 'log-end'
        my $logfile = $ENV{LOGFILE};
        print "HOOK-ENV: vmtype=$vmtype;dumpdir=$dumpdir;storeid=$storeid;hostname=$hostname;tarfile=$tarfile;logfile=$logfile\n";
        # Encrypt backup and send it to B2 storage
        if ($phase eq 'backup-end') {
                system ("/usr/bin/gpg -e -r proxmox $tarfile") == 0 ||
                        die "Encrypting tar file failed";
                system ("/usr/bin/rclone copy -v --config /etc/rclone.conf $gpgfile b2:pvebackup") == 0 ||
                        die "Copying encrypted file to B2 storage failed";
        }
        # Copy backup log to B2
        if ($phase eq 'log-end') {
                system ("/usr/bin/rclone copy -v --config /etc/rclone.conf $logfile b2:pvebackup") == 0 ||
                        die "Copying log file to B2 storage failed";
        }
} else {
      die "got unknown phase '$phase'";
}
exit (0);

Store this script in /usr/local/bin/vzclouddump.pl and make it executable:

# chown root. /usr/local/bin/vzclouddump.pl && chmod 755 /usr/local/bin/vzclouddump.pl

The last bit of CLI magic for today is to ensure that Proxmox VE actually makes use of our fancy script:

# echo "script: /usr/local/bin/vzclouddump.pl" >> /etc/vzdump.conf

To try it out, select a VM or container in the PVE web interface and choose Backup -> Backup now. I use Snapshot as my backup method and GZIP as my compression method. Hopefully you’ll see no errors in the log, and the B2 console will display a new file with a name corresponding to the current timestamp and the machine ID.

Conclusion

The tradeoffs with this solution compared to, for example, an enterprise product from Veeam are obvious, but so is the difference in cost. For a small business or a home lab, this solution should cover the needs to keep the most important data recoverable even if something bad happens to the server location.

Replacing ZFS system drives in Proxmox

Running Proxmox in a root-on-zfs configuration in a RAID10 pool results in an interesting artifact: We need a boot volume from which to start our system and initialize the elements required to recognize a ZFS pool. In effect, the first mirror pair in our disk set will have (at least) two partitions: a regular filesystem on the first partition and a second partition to participate in the ZFS pool.

To see how it all works together, I tried failing a drive and replacing it with a different one.

Happy-case

Had the drives had identical sector sizes, the operation would have been simple. In this case, sdb is the good mirror volume and sda is the new, empty drive. We want to copy the working partition table from the good drive to the new one, and then randomize the GUIDs of the new drive to avoid catastrophic confusion on the part of ZFS:

# sgdisk /dev/sdb -R /dev/sda
# sgdisk -G /dev/sda

After that, we should be able to use gdisk to view the partition table, to identify what partition does what, and simply copy the contents of the good partitions from the good mirror to the new drive:

# gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.1

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): p
Disk /dev/sda: 5860533168 sectors, 2.7 TiB
Logical sector size: 512 bytes
Disk identifier (GUID): xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 5860533134
Partitions will be aligned on 8-sector boundaries
Total free space is 0 sectors (0 bytes)

Number  Start (sector)    End (sector)  Size       Code  Name
   1              34            2047   1007.0 KiB  EF02  
   2            2048      5860516749   2.7 TiB     BF01  zfs
   9      5860516750      5860533134   8.0 MiB     BF07  

Command (? for help): q
# dd if=/dev/sdb1 of=/dev/sda1
# dd if=/dev/sdb9 of=/dev/sda9

Then we would add the new disk to our ZFS pool and have it resilvered:

# zpool replace rpool /dev/sda2

To view the resilvering process:

# zpool status -v
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep  1 18:48:13 2018
	2.43T scanned out of 2.55T at 170M/s, 0h13m to go
	1.22T resilvered, 94.99% done
config:

	NAME             STATE     READ WRITE CKSUM
	rpool            DEGRADED     0     0     0
	  mirror-0       DEGRADED     0     0     0
	    replacing-0  DEGRADED     0     0     0
	      old        UNAVAIL      0    63     0  corrupted data
	      sda2       ONLINE       0     0     0  (resilvering)
	    sdb2         ONLINE       0     0     0
	  mirror-1       ONLINE       0     0     0
	    sdc          ONLINE       0     0     0
	    sdd          ONLINE       0     0     0
	logs
	  sde1           ONLINE       0     0     0
	cache
	  sde2           ONLINE       0     0     0

errors: No known data errors

The process is time consuming on large drives, but since ZFS both understands the underlying disk layout and the filesystem on top of it, resilvering will only occur on blocks that are in use, which may save us a lot of time, depending on the extent to which our filesystem is filled.

When resilvering is done, we’ll just make sure there’s something to boot from on the new drive:

# grub-install /dev/sda
Installing for i386-pc platform.
Installation finished. No error reported.

Real life intervenes

Unfortunately for me, the new drive I tried had the modern 4 KB sector size (“Advanced Format / 4Kn”), while my old drives were stuck with the older 512 B standard. This led to the interesting side effect that my new drive was too small to fit volumes according to the healthy mirror drive’s partition table:

# sgdisk /dev/sdb -R /dev/sda
Caution! Secondary header was placed beyond the disk's limits! Moving the header, but other problems may occur!

In the end, I used gdisk to create a new partition table by hand, with sizes for partitions 1 and 9 as similar as possible to those of the healthy mirror (but not smaller!), entirely skipping the steps involving the sgdisk utility. The rest of the steps were identical.

The next problem I encountered was a bit worse: Even though ZFS in the Proxmox VE installation managed 4Kn drives just fine, there was simply no way to get the HP MicroServer Gen7 host to boot from one, so back to the old 3 TB WD RED I went.

Conclusion

Running root-on-zfs in a striped mirrors (“RAID10”) configuration complicates the replacement of any of the drives in the first mirror pair slightly compared to a setup where the ZFS pool is used for storage only.

Fortunately the difference is minimal, and except for the truly dangerous syntax and unclear documentation of the sgdisk command, replacing a boot disk really boils down to four steps:

  1. Make sure the relevant partitions exist.
  2. Copy non-ZFS-data from the healthy drive to the new one.
  3. Resilver the ZFS volume.
  4. Install GRUB.

For a pure data disk, the only step we really have to think about is step 3.

On the other hand, running too new hardware components in old servers doesn’t always work as intended. Note to the future me: Any meaningful expansion of disk space will require newer server hardware than the N54L-based MicroServer.