Rescuing vVol-based virtual machines

Background

As mentioned in a previous post, I had a really bad experience with vVols presented from IBM storage. Anyhow, the affected machines must be migrated to other storage, and having read up on how vVols work, that’s a scary prospect.

The good thing: Thanks to Veeam, I have excellent backups.

The bad thing: Since they depend on the system’s ability to take snapshots, I only have backups up until the point where my vVols failed. Troubleshooting, identifying the underlying issue, having VMware look at the systems and point at IBM, and finally realizing IBM won’t touch my issue unless I sign a year’s worth of software support agreements took several days, during which I had no new backups for the affected VMs.

Fortunately, most of the systems I had hosted on the failed storage volumes were either more or less static, or stored their data on machines residing on regular LUNs or vSAN.

The three methods

Veeam restore

Templates and turned-off machines were marked as Inaccessible in the vCenter console. Since they had definitely seen no changes since the vVol storage broke down, I simply restored them to other datastores from the latest available backup.
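
As a side note, the affected machines don’t have to be hunted down by eye: their connection state is visible through the vSphere API. Here’s a minimal pyVmomi sketch of how they could be listed; the vCenter address and credentials are placeholders, and it assumes pyVmomi is installed.

    # Minimal sketch: list VMs whose connection state shows they can no longer
    # be reached, e.g. after a vVol datastore has become unavailable.
    # Host name and credentials below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab use only; skips certificate checks
    si = SmartConnect(host="vcenter.example.com",
                      user="administrator@vsphere.local",
                      pwd="secret", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            state = vm.runtime.connectionState
            if state in ("inaccessible", "invalid", "orphaned"):
                print(vm.name, state)
        view.DestroyView()
    finally:
        Disconnect(si)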

VMware Converter

I attempted to use VMware vCenter Converter Standalone to migrate an Ubuntu VM, but for some reason it kept kernel panicking at boot. I suspect it may have something to do with the fact that Converter insists on replacing the paravirtual SCSI controller with the emulated LSI Logic one. I have yet to try it with a Windows server, but my initial tests made me decide to use Converter only as an extra backup option.

Cold migration

This is the method I was surprised to find worked, and it simplified things a lot. It turns out that – at least with the specific malfunction I experienced – turning off a VM that has been running doesn’t actually make it inaccessible to vCenter. And since a turned-off VM doesn’t require the creation of snapshots to allow migration, moving it to accessible storage was a breeze. This is what I ended up doing with most of the machines.
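
For completeness, here’s roughly what that cold migration looks like when done through the API rather than the vSphere client. It’s a minimal pyVmomi sketch under the same assumptions as the earlier one; the inventory paths are placeholders for the actual datacenter, VM, and datastore names, and si is a connection obtained as in that sketch.

    # Minimal sketch: cold-migrate a powered-off VM to another datastore.
    # Assumes an existing pyVmomi connection "si" as in the earlier sketch;
    # the inventory paths are placeholders.
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    search = si.RetrieveContent().searchIndex
    vm = search.FindByInventoryPath("MyDatacenter/vm/affected-vm")
    target = search.FindByInventoryPath("MyDatacenter/datastore/regular-lun-ds")

    # Cold migration only works on a powered-off VM, but needs no snapshot.
    assert vm.runtime.powerState == vim.VirtualMachinePowerState.poweredOff

    spec = vim.vm.RelocateSpec()
    spec.datastore = target                # move the VM home and all disks here
    WaitForTask(vm.RelocateVM_Task(spec))

Since no snapshot is involved, the VASA providers never enter the picture, which is exactly why this worked while the backups did not.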

Summary

It turns out that at least for my purposes, the vVols system decided to “fail safe”, relatively speaking, allowing for cold migration of all machines that had been running when the management layer failed. I had a bit of a scare when the cold migration of a huge server failed due to a corrupt snapshot, but a subsequent retry where I moved the machine to a faster datastore succeeded, meaning I did not have to worry about restoring data from other copies of the machine.

Why I’m moving away from vVols on IBM SVC storage

Virtual Volumes, or vVols, sound like a pretty nice idea: We present a pool of storage to vCenter, which in turn gets control of storage events within that pool via something called VASA providers. Benefits of this include the following:

  • vVols allow for policy-based storage assignment (a small sketch of what that looks like through the API follows this list).
  • We get to use an “inverted” snapshotting method, where snapshot deletions (i.e. commits of snapshotted data), which are by far the most common snapshot operation, are almost instantaneous, at the cost of more expensive rollbacks.
  • vCenter gets access to internal procedures in the storage solution instead of having to issue regular SCSI commands to the controllers.
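
To make the first bullet a bit more concrete: from vCenter’s point of view, a storage policy is just another VM property. The following is a minimal pyVmomi sketch of attaching a policy to a VM’s home object; the profile ID is a placeholder (a real one would be looked up through the separate SPBM endpoint), and the vm object is assumed to come from a lookup like the ones in the sketches above.

    # Minimal sketch: attach a storage policy to a VM through a reconfigure task.
    # The profile ID below is a placeholder; a real one is fetched from the
    # SPBM (pbm) endpoint. "vm" is assumed to be looked up as in the earlier
    # sketches.
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    policy = vim.vm.DefinedProfileSpec(profileId="00000000-placeholder-policy-id")

    spec = vim.vm.ConfigSpec()
    spec.vmProfile = [policy]          # storage policy for the VM home files
    WaitForTask(vm.ReconfigVM_Task(spec))

The same DefinedProfileSpec can also be attached to individual virtual disks in a reconfigure, which is what makes per-VM and even per-disk storage policies possible.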

As presented by VMware, the solution should be pretty robust: The VASA providers present an out-of-band configuration interface to vCenter, while the actual data channel is completely independent of them. As recommended by VMware, the VASA providers should also be stateless, meaning that recovering from a total loss of them should only be a matter of deploying new ones, which read the relevant metadata back from the storage itself and present it to vCenter.

So what’s the drawback?

If your VASA providers are offline, you can’t make changes to vVol storage, and any vVol-based VMs that aren’t actively running become unavailable. Not being able to make changes to vVol storage is a pretty big deal, because guess what: Snapshots are a vVol storage change. And snapshots are pretty much a requirement for VM backups, which for any production environment is a daily recurring task.
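
To make that dependency concrete, this is roughly the snapshot request a backup job issues before it starts reading a VM’s disks – a minimal pyVmomi sketch, with the vm object assumed as in the earlier sketches. On a vVol datastore that request is handed off to the array via the VASA provider, so with the providers offline the task simply fails, and so does the backup.

    # Minimal sketch: the quiesced, memory-less snapshot a backup job typically
    # requests before reading a VM's disks. On vVol storage this operation goes
    # through the VASA provider, which is why backups stop when the providers do.
    from pyVim.task import WaitForTask

    WaitForTask(vm.CreateSnapshot_Task(
        name="backup-temp",
        description="temporary snapshot taken for backup",
        memory=False,    # no memory image needed
        quiesce=True))   # file-system consistent via VMware Tools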

I’ve been presenting vVols from our V9000 and V7000 storage solutions via IBM Spectrum Control Base Edition for quite some time now, and have really liked it. Except when it stopped working. Because it did. Several times. Firmware update on the SAN? Spectrum Control stopped working. HA failover between Spectrum Control nodes? Not reliable. Updates to the operating system on a Spectrum Control node? At least once I couldn’t get the node back online, and had to restore a VM backup. And right now I’m having an issue where some necessary metadata string apparently contains untranslatable Unicode characters because someone – possibly even me – used the Swedish letters å, ä, and ö somewhere without thinking.

I’ve opened a case with IBM support to get things running again, and as soon as they are, I’m migrating everything off of my vVols on SVC and replacing those storage pools with regular LUNs. From now on I’m sticking to vSAN when I want the benefits of modern object storage for my virtualization environment.