Among all the great new features and improvements made to vSphere 6.5, some of the ones I am most excited about are the improvements to DRS and HA. So let's zoom into those briefly.
This information comes mostly from VMware pre-sales marketing material and should be considered preliminary. I hope to try out some of these features in our lab once the bits become available.
vCenter Server Appliance (VCSA) now supports a HA mode + Witness.
This appears to be similar in some respects to the NSX Edge HA function, but with one seriously important addition: a witness.
In any high-availability, clustering or other kind of continuous-uptime solution where data integrity or ‘state’ is important, you need a witness or ‘quorum’ function to determine which of the 2 HA ‘sides’ becomes the master of the function, and thus may make authoritative writes to data or configuration. This matters if you encounter a ‘split’ in your vSphere environment, where the HA members become isolated from each other. The witness helps decide which of the 2 members must ‘yield’ to the other. I expect the loser turns its function off. The introduction of a witness also helps the metro-cluster design: in case of a metro-cluster network split, the witness makes sure you cannot get a split-brain vCenter.
The HA function uses its own private network with a dedicated adapter that is added during configuration. There is a basic configuration option and an advanced one. I assume the latter lets you twiddle the knobs a bit more.
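To make the quorum idea concrete, here is a minimal sketch of a strict-majority rule for a three-member setup (two data nodes plus a witness). The node names and the functions are hypothetical and purely illustrative. This is not how VCSA HA is actually implemented, only the general principle that a side may hold the active role when it can see more than half of the members.

```python
from typing import Iterable, List, Set

# Hypothetical member names: two vCenter nodes plus a witness.
NODES: Set[str] = {"node_a", "node_b", "witness"}


def has_quorum(partition: Set[str], all_nodes: Set[str] = NODES) -> bool:
    """A partition has quorum only if it holds a strict majority of members."""
    return len(partition & all_nodes) * 2 > len(all_nodes)


def nodes_in_quorum(partitions: Iterable[Set[str]]) -> List[str]:
    """Return the data nodes that sit in a partition with quorum.

    Nodes outside this list must yield (stop serving) to avoid split-brain.
    """
    winners: List[str] = []
    for part in partitions:
        if has_quorum(part):
            winners.extend(sorted(n for n in part if n != "witness"))
    return winners


# node_b is cut off; node_a can still see the witness and keeps the role.
print(nodes_in_quorum([{"node_a", "witness"}, {"node_b"}]))
```

With only two members there is no majority after a split, which is exactly why the third, tie-breaking member is needed.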
There are some caveats. At release this feature only works if you are using an external Platform Services Controller. So assume this will not work if you run all the vSphere functions inside 1 appliance. At least not at GA.
It should be noted that the new integrated vSphere Update Manager for the VCSA will also fail over as part of this HA feature. It should also be noted that this feature is only available in Enterprise+.
Simplified HA Admission Control
vSphere 6.5 sees some improvements to HA admission control. As with many of the vSphere 6.5 enhancements, the aim here is to simplify or streamline the configuration process.
The various options have now been hidden under a general pulldown menu, and combine with the Host Failures Cluster Tolerates number, which now acts as input to whatever mode you select. In some ways this is now more like the VSAN Failures To Tolerate setting. You can, of course, still twiddle the knobs if you so wish.
In addition, the HA config will give you a heads-up if it expects your chosen reservation to potentially impact performance during HA restarts. You can now also guard against this by reserving a resource percentage that HA must guarantee during HA restarts. These options give you a lot more flexibility.
Admission control now also listens to the new levels of HA Restart priority, where it might not restart the lowest levels if they would violate the constraints. These 2 options together give you great new flexibility in controlling the HA restart and the resources it takes (or would take).
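As a rough illustration of how a single ‘host failures to tolerate’ number can feed a percentage-based mode, the sketch below derives a reserved cluster percentage from the host count. It assumes equally sized hosts, and the function name is hypothetical. This is my own back-of-the-envelope arithmetic, not VMware's documented algorithm.

```python
def reserved_percentage(total_hosts: int, host_failures_to_tolerate: int) -> float:
    """Percentage of cluster capacity to hold back for failover.

    Assumes all hosts contribute equal capacity (an illustrative simplification).
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    if not 0 <= host_failures_to_tolerate < total_hosts:
        raise ValueError("failures to tolerate must be >= 0 and < total_hosts")
    return 100.0 * host_failures_to_tolerate / total_hosts


# A 4-host cluster that must survive 1 host failure holds back a quarter.
print(reserved_percentage(4, 1))
```

The attraction of this model is that the reservation follows the cluster size: adding hosts lowers the percentage without anyone having to revisit the setting.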
vSphere HA Restart Priorities
At long last, vSphere now supports more than 3 priority levels. This adds a lot more flexibility to your HA design. In our own designs, we already assigned infrastructure components to the previous ‘high’ level, customer production workloads to ‘medium’ and everything else to ‘low’. What I was missing at the time was a way to differentiate between the infra components. For example, I would want Active Directory to start before many other infra services that rely on AD authentication. Syslogging is another service you want to get back up as soon as possible. And of course vCenter should ideally come back before many other VMware products that rely on it. It also allows you to make some smart sequencing decisions in regard to NSX components. I would restart NSX controllers, the Edge DLR and the Edge tenant routers first, for example. I am sure you can think of your own favorite examples.
As mentioned previously, these new expanded restart levels go hand-in-hand with the new admission control options.
vSphere HA Orchestrated Restart
This is another option that I have wanted to see for a very long time. I have seen many HA failovers in my time, and the most time is always spent afterwards by the application owners, putting the pieces back together again because things came up in the wrong order.
vSphere Orchestrated Restart allows you to create VM dependency rules, that will allow a HA Failover to restart the VMs in the order that best serves the application. This is similar to the rule sets we know from SRM.
Naturally you will need to engage your application teams to determine these rules. I do wonder about the limits here. In some of the environments we manage, there could potentially be hundreds of these kinds of rules. But you don’t want to make it too hard for HA to calculate all this, right?
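Conceptually, a set of ‘VM B depends on VM A’ rules is a dependency graph, and a valid restart order is a topological ordering of that graph. The sketch below groups VMs into restart batches using Python's standard `graphlib` module. The VM names and the rule format are hypothetical, and this says nothing about how vSphere HA itself evaluates its rules.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def restart_batches(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group VMs into batches; each batch only depends on earlier batches.

    depends_on maps a VM name to the set of VMs that must be up before it.
    Raises graphlib.CycleError if the rules contradict each other.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Hypothetical rules: AD first, then the database, then app and web tiers.
rules = {"db": {"ad"}, "app": {"db", "ad"}, "web": {"app"}, "syslog": set()}
for number, batch in enumerate(restart_batches(rules), start=1):
    print(number, batch)
```

Even with hundreds of rules this kind of ordering is cheap to compute. The harder part is collecting correct rules from the application teams, and spotting circular dependencies, which surface here as a `CycleError`.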
Proactive HA is a ‘new’ feature in so far that this is a new, deeper level of integration native to vCenter, and it can leverage the new ‘quarantine mode’ for ESX hosts. Similar behavior has for years been a feature of the Dell Management Plug-in for vCenter, for example, where a ‘maintenance mode’ action was triggered as a script action from a vCenter alert. Leveraging ‘quarantine mode’ enables new ways of dealing with partially failed hosts, for example proactively migrating VMs off based on specific failure rules, instead of an all-or-nothing approach.
For years we have only ever had 2 possible host states: Maintenance and.. well, not in maintenance 🙂
Quarantine Mode is the new middle ground. It can be leveraged tightly with the new proactive HA feature mentioned above and integrates with DRS, but is above all just a useful mode to employ operationally.
The most important thing to bear in mind is that Quarantine mode does not by default guarantee that VMs cannot or will not land on this host. An ESX host in quarantine can and will still be used to satisfy VM demand where needed. Think of reservations and HA failover. DRS, however, will try to avoid placing VMs on this host if possible.
Operationally, this is very similar to what we would already do in many ‘soft’ failure scenarios for hosts: we put DRS to semi-auto and slowly start to evacuate the host, usually ending up putting it in maintenance at the end of the day.
DRS Policy Enhancements
Again more streamlining. For us vSphere admins with a case of OCD, the new ‘even distribution’ model is quite relaxing. VMware describes this, endearingly, as the ‘peanut butter’ model. Personally I will refer to it as the Nutella model, because Nutella is delicious!
This of course refers to the ‘even spread’ of VMs across all hosts in your cluster.
This, and the other options added to DRS, are interesting from both a performance and a risk point of view. You avoid the ‘all your eggs in one basket’ issue, for example. Naturally the CPU over-commitment setting is especially interesting in VDI environments, or any other deployment that would benefit from good continuous CPU response.
DRS will now also attempt to balance load based on the network saturation level of a host, instead of only looking at CPU and RAM. However, it will prioritize CPU and RAM above all else. This is on a best-effort basis, so no guarantees.
Re-encrypting my work laptop hard drive.
VeraCrypt is the successor to TrueCrypt and its code has been community-vetted to ensure there are no ‘back doors’ in it (and its security can be independently verified).
The only downside it has is that by default, it uses a rather high header key derivation iteration value (a lot higher than TrueCrypt), meaning that it can take several minutes to boot your laptop. This is a frequent complaint by new VeraCrypt users.
The workaround is simple. As long as you use a password of 20 characters or more, you are allowed to reduce the number of iterations substantially by using a lower multiplier value (called a PIM, for Personal Iterations Multiplier), which you type in at boot time after your password. The multiplier may be as low as 1, which will more or less instantly mount your boot partition.
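To get a feel for how much the PIM changes the work done at boot, here is a small sketch of the PIM-to-iterations arithmetic. The two formulas are as I understand them from the VeraCrypt documentation (PIM × 2048 for system encryption, 15000 + PIM × 1000 otherwise). Treat them as an assumption and check the documentation for your version and hash algorithm. The function name is my own.

```python
def pim_iterations(pim: int, system_encryption: bool = True) -> int:
    """Key-derivation iterations implied by a PIM value.

    Formulas assumed from the VeraCrypt documentation; verify for your version.
    """
    if pim < 1:
        raise ValueError("PIM must be at least 1")
    if system_encryption:
        return pim * 2048  # boot (system partition) encryption
    return 15000 + pim * 1000  # non-system volumes and file containers


# Compare a PIM of 1 with a much larger value for a boot volume.
for pim in (1, 98):
    print(pim, pim_iterations(pim))
```

Under these assumptions a PIM of 1 cuts the key derivation work for a boot volume by roughly two orders of magnitude compared to a PIM of 98, which is why the mount becomes near-instant.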
For the purposes of theft-risk-reduction by common criminals, this is probably more than enough protection. However, if you are seeking to thwart the NSA which may try to brute-force your password using a server farm for 5 years, it may not be 😉
On ESX 5.5U3, I recently ran into an annoying issue with HA. vSphere had recently been updated, but the hosts had not all yet received the very latest version of the FDM (fault domain manager, aka ‘HA’) agent.
During some routine maintenance work, a particular host was taken in and out of maintenance mode a few times. Eventually it was observed to no longer properly complete HA configuration. Checking the host status in the UI, it would seemingly get stuck in the install phase of the latest FDM agent.
Checking the FDM installer log ( /var/run/log/fdm-installer.log ) , I found the following:
fdm-installer:  2016-08-25 11:16:13: Logging to /var/run/log/fdm-installer.log
fdm-installer:  2016-08-25 11:16:13: extracting vpx-upgrade-installer/VMware-fdm-eesx-2-linux-4180647.tar
 2016-08-25 11:16:13: exec rm -f /tmp/vmware-root/ha-agentmgr/upgrade
 2016-08-25 11:16:13: status = 0
 2016-08-25 11:16:13: exec cd /tmp/vmware-root/ha-agentmgr/vpx-upgrade-installer
 2016-08-25 11:16:13: status = 0
fdm-installer:  2016-08-25 11:16:13: Installing the VIB
fdm-installer:  2016-08-25 11:16:18: Result of esxcli software vib install -v=/tmp/vmware-root/ha-agentm
fdm-installer: Error in running rm /tardisks/vmware_f.v00:
fdm-installer: Return code: 1
fdm-installer: Output: rm: can’t remove ‘/tardisks/vmware_f.v00’: No such file or directory
fdm-installer: It is not safe to continue. Please reboot the host immediately to discard the unfinished update.
fdm-installer: Please refer to the log file for more details.
fdm-installer:  2016-08-25 11:16:18: There is a problem in installing fdm vib. Remove the vib…
 2016-08-25 11:16:18: exec esxcli software vib remove -n=vmware-fdm.vib
No VIB matching VIB search specification ‘vmware-fdm.vib’.
Please refer to the log file for more details.
 2016-08-25 11:16:19: status = 1
fdm-installer:  2016-08-25 11:16:19: Unable to install HA bundle because esxcli install return 1
This was decidedly odd. I checked the /tardisks mount, and could indeed not find any vmware_f.v00 file. It was trying to ‘remove’ (unmount, as it turns out) a file that did not exist. And this was breaking the uninstall process.
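When these logs get long, it helps to pull out just the commands that failed. The sketch below pairs each `exec ...` line with the `status = N` line that follows it and reports the non-zero ones. It assumes only the line layout visible in the excerpt above, and the function name is hypothetical.

```python
import re
from typing import List, Optional, Tuple

EXEC_RE = re.compile(r": exec (.+)$")
STATUS_RE = re.compile(r"status = (\d+)")


def failed_commands(log_text: str) -> List[Tuple[str, int]]:
    """Return (command, status) for every exec whose status is non-zero."""
    failures: List[Tuple[str, int]] = []
    last_exec: Optional[str] = None
    for line in log_text.splitlines():
        exec_match = EXEC_RE.search(line)
        if exec_match:
            last_exec = exec_match.group(1).strip()
            continue
        status_match = STATUS_RE.search(line)
        if status_match and last_exec is not None:
            status = int(status_match.group(1))
            if status != 0:
                failures.append((last_exec, status))
            last_exec = None  # this status has been attributed
    return failures


sample = """\
 2016-08-25 11:16:13: exec rm -f /tmp/vmware-root/ha-agentmgr/upgrade
 2016-08-25 11:16:13: status = 0
 2016-08-25 11:16:18: exec esxcli software vib remove -n=vmware-fdm.vib
Please refer to the log file for more details.
 2016-08-25 11:16:19: status = 1
"""
print(failed_commands(sample))
```

Run on the host or against a copy of /var/run/log/fdm-installer.log, this narrows the search to the step that broke. Here that is the VIB removal which followed the failed install.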
This page was useful in understanding the sequence of events: http://vm-facts.com/main/2016/01/23/vmware-ha-upgrade-agent-issue-troubleshooting/
I can only speculate as to what happened: at some point in the sequence of taking the host in and out of maintenance, the FDM uninstall somehow failed to complete properly, and left the host image list in a strange, invalid state.
Querying the host in this state, it listed the old FDM agent as still installed:
# esxcli software vib list | grep -i fdm
vmware-fdm 5.5.0-3252642 VMware VMwareCertified 2016-02-03
Yet a force uninstall of the VIB would fail with the same error.
fdm-uninstaller:  2016-08-24 11:42:30: exec /sbin/esxcli software vib remove -n=vmware-fdm
Message: Operation finished successfully.
Reboot Required: false
VIBs Removed: VMware_bootbank_vmware-fdm_5.5.0-3252642
fdm-uninstaller:  2016-08-24 11:43:58: status = 1
fdm-uninstaller:  2016-08-24 11:43:58: exec /sbin/chkconfig --del vmware-fdm
Together with VMware support, we tried various tricks, including copying a fresh imgdb.tgz from a different host to /bootbank , and running the latest installer and uninstaller of the FDM agent manually.
By the way, the source that vCenter uses for the FDM agent installer and uninstaller, is (on Windows) “Program Files\VMware\Infrastructure\VirtualCenter Server\upgrade”
If you wish to try to run these files directly on an ESX host, simply copy them to the host /tmp and chmod them to 777. They are then executable.
But in all cases, the FDM installer will always first try an uninstall of the previous version, which always includes trying to unmount /tardisks/vmware_f.v00
Now /tardisks is a bit of a strange bird, and deserves some explanation. This VMware research paper turned out to be an excellent document for understanding what /tardisks actually is and does: https://labs.vmware.com/vmtj/visorfs-a-special-purpose-file-system-for-efficient-handling-of-system-images
In short, it is a directory that hosts mounted TAR files, which are loaded at boot time from /bootbank (or /altbootbank). These TAR files are mounted as live filesystems, using what VMware calls VisorFS, which makes the mounted TAR files behave as part of the file system. This has various administrative and management advantages, as the paper linked above explains.
It is therefore not possible to simply copy a missing file to /tardisks in order to force the FDM uninstaller to properly complete.
You can list which TAR filesystems ESX has mounted, by running the command esxcli system visorfs tardisk list
This list will be the same as the filelist of /tardisks
Of note: when you re-install FDM, just after the install, the ‘system’ flag will be set to false, until you reboot. After a reboot, it will be set to true like all other modules.
On a normal host, you will find the FDM VIB listed here.
In our case, this entry was missing, even though the VIB list command showed it as installed.
So it seemed to me that if ESX needed to mount these TAR files at boot time, there was probably a command it used to do this.
Or in any case, I found it likely such a command should exist, if only for troubleshooting purposes.
I wondered whether, if I could mount this TAR manually, the uninstaller might proceed normally.
A few minutes of google-fu later, I stumbled on this:
Creating and Mounting VIB files in ESXi
Now the VMware engineer noted that the vmkramdisk command has been deprecated since 4.1, but to both our surprise (and delight) it was still there in 5.5, and still did its job.
We manually mounted the /bootbank/vmware_f.v00 using the command vmkramdisk /bootbank/vmware_f.v00
Immediately you will find vmware_f.v00 listed under /tardisks, and in the output of esxcli system visorfs tardisk list
And as predicted, the installer passed through the uninstall this time without a hitch, and then installed the new version of the HA agent. We rebooted the host just to be sure it would properly load the new VIB each time. It did, and the host managed to initiate HA in the cluster without any issues thereafter.
We all know about the VMware case numbers. Each SR you open gets a nice number.
Internally, VMware has a problem database. Newly found bugs end up in there. And if you spend a lot of time with VMware support, you will end up hearing a lot about these internal PRs (problem reports).
Here is a cool fact you may not know. Hidden in the HTML source of the public release notes that VMware produces, are the actual PR numbers associated with the issue that is described as having been fixed. (or not fixed).
Take the NSX 6.2.0 release notes for example: https://www.vmware.com/support/nsx/doc/releasenotes_nsx_vsphere_620.html
View the source:
And if you scroll down to the fixes, you will find:
It's those DOCNOTE numbers that are the actual PR numbers. Sometimes they also list the public KB number too. But there are far more internal PR numbers than there are public KB equivalents.
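If you want to collect these numbers without scrolling through the page source by hand, a few lines of Python will do. The sketch below looks for the text DOCNOTE and takes the digit runs that follow it. The exact markup around DOCNOTE is an assumption on my part (the sample uses a made-up HTML comment with made-up numbers), so adjust the pattern to whatever the real source looks like.

```python
import re
from typing import List

# Assumed layout: the word DOCNOTE followed by one or more numbers
# before the enclosing tag or comment ends.
DOCNOTE_RE = re.compile(r"DOCNOTE([^<>\n]*)")


def extract_pr_numbers(html: str) -> List[str]:
    """Return the unique numbers that follow a DOCNOTE marker, in page order."""
    found: List[str] = []
    for match in DOCNOTE_RE.finditer(html):
        for number in re.findall(r"\d{5,}", match.group(1)):
            if number not in found:
                found.append(number)
    return found


# Hypothetical snippet; the real release notes markup may differ.
sample_html = (
    "<li>Fixed issue A <!-- DOCNOTE 1234567 --></li>\n"
    "<li>Fixed issue B <!-- DOCNOTE 7654321, 1234567 --></li>"
)
print(extract_pr_numbers(sample_html))
```

Save the release notes page locally, feed its contents to the function, and you have a list you can check against the PR numbers your support engineer gave you.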
So how can this help you?
Well for one thing, you can start asking intelligent questions of VMware support, like: ‘Has something like my issue been reported before in the PR database?’ (prompting the engineer to go look for it, which they don’t always do of their own accord 😉).
Or you can use it as a validation check. If your issue is scheduled to be fixed in an upcoming patch, ask the support engineer for the associated PR numbers! That way, you can verify yourself in the release notes, if the fix was included!
The process of getting a new patch or update through QA is quite involved, and sometimes fixes fall by the wayside. This is not immediately known to everyone inside VMware. So it's always worth checking yourself; trust but verify.