Among all the great new features and improvements made to vSphere 6.5, some of the ones I am most excited about are the improvements to DRS and HA. So let's zoom into those briefly.
This information comes mostly from VMware pre-sales marketing material and should be considered preliminary. I hope to try out some of these features in our lab once the bits become available.
vCenter Server Appliance (VCSA) now supports an HA mode + Witness.
This appears to be similar in some respects to the NSX Edge HA function. But with one seriously important addition: a witness.
In any High-Availability, clustering or other kind of continuous-uptime solution, where data integrity or ‘state’ is important, you need a witness or ‘quorum’ function to determine which of the 2 HA ‘sides’ becomes the master of the function, and thus may make authoritative writes to data or configuration. This is important if you encounter a ‘split’ in your vSphere environment, where the two HA members become isolated from each other. The witness helps decide which of the 2 members must ‘yield’ to the other. I expect the loser turns its function off. The introduction of a witness also helps the metro-cluster design. In case of a metro-cluster network split, the witness now makes sure you cannot get a split-brain vCenter.
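To make the witness idea concrete, here is a minimal sketch of majority voting with three votes (two HA members plus a witness). This is a generic quorum illustration, not VMware's actual implementation; the function name and parameters are made up for the example, and it assumes a clean network split where the witness ends up on exactly one side.

```python
def may_stay_active(sees_peer: bool, sees_witness: bool) -> bool:
    """Return True if this node holds a majority of the three votes.

    Votes: this node, its HA peer, and the witness. A node always counts
    its own vote, so it needs at least one more to reach 2 of 3.
    """
    votes = 1 + int(sees_peer) + int(sees_witness)
    return votes >= 2


# Network split: the members lose each other, and only one side reaches the witness.
node_a = may_stay_active(sees_peer=False, sees_witness=True)
node_b = may_stay_active(sees_peer=False, sees_witness=False)
print(node_a, node_b)  # True False
```

Node B cannot reach a majority, so it is the one that has to yield. Without the witness vote, neither side could tell a dead peer from a broken link.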
The HA function uses its own private network with a dedicated adapter that is added during configuration. There is a basic config and an advanced option to configure. I assume the latter lets you twiddle the knobs a bit more.
There are some caveats. At release this feature only works if you are using an external Platform Services Controller. So I assume this will not work if you run all the vSphere functions inside 1 appliance. At least not at GA.
It should be noted that the new integrated vSphere Update Manager for the VCSA will also fail over as part of this HA feature. It should also be noted that this feature is only available in Enterprise+.
Simplified HA Admission Control
vSphere 6.5 sees some improvements to HA admission control. As with many of the vSphere 6.5 enhancements, the aim here is to simplify or streamline the configuration process.
The various options have now been hidden under a general pulldown menu, and combine with the Host Failures Cluster Tolerates number, which now acts as input to whatever mode you select. In some ways this is now more like the VSAN Failures To Tolerate setting. You can, of course, still twiddle the knobs if you so wish.
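As a rough illustration of how a single host-failures number could feed a percentage-based mode, here is a small worked calculation. It is a sketch under my own assumptions (all hosts are the same size, and the function name is made up), not the formula vSphere necessarily uses.

```python
def reserved_percentage(hosts: int, failures_to_tolerate: int) -> float:
    """Share of cluster capacity to hold back for failover.

    Assumes all hosts are the same size, so each host is 1/hosts of the cluster.
    """
    if not 0 <= failures_to_tolerate < hosts:
        raise ValueError("failures_to_tolerate must be between 0 and hosts - 1")
    return 100.0 * failures_to_tolerate / hosts


print(reserved_percentage(hosts=4, failures_to_tolerate=1))  # 25.0
print(reserved_percentage(hosts=8, failures_to_tolerate=2))  # 25.0
```

The point is that you state the intent (how many hosts may fail) and the reservation follows from the cluster size, instead of having to recalculate a percentage by hand every time you add a host.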
In addition, the HA config will give you a heads up if it expects that your chosen reservation will potentially impact performance while doing HA restarts. You are now also able to guard against this by reserving a resource percentage that HA must guarantee during HA restarts. These options give you a lot more flexibility.
Admission control now also listens to the new levels of HA Restart priority, where it might not restart the lowest levels if they would violate the constraints. These 2 options together give you great new flexibility in controlling the HA restart and the resources it takes (or would take).
vSphere HA Restart Priorities
At long last, vSphere now supports more than 3 priority-levels. This adds a lot more flexibility to your HA design. In our own designs, we already assigned infrastructure components to the previous ‘high’ level, customer production workloads to ‘medium’ and everything else to ‘low’. What I was missing at the time was a way to differentiate between the Infra components. For example, I would want Active Directory to start -before- many other Infra services that rely on AD authentication. Syslogging is another service you want to get back up as soon as possible. And of course vCenter should ideally come back before many other VMware products that rely on it. It also allows you to make some smart sequencing decisions in regard to NSX components. I would restart NSX controllers and the Edge DLR and Edge tenant routers first, for example. I am sure you can think of your own favorite examples.
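To show what the extra levels buy you, here is a small sketch that orders VMs by restart tier. The five level names and the VM names are illustrative assumptions on my part, and real HA does a lot more than sort a list.

```python
from enum import IntEnum


class RestartPriority(IntEnum):
    # Level names are illustrative assumptions; a lower value restarts earlier.
    HIGHEST = 0
    HIGH = 1
    MEDIUM = 2
    LOW = 3
    LOWEST = 4


vms = {
    "customer-web": RestartPriority.MEDIUM,
    "active-directory": RestartPriority.HIGHEST,
    "test-box": RestartPriority.LOWEST,
    "vcenter": RestartPriority.HIGH,
    "syslog": RestartPriority.HIGH,
}


def restart_order(vms):
    # sorted() is stable, so VMs on the same level keep their listed order
    return sorted(vms, key=lambda name: vms[name])


print(restart_order(vms))
# ['active-directory', 'vcenter', 'syslog', 'customer-web', 'test-box']
```

With only three levels, Active Directory, vCenter and syslog would all have shared ‘high’. The extra tier is what lets AD go first.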
As mentioned previously, these new expanded restart levels go hand-in-hand with the new admission control options.
vSphere HA Orchestrated Restart
This is another option that I have wanted to see for a very long time. I have seen many HA failovers in my time, and most of the time afterwards is always spent by the application owners, putting the pieces back together again because things came up in the wrong order.
vSphere Orchestrated Restart allows you to create VM dependency rules that allow an HA failover to restart the VMs in the order that best serves the application. This is similar to the rule sets we know from SRM.
Naturally you will need to engage your application teams to determine these rules. I do wonder about the limits here. In some of the environments we manage, there could potentially be hundreds of these kinds of rules. But you don’t want to make it too hard for HA to calculate all this, right?
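Conceptually, a set of ‘X must be up before Y’ rules is a dependency graph, and a valid restart order is a topological sort of that graph. The sketch below uses Python's standard `graphlib` module to show the idea; the VM names are made up, and this is not a claim about how HA computes its order internally.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each VM maps to the set of VMs that must be up before it starts.
depends_on = {
    "database": {"active-directory"},
    "app-server": {"database"},
    "web-frontend": {"app-server"},
}

order = list(TopologicalSorter(depends_on).static_order())
print(order)
# ['active-directory', 'database', 'app-server', 'web-frontend']
```

If two rules contradict each other (A before B and B before A), `static_order()` raises `graphlib.CycleError`, which is exactly the kind of mistake you want to catch while collecting rules from the application teams.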
Proactive HA is a ‘new’ feature, in so far that it is a new, deeper level of integration native to vCenter, and it can leverage the new ‘quarantine mode’ for ESX hosts. Similar behavior has been a feature of the Dell Management Plug-in for vCenter for years, for example, where a ‘maintenance mode’ action was triggered as a script action from a vCenter alert. Leveraging ‘quarantine mode’ enables new ways of dealing with partially failed hosts, for example pro-actively migrating VMs off, but based on specific failure rules instead of an all-or-nothing approach.
For years we have only ever had 2 possible host states: Maintenance and.. well, not in maintenance 🙂
Quarantine Mode is the new middle ground. It can be leveraged tightly with the new proactive HA feature mentioned above and integrates with DRS, but is above all just a useful mode to employ operationally.
The most important thing to bear in mind is that Quarantine mode does not by default guarantee that VMs cannot or will not land on this host. An ESX host in quarantine can and will still be used to satisfy VM demand where needed. Think of reservations and HA failover. DRS, however, will try to avoid placing VMs on this host if possible.
Operationally, this is very similar to what we would already do in many ‘soft’ failure scenarios for hosts: we put DRS to semi-auto and slowly start to evacuate the host, usually ending up putting it in maintenance at the end of the day.
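The ‘avoid if possible, use if needed’ behavior can be sketched as a soft preference rather than a hard exclusion. This is my own simplified illustration with made-up host names and a single memory figure per host, not actual DRS placement logic.

```python
def pick_host(hosts, needed_gb):
    """hosts: list of (name, free_gb, quarantined). Returns a host name or None."""
    fits = [h for h in hosts if h[1] >= needed_gb]
    healthy = [h for h in fits if not h[2]]
    candidates = healthy or fits  # fall back to quarantined hosts only if needed
    if not candidates:
        return None
    # among the candidates, take the host with the most free memory
    return max(candidates, key=lambda h: h[1])[0]


hosts = [("esx01", 16, False), ("esx02", 64, True)]
print(pick_host(hosts, 8))   # esx01
print(pick_host(hosts, 32))  # esx02
```

The small VM lands on the healthy host even though the quarantined one has far more room. The large VM only fits on the quarantined host, so it goes there anyway. A host in maintenance mode would simply not be in the list at all.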
DRS Policy Enhancements
Again more streamlining. For us vSphere admins with a case of OCD, the new ‘even distribution’ model is quite relaxing. VMware describes this, endearingly, as the ‘peanut butter’ model. Personally I will refer to it as the Nutella model, because Nutella is delicious!
This of course refers to the ‘even spread’ of VMs across all hosts in your cluster.
This, and the other options added to DRS, are interesting from both a performance and a risk point of view. You avoid the ‘all your eggs in one basket’ issue, for example. Naturally the CPU over-commitment setting is especially interesting in VDI environments, or any other deployment that would benefit from good continuous CPU response.
DRS will now also attempt to balance load based on the network saturation level of a host, instead of only looking at CPU and RAM. However, it will prioritize CPU and RAM above all else. This is on a best-effort basis, so no guarantees.
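One way to picture ‘CPU and RAM first, network second’ is a two-part sort key, where network load only breaks ties between hosts with similar compute load. The sketch below is purely illustrative: the 10-point buckets, host names and numbers are my own assumptions, not how DRS actually scores hosts.

```python
def rank_hosts(hosts):
    """hosts: dict of name -> (cpu_pct, ram_pct, net_pct), integers 0-100.

    Hosts are compared on the busier of CPU and RAM, grouped into 10-point
    buckets; network load only decides between hosts in the same bucket.
    """
    def key(name):
        cpu, ram, net = hosts[name]
        return (max(cpu, ram) // 10, net)
    return sorted(hosts, key=key)


hosts = {
    "esx01": (52, 40, 90),
    "esx02": (48, 50, 20),
    "esx03": (30, 25, 95),
}
print(rank_hosts(hosts))  # ['esx03', 'esx02', 'esx01']
```

Here esx03 still ranks first despite its busy network, because its CPU and RAM load is clearly lower. Between esx01 and esx02, which are about equally loaded on compute, the quieter network wins.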