This post describes our experience with upgrading from EMC VPLEX VS2 to VS6 hardware, in a seamless non-disruptive fashion.
EMC VPLEX is a powerful storage virtualization product, and I have had several years of experience with it in an active-active metro-storage-cluster deployment. I am a big fan. It's rock-solid, very intuitive to use and very reliable if set up correctly. Check out these 2 videos to learn what it does.
Around August 2016, EMC released VPLEX VS6, the next generation of hardware for the VPLEX platform. In many respects it is roughly twice as fast, using the latest Intel chipset and 16Gb FC, with an InfiniBand interconnect between the directors and a boatload of extra cache.
One of our customers recently wanted their VS2 hardware either scaled-out or replaced by VS6 for performance reasons. Going for a hardware replacement was more cost-effective than scaling out by adding more VS2 engines.
Impressively, the in-place upgrade of the hardware could be done non-disruptively. This is achievable through the clever way the GeoSynchrony firmware is ‘loosely coupled’ from the hardware. The VS6 hardware is a significant upgrade over the VS2, yet both are able to run the same version of the GeoSynchrony firmware, without the different components of VPLEX being aware of the hardware difference. This is especially useful if you have VPLEX deployed in a metro-cluster.
So to prepare for a seamless upgrade from VS2 to VS6, your VS2 hardware needs to be running the exact same GeoSynchrony firmware release as the VS6 hardware you will be transitioning to.
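The firmware precondition above boils down to a simple equality check across all directors. Here is a minimal Python sketch of that check, not an EMC tool: the director names and release strings are made-up placeholders, and how you collect the actual versions from your own system is left out.

```python
def firmware_mismatches(target_release, director_releases):
    """Return the directors whose release string differs from the target."""
    target = target_release.strip()
    return {
        name: release
        for name, release in director_releases.items()
        if release.strip() != target
    }


# Hypothetical inventory: names and release strings are placeholders only.
vs2_directors = {
    "director-1-1-A": "release-example-1",
    "director-1-1-B": "release-example-1",
}
vs6_target = "release-example-1"

mismatches = firmware_mismatches(vs6_target, vs2_directors)
if mismatches:
    print(f"upgrade firmware first: {mismatches}")
else:
    print("all directors match the target release")
```

An empty result means every VS2 director already runs the release the VS6 hardware will run, which is the state you want before any hardware work starts.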
VPLEX consists of ‘engines’ that each house 2 ‘directors’. You can think of these as broadly analogous to the service processors in an array, with the main difference being that they are active-active. They share a cache and are able to handle I/O for the same LUNs simultaneously. If you add another engine with 2 extra directors, you now have 4 directors all servicing the same workload and load-balancing the work.
Essentially, the directors form a cluster together, directly over their InfiniBand interconnect or, in a metro-cluster, also partly over Fibre Channel across the WAN. Because they are decoupled from the management plane, they can continue operating even when the management plane is temporarily unavailable. It also means that, if their firmware is the same, they can still form a cluster together even though the underlying hardware is a generation apart, without any of them noticing. This is what makes the non-disruptive upgrade possible, even in a metro-cluster configuration. It also means that you can upgrade each side of the VPLEX metro-cluster separately, a day or even a week apart from the other side, which makes planning an upgrade more flexible. There is a caveat, however: a possible slight performance hit on your WAN-COM replication between the VS2 and VS6 sides, so you don’t want to stay in that mixed state for too long.
VPLEX VS2 hardware. 1 engine consisting of 2 directors.
Because all directors running the same firmware are essentially equivalent, even though they might be of different hardware generations, you can almost predict what the non-disruptive hardware upgrade looks like. It's more or less the same procedure as if you were to replace a defective director. The only difference is that the old VS2 hardware is now directly connected to the new VS6 hardware, which enables the new VS6 directors to take over I/O and replication from the old directors one at a time.
The only thing the frontend hosts and the backend storage ever notice is temporarily losing half their storage paths. So naturally, you need to have the multipathing software on your hosts in order. This will most likely be EMC PowerPath, which handles this scenario flawlessly.
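Since half the paths disappear at a time, it is worth checking beforehand that no host depends on a single director. The following Python sketch shows the idea on a hand-written path inventory. The host names, director names and the data layout are all hypothetical, and in practice you would build the inventory from your multipathing software's own reporting.

```python
def hosts_at_risk(paths_by_host, directors_going_offline):
    """Return hosts that would have no live path left if the given
    directors went offline at the same time."""
    offline = set(directors_going_offline)
    at_risk = []
    for host, paths in paths_by_host.items():
        surviving = [
            p for p in paths
            if p["alive"] and p["director"] not in offline
        ]
        if not surviving:
            at_risk.append(host)
    return sorted(at_risk)


# Hypothetical inventory: one entry per host, one dict per storage path.
paths_by_host = {
    "esx01": [
        {"director": "director-1-1-A", "alive": True},
        {"director": "director-1-1-B", "alive": True},
    ],
    "esx02": [
        {"director": "director-1-1-A", "alive": True},
        {"director": "director-1-1-B", "alive": False},  # already dead path
    ],
}

print(hosts_at_risk(paths_by_host, ["director-1-1-A"]))
```

In this made-up inventory, esx02 is flagged because its only live path runs through the director that is about to be replaced. That is the kind of thing you want to find before the upgrade, not during it.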
The most impressive trick of this transfer, however, is that the new directors seamlessly take over the entire ‘identity’ of the old directors. This includes -everything- unique about the director, including, crucially, the WWNs. Transferring the WWNs is the very thing that makes the transition seamless. In the case of FC, it does of course require you to have ‘soft zoning’ (zoning by WWN rather than by switch port) in place, because a director port WWN will suddenly, in the space of about a minute, vanish from one switch port and pop up on another. But if you have your zoning set up correctly, you do not have to touch your switches at all.
And yes, that does mean you need double cabling, at least temporarily. The old VS2 is of course connected to your I/O switches, and the new VS6 needs to be connected on all its ports at the same time during the upgrade process.
So have fun cabling those 😉
That might be a bit of a hassle, but it's a small price to pay for such a smooth and seamless transition.
To enable the old VS2 hardware (which uses FC to talk to its partner director over local-com) to talk to the new VS6 directors (which use InfiniBand) during the migration, it is necessary to temporarily insert an extra FC module into the VS6 directors. During a specific step in the upgrade process, the VS2 is connected to the VS6, and for a brief period your I/O is being served by a combination of a VS2 and a VS6 director that are sharing volumes and cache with each other. This is a neat trick.
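To illustrate why the WWN takeover only works with soft zoning, here is a toy Python model of a fabric. It is a simplification under stated assumptions, not how any real switch evaluates zones: the WWNs and switch port names are invented, and a zone is reduced to a plain set of members.

```python
def wwn_zone_allows(zone_wwns, initiator_wwn, target_wwn):
    """Soft (WWN-based) zone: membership follows the WWN, wherever it logs in."""
    return initiator_wwn in zone_wwns and target_wwn in zone_wwns


def port_zone_allows(zone_ports, fabric_logins, initiator_wwn, target_wwn):
    """Port-based zone: membership follows the switch port a WWN is plugged into."""
    return (
        fabric_logins.get(initiator_wwn) in zone_ports
        and fabric_logins.get(target_wwn) in zone_ports
    )


# Invented example values.
HOST_WWN = "10:00:00:00:00:00:00:01"
DIRECTOR_FE_WWN = "50:00:00:00:00:00:00:a0"

soft_zone = {HOST_WWN, DIRECTOR_FE_WWN}
port_zone = {"sw1/3", "sw1/10"}

# Before the swap: the director WWN is logged in on the VS2's switch port.
logins = {HOST_WWN: "sw1/3", DIRECTOR_FE_WWN: "sw1/10"}
before = (
    wwn_zone_allows(soft_zone, HOST_WWN, DIRECTOR_FE_WWN),
    port_zone_allows(port_zone, logins, HOST_WWN, DIRECTOR_FE_WWN),
)

# After the swap: the same WWN pops up on the port the VS6 is cabled to.
logins[DIRECTOR_FE_WWN] = "sw1/22"
after = (
    wwn_zone_allows(soft_zone, HOST_WWN, DIRECTOR_FE_WWN),
    port_zone_allows(port_zone, logins, HOST_WWN, DIRECTOR_FE_WWN),
)

print("before swap (soft, port):", before)
print("after swap  (soft, port):", after)
```

In this model the WWN-based zone keeps allowing the host to reach the director after the move, while the port-based zone stops matching until someone edits it on the switch.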
Inserting the temp IO modules:
As a final step, the old VS2 management server settings are imported to the new redundant VS6 management modules.
In VS6, these management modules are now integrated into the director chassis, and act in an active-passive failover mode. This is a great improvement over the single non-redundant VS2 management server, with its single power supply (!)
–==UPDATE May 2018==–
As Daniel points out in the comments, the statement “In VS6, these management modules are now integrated into the director chassis, and act in a active-passive failover mode” is not true. The secondary MMCS-B just holds the firmware and logs, and does not take over in case MMCS-A fails.
I have confirmed this with my own EMC contacts. In short, MMCS-B is envisioned to be used either as a failover/passive target for the management services that run on MMCS-A, or potentially in some kind of clustered active/active model whereby either MMCS could handle work or jobs related to storage administration. But the main point is that this is currently not yet the case (as of June 2018). The hardware is there, but it's just not being used at all. So you still have only 1 management server, the MMCS-A.
It should be noted that the EMC engineers, at the time they did this upgrade with us, either did not know this or did not mention it.
–==/end UPDATE May 2018==–
New management modules:
The new management server hardware completely takes over the identity and settings of the old management server. This even includes the IP address, customer cluster names and the cluster serial numbers. The VS6 will adopt the serial numbers of your VS2 hardware. This is important to know from an EMC support point of view, and may confuse people.
The great advantage is that all local settings and accounts, and all monitoring tools and alerting mechanisms, work flawlessly with the new hardware. For example, we have a PowerShell script that uses the API to check the health status. This script worked immediately with the VS6 without having to change anything. ViPR SRM only needed a restart of the VPLEX collector, after which it continued collecting without any further changes. The only thing I have found that did not get transferred were the SNMP trap targets.
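To give an idea of what such a health check does, here is a minimal Python sketch. It is not our PowerShell script, and it makes no API calls. It only parses a JSON document and reports anything whose health state is not "ok". The JSON layout (a list of contexts, each with name/value attributes) and the field names are assumptions made for this illustration, so check the REST API guide for your GeoSynchrony release before relying on any of it.

```python
import json


def unhealthy_components(payload, ok_states=("ok",)):
    """Return {component name: health state} for every context in the
    payload whose 'health-state' attribute is not in ok_states."""
    doc = json.loads(payload)
    problems = {}
    for ctx in doc["response"]["context"]:
        # Flatten the assumed [{"name": ..., "value": ...}] list into a dict.
        attrs = {a["name"]: a["value"] for a in ctx["attributes"]}
        state = attrs.get("health-state")
        if state not in ok_states:
            problems[attrs.get("name", "unknown")] = state
    return problems


# Hand-written sample in the ASSUMED layout; not captured from a real system.
sample = json.dumps({
    "response": {
        "context": [
            {"attributes": [
                {"name": "name", "value": "cluster-1"},
                {"name": "health-state", "value": "ok"},
            ]},
            {"attributes": [
                {"name": "name", "value": "cluster-2"},
                {"name": "health-state", "value": "degraded"},
            ]},
        ]
    }
})

problems = unhealthy_components(sample)
print(problems if problems else "all components healthy")
```

Because the VS6 takes over the IP address and cluster names of the VS2, a check like this keeps pointing at the same endpoint and the same component names after the hardware swap, which is why nothing had to change on our side.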
After upgrade, the benefit of the new VS6 hardware was immediately noticeable. Here is a graph of average aggregate director CPU use, from EMC VIPR SRM:
As this kind of product is fundamental to your storage layer, its stability and reliability, especially during maintenance work like firmware and hardware upgrades, is paramount, and is taken seriously by EMC. Unlike other EMC products like VNX, you are not expected or indeed allowed to update this hardware yourself, unless you are a certified partner. Changes that need to be done to your VPLEX platform go through a part of EMC called the ‘Remote Pro-active’ team.
There is a process that has to be followed, which involves getting them involved early, a round of pre-validation health checks, and the hands-on execution of the maintenance job, either remotely via WebEx or on site by local EMC engineers if that is required. A hardware upgrade will always require on-site personnel, so make sure they deliver pizza to the datacenter! If an upgrade goes smoothly, expect it to take 4-5 hours. That includes all the final pre-checks, hardware work, cabling, transfer of management identity to the VS6, and decommissioning of the VS2 hardware.
In the end the upgrade was a great success, and our customer had zero impact. Pretty impressive for a complete hardware replacement of such a vital part of your storage infra.
Finally, here is the text of the September 2016 VPLEX Uptime bulletin with some additional information about the upgrade requirements. Be aware that this may be outdated; please consult EMC support for the latest info.
There is an EMC community thread where people have been sharing their experiences with the upgrade. Have a look here: https://community.emc.com/message/969664