—Update 21st March: the vBrownbag episode on AWS that these slides are from is now posted on YouTube: https://www.youtube.com/watch?v=u8rWI5tuSq8 —
—Update 15th March ~5pm CET: added some extra info and clarified some points—
More details regarding VMware Cloud on AWS are starting to come out of VMware. Tonight I attended an awesome #vBrownbag webinar on #VMWonAWS, hosted by Chris Williams (@mistwire) and Ariel Sanchez ( @arielsanchezmor).
Presenting were Adam Osterholt (@osterholta), Eric Hardcastle (@CloudGuyVMware) and Paul Gifford (@cloudcanuck).
Here are some of the slides and highlights that stood out for me. The information is not under NDA, and permission was given to repost the slides.
VMware Cross-Cloud Architecture. A nice slide that summarises the VMware strategy going forward. Expect VMware Cloud to pop up in more places, like IBM Cloud. More info about the VMware cloud strategy here
Important to note here is that this is a complete service offering, meaning it's fully licensed. You do not need to bring your own licenses to the table, so you get the full benefit of technologies like vSAN and NSX as part of the offering.
Skillsets… this is a huge selling point. Many native cloud deployments require your admins to know AWS or cloud-native specific tools and automation scripting languages. VMware Cloud on AWS (VMWonAWS) removes that barrier to entry completely. If you can administer a VMware-based cloud stack today, you can administer VMware Cloud on AWS.
You have access to AWS sites around the world to host VMWonAWS. Note, however, that because these are vSphere clusters on bare metal, you are bound in certain ways to the location where you instantiate your VMware environment.
The initial rollout will be in Oregon, followed by an EMEA location, sometime around mid-2017 (from announcement to GA in about a year… not bad!!).
With the recent S3 outage in mind, I asked specifically about things like stretched clusters and other advanced high-availability features inside AWS; these will not initially be part of the offering. However, you can always move your VMs off and onto VMWonAWS via x-vMotion. More on that later.
VMWonAWS will use customized HTML5 interfaces throughout. No Flash here! 🙂
But if you are a bit of a masochist and you like the Flash/Flex client, it will be available to you anyway.
The frontend provisioning component will include its own API interface. What you see below is a mockup and subject to change.
Administering your cluster uses a custom and locked-down version of the already available HTML5 client.
It's important to note here that VMware will administer and upgrade their software inside these environments themselves. They will keep n-1 backward compatibility, but if you have a lot of integration talking to this environment, you will operationally have to keep up with updating your stuff. Think of vRA/vRO workflows and other automation you might have talking to your VMWonAWS instances. This may be a challenge for customers.
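One cheap mitigation: have your integrations check the API version they are talking to before doing any work, and refuse to run against a version you have not tested. Here is a minimal sketch of that idea, assuming pyVmomi; the hostname, credentials and the tested-version list are placeholders.

```python
# Gate automation on the vSphere API version, since VMware upgrades the
# VMWonAWS side on their own schedule (with n-1 backward compatibility).
import ssl
from pyVim.connect import SmartConnect, Disconnect

context = ssl._create_unverified_context()  # demo only; validate certs for real
si = SmartConnect(host="vcenter.example.com", user="cloudadmin@vmc.local",
                  pwd="********", sslContext=context)
try:
    about = si.content.about
    print(f"vCenter {about.version} (build {about.build}), API {about.apiVersion}")
    if about.apiVersion not in ("6.0", "6.5"):  # versions your workflows were tested on
        raise RuntimeError("Untested vSphere API version; skipping automation run")
finally:
    Disconnect(si)
```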
Demonstrated below is a feature unique to VMWonAWS: the ability to resize your entire cluster on the fly.
Again, the above screenshots are mockups/work-in-progress.
Your VMware environment is neatly wrapped up in an NSX Edge gateway, which you cannot touch. Inside your environment, however, you are able to provision your own NSX networks, manage the DFW, edges, etc., with all the functionality they offer. Initially, though, NSX API access will be limited or not available, so it may be hard to automate NSX actions out of the gate.
The Virtual Private Cloud (VPC) you get is divided into 2 pools of resources. Management functions are separated from compute.
Remember that all of this is bare-metal, managed and patched by VMware directly.
VMware manages the VPC with their stuff in it. You get access to it via your own VPC, and the two are then linked together.
They give you a snazzy web frontend interface with its own API to do the basic connectivity config and provisioning.
So how do you connect up your new VMWonAWS instance with your on-premises infrastructure?
End-to-end, you are bridging via Edges… but there is obviously a little more involved. Here are the high-level steps that the customer and VMware/Amazon take to hook it all up.
The thing to remember here is that your traffic to the VMware VPC is routed through your customer VPC. It 'fronts' the VMware VPC.
Link the vCenters together, and now you can use x-vMotion to move VMs back and forth. And remember, no NSX license is required on-prem to do this.
If you already have NSX, you can of course stretch your NSX networks across. This allows live x-vMotions (cross-vCenter vMotion).
If you do not have NSX on-premises, you will deploy a locked-down NSX Edge for bridging, but vMotions would be 'cold'.
Encryption will be available between the Edge endpoints. No details on this yet.
As standard NSX Edges are being used on both ends, you can do things like NAT, so you can have overlapping IP spaces if you so choose. That is not something native AWS VPCs allow you to do.
Because you always have your own native AWS VPC, you can leverage any other native AWS service.
And you can do some crazy-cool things too, that will be familiar to native AWS users. You can, for example, leverage regional native AWS services, such as S3, from inside VMWonAWS VMs. These resources are connected inside AWS, using their own internal routing, so this kind of traffic does not need to go back out over the internet.
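As a quick illustration of what that enables, here is a hedged sketch of a VMWonAWS-hosted VM pushing a backup dump to a regional S3 bucket using the standard boto3 SDK. The bucket name, region and file paths are made-up placeholders, and credentials are assumed to come from the usual AWS credential chain.

```python
# Sketch: a VM inside VMWonAWS talking to regional S3 over AWS-internal routing
# (via the linked customer VPC), so the traffic never leaves Amazon's network.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # same region as the SDDC

# Push a local dump into the bucket...
s3.upload_file("/var/backups/app-db.dump", "my-example-bucket", "backups/app-db.dump")

# ...and list what is there.
resp = s3.list_objects_v2(Bucket="my-example-bucket", Prefix="backups/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```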
VMs inside VMWonAWS can make use of the Amazon breakout for their internet connectivity. Or you can backflow it through your own on-premises internet.
Some additional notes on APIs:
There is no backup function built in, so you are expected to back up your own VMs hosted inside VMWonAWS. To facilitate this, the VADP API for backups is available to leverage, as per normal.
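To give a feel for what VADP-based products do under the hood, here is a rough pyVmomi sketch of the typical first step: taking a quiesced snapshot, which the backup software then reads the frozen disks from. It reuses the connection si from the earlier version-check sketch, and the VM name is a placeholder.

```python
# First step of a typical VADP-style backup: a quiesced snapshot of the VM.
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_vm(si, name):
    """Walk the inventory for a VM by name (fine for a sketch, slow at scale)."""
    view = si.content.viewManager.CreateContainerView(
        si.content.rootFolder, [vim.VirtualMachine], True)
    try:
        return next(vm for vm in view.view if vm.name == name)
    finally:
        view.Destroy()

vm = find_vm(si, "app-server-01")
task = vm.CreateSnapshot_Task(name="pre-backup",
                              description="quiesced snapshot for VADP-style backup",
                              memory=False, quiesce=True)
WaitForTask(task)  # the backup product would now read the frozen base disks
```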
Some notes on vSAN:
vSAN is used as the underlying storage. All-flash. VMware does not yet know what the default setup of this will be in terms of FTT (Failures To Tolerate) level or dedupe, but you will have control over most of it, to decide for yourself what you want.
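To get a feel for what the FTT level costs you, here is a back-of-the-envelope sizing calculation, assuming plain RAID-1 mirroring (no erasure coding):

```python
# vSAN napkin math with RAID-1 mirroring: every object is stored FTT+1 times,
# and a cluster needs at least 2*FTT+1 hosts (replicas plus witness components).
def raw_capacity_tb(usable_tb, ftt):
    return usable_tb * (ftt + 1)

def min_hosts(ftt):
    return 2 * ftt + 1

for ftt in (1, 2):
    print(f"FTT={ftt}: 10 TB usable -> {raw_capacity_tb(10, ftt):.0f} TB raw, "
          f"minimum {min_hosts(ftt)} hosts")
# FTT=1: 20 TB raw, 3 hosts. FTT=2: 30 TB raw, 5 hosts.
```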
It's been a very interesting year so far, career-wise. Late last year I figured I had a good shot at vExpert status. I had written some in-depth blog posts, covered a lot of ground with GSS and even discovered some unique bugs (some of which I have yet to blog about), and tweeted a fair bit. But last year's highlight was definitely speaking at the VMware summer school in Utrecht, on our experiences with Metro-Cluster. I reached out to over 15 VMware employees across GSS, PSO and the NSBU and every one was willing to sponsor me, which was very nice 🙂
Having gotten to know some vExperts over the last year, one theme that kept drawing my attention was the exclusive Slack community they are given access to. Yes, all the free stuff is nice of course, and you can see many a blog post commenting that it should not all 'be about the swag', but I can honestly say I don't care all that much for that. For me it's the networking that is by far the most interesting opportunity. And being somewhere lightly on the 'spectrum' and generally shy around people, a chat-room seemed more or less a perfect way to connect to other people with high-quality knowledge. And so it has indeed proven to be!
I was slightly worried that with over 1500 vExperts being selected this year (some people are none too happy about this), the chat would be a constant buzz of activity. Which could be good, or bad. But it's actually relatively quiet most of the time. At least so far. To give you an idea, there are only about 550 people 'in' the main vExpert channel, and many of the other channels have far fewer participants… vSAN: 200, NSX: 238, AWS: 100 :p Of those, 80% are lurkers… it's just like IRC 😀
I have already had some very interesting discussions, and it was nice to see that even while feeling like a complete amateur amid all the ‘big names’ that frequent that chat, I still have actual valuable field experience to bring to the table, especially in regard to aspects of NSX. When the ‘Crème de la crème’ of the VMware community has next to no experience with, say, NSX load-balancing, then even lowly me can add value. And that makes me a lot less shy about participating!
Generally though, once you are part of the VMware community, and especially if you are a vExpert, the community aspect really starts to become important. Seeing what people are talking and posting and tweeting about, being on the inside track of a lot of those talks, mingling with thought-leaders and VMware product owners, pushes you to become even more involved. One place where this has really ignited for me is podcasts. I used to listen to podcasts a lot, and over the last few months that interest has revived, revolving around the VMware community. I now even try to take the time to attend the live recordings of several. I will dedicate a separate post to my favorite podcasts to follow.
OK, let's talk about swag anyway :p
The most interesting things in this lineup are a year's free Pluralsight subscription, 35% off VMware Press titles, and advance previews and webinars about unreleased or upcoming technology (for a complete list of the kinds of benefits, this post is good). Certain companies like Rubrik, Pluralsight and Veeam really put an effort into supporting vExperts and offer software, training and other goodies for free. It seems like you get a lot of extra benefit from visiting VMworld, not just from a 'stuff' perspective, but mainly from a networking angle. But unfortunately it is by no means certain I can attend every year.
As for Pluralsight, their catalog is intimidating. I am looking to get more into Docker and associated things like Kubernetes, so these will be the first things I look at. For example Getting Started with Docker ( https://www.pluralsight.com/courses/docker-getting-started ), a course given by @ who has also produced a short book on Docker I highly recommend!
Once you get into the vExpert community, it seems pretty straightforward to stay in it year after year, as the momentum of participation carries you forward. Whether through blog posts or speaking at events, I have a feeling such things tend to become a natural and expected part of being at 'this level'. Let's hope I can keep it up. Well, speaking for the first time this year at the NLVMUG should certainly help 😉
This post describes our experience with upgrading from EMC VPLEX VS2 to VS6 hardware, in a seamless non-disruptive fashion.
EMC VPLEX is a powerful storage virtualization product and I have had several years of experience with it in an active-active metro-storage-cluster deployment. I am a big fan. It's rock-solid, very intuitive to use and very reliable if set up correctly. Check out these 2 videos to learn what it does.
Around August 2016, EMC released VPLEX VS6, the next generation of hardware for the VPLEX platform. In many respects it is roughly twice as fast, utilizing the latest Intel chipset and 16Gb FC, with an InfiniBand interconnect between the directors and a boatload of extra cache.
One of our customers recently wanted their VS2 hardware either scaled-out or replaced by VS6 for performance reasons. Going for a hardware replacement was more cost-effective than scaling out by adding more VS2 engines.
Impressively, the in-place upgrade of the hardware could be done non-disruptively. This is achievable through the clever way the GeoSynchrony firmware is 'loosely coupled' from the hardware. The VS6 hardware is a significant upgrade over the VS2, yet both are able to run the same firmware version of GeoSynchrony without the different components of VPLEX being aware of the fact. This is especially useful if you have VPLEX deployed in a metro-cluster.
So to prepare for a seamless upgrade from VS2 to VS6, your VS2 hardware needs to be on the exact same GeoSynchrony release as the VS6 hardware you will be transitioning to.
VPLEX consists of 'engines' that house 2 'directors'. You can think of these as broadly analogous to the service processors in an array, the main difference being that they are active-active: they share a cache and are able to handle I/O for the same LUNs simultaneously. If you add another engine with 2 extra directors, you have 4 directors all servicing the same workload and load-balancing the work.
Essentially the directors form a cluster together, directly over InfiniBand, or, in a metro-cluster, also partially over Fibre Channel across the WAN. Because they are decoupled from the management plane, they can continue operating even when the management plane is temporarily unavailable. It also means that, as long as their firmware is the same, they can still form a cluster together without any of them noticing that the underlying hardware is a generation apart. This is what makes the non-disruptive upgrade possible, even in a metro-cluster configuration. It also means that you can upgrade one side of the VPLEX metro-cluster separately, a day or even a week apart from the other side, which makes planning an upgrade more flexible. There is a caveat however: a possible slight performance hit on your wan-com replication between the VS2 and VS6 sides, so you don't want to stay in that state for too long.
VPLEX VS2 hardware. 1 engine consisting of 2 directors.
Because all directors running the same firmware are essentially equivalent, even though they might be of different hardware generations, you can almost predict what the non-disruptive hardware upgrade looks like. It's more or less the same procedure as if you were to replace a defective director. The only difference is that the old VS2 hardware is now cross-connected to the new VS6 hardware, which enables the new VS6 directors to take over I/O and replication from the old directors one at a time.
The only thing the frontend hosts and the backend storage ever notice is temporarily losing half their storage paths. So naturally, you need to have the multipathing software on your hosts in order. This will most likely be EMC PowerPath, which handles this scenario flawlessly.
The most impressive trick of this transfer, however, is that the new directors seamlessly take over the entire 'identity' of the old directors. This includes -everything- unique about the director, crucially including the WWNs. This is important because transferring the WWNs is the very thing that makes the transition seamless. It does, of course, require you to have 'soft zoning' in place in the case of FC, as a director port WWN will suddenly, in the space of about a minute, vanish from one port and pop up on another. But if you have your zoning set up correctly, you do not have to touch your switches at all.
And yes, that does mean you need double cabling, at least temporarily. The old VS2 is of course connected to your i/o switches, and the new VS6 will need to be connected simultaneously on all its ports, during the upgrade process.
So have fun cabling those 😉
That might be a bit of a hassle, but it's a small price to pay for such a smooth and seamless transition.
To enable the old VS2 hardware (which used FC to talk to its partner director over local-com) to talk to the new VS6 directors (which use InfiniBand) during the migration, it is necessary to temporarily insert an extra FC module into the VS6 directors. During a specific step in the upgrade process, the VS2 is connected to the VS6, and for a brief period your I/O is being served from a combination of a VS2 and a VS6 director that are sharing volumes and cache with each other. This is a neat trick.
Inserting the temp IO modules:
As a final step, the old VS2 management server settings are imported into the new redundant VS6 management modules. In VS6, these management modules are integrated into the director chassis and act in an active-passive failover mode. This is a great improvement over the single non-redundant VS2 management server, with its single power supply (!)
New management modules:
The new management server hardware completely takes over the identity and settings of the old management server. This even includes the IP address, customer cluster names and the cluster serial numbers: the VS6 will adopt the serial numbers of your VS2 hardware. This is important to know from an EMC support point of view and may confuse people.
The great advantage is that all local settings and accounts, and all monitoring tools and alerting mechanisms, work flawlessly with the new hardware. For example, we have a PowerShell script that uses the API to check the health status. This script worked immediately with the VS6 without having to change anything. VIPR SRM also only needed a restart of the VPLEX collector, after which it continued collecting without any changes. The only things I have found that did not get transferred were the SNMP trap targets.
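For illustration, a rough Python analogue of such a health-check script could look like the sketch below. The /vplex/clusters path and the Username/Password header authentication are assumptions on my part; verify both against the REST API guide for your GeoSynchrony release.

```python
# Poll the VPLEX management server's REST API. Because the VS6 adopts the
# VS2's identity (IP, cluster names, serials), a script like this can keep
# working unmodified across the hardware swap.
import json
import requests

VPLEX = "https://vplex-mgmt.example.com"  # placeholder management server
HEADERS = {"Username": "service", "Password": "********"}  # header-based auth

resp = requests.get(f"{VPLEX}/vplex/clusters", headers=HEADERS, verify=False)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))  # response shape varies per release
```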
After upgrade, the benefit of the new VS6 hardware was immediately noticeable. Here is a graph of average aggregate director CPU use, from EMC VIPR SRM:
As this kind of product is fundamental to your storage layer, its stability and reliability, especially during maintenance work like firmware and hardware upgrades, are paramount, and are taken seriously by EMC. Unlike other EMC products like VNX, you are not expected or indeed allowed to update this hardware yourself, unless you are a certified partner. Changes that need to be done to your VPLEX platform go through a part of EMC called the 'Remote Pro-active' team.
There is a process that has to be followed, which involves getting them involved early, a round of pre-validation health-checks, and the hands-on part of the maintenance job, either remotely via WebEx or on site by local EMC engineers if required. A hardware upgrade will always require onsite personnel, so make sure they deliver pizza to the datacenter! If an upgrade goes smoothly, expect it to take 4-5 hours. That includes all the final pre-checks, hardware work, cabling, transfer of the management identity to the VS6, and decommissioning of the VS2 hardware.
In the end the upgrade was a great success, and our customer had zero impact. Pretty impressive for a complete hardware replacement of such a vital part of your storage infra.
Finally, here is the text of the September 2016 VPLEX Uptime bulletin with some additional information about the upgrade requirements. Be aware that this may be outdated; please consult with EMC support for the latest info.
There is an EMC community thread where people have been leaving their experiences with the upgrade, have a look here: https://community.emc.com/message/969664
Among all the great new features and improvements in vSphere 6.5, some of the ones I am most excited about are the improvements to DRS and HA. So let's zoom into those briefly.
This information comes mostly from VMware pre-sales marketing material and should be considered preliminary. I hope to try out some of these features in our lab once the bits become available.
vCenter Server Appliance (VCSA) now supports a HA mode + Witness.
This appears to be similar in some respects to the NSX Edge HA function. But with one seriously important addition: a witness.
In any high-availability, clustering or other kind of continuous-uptime solution where data integrity or 'state' is important, you need a witness or 'quorum' function to determine which of the 2 HA 'sides' becomes the master of the function, and thus may make authoritative writes to data or configuration. This is important if you encounter a 'split' in your vSphere environment, where both HA members become isolated from each other. The witness helps decide which of the 2 members must 'yield' to the other; I expect the loser turns its function off. The introduction of a witness also helps the metro-cluster design: in case of a metro-cluster network split, the witness now makes sure you cannot get a split-brain vCenter.
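To make the quorum idea concrete, here is a toy sketch of the voting logic. To be clear, this is my illustration of the general principle, not VMware's actual VCHA implementation.

```python
# With three voters (active, passive, witness), at most one network partition
# can hold a strict majority, so at most one side keeps acting as master.
VOTERS = {"active", "passive", "witness"}

def surviving_partition(partitions):
    for members in partitions:
        if len(members & VOTERS) > len(VOTERS) // 2:  # strict majority = 2 of 3
            return members
    return None  # no quorum anywhere: nothing writes, but no split-brain either

# A split isolates the active node from the passive+witness pair:
print(surviving_partition([{"active"}, {"passive", "witness"}]))
# -> {'passive', 'witness'}: that side wins; the isolated active node yields.
```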
The HA function uses its own private network with a dedicated adapter, which is added during configuration. There is a basic config and an advanced option to configure. I assume the latter lets you twiggle the knobs a bit more.
There are some caveats. At release this feature only works if you are using an external Platform Services Controller. So assume this will not work if you run all the vSphere functions inside 1 appliance. At least not at GA.
It should be noted that the new integrated vSphere Update Manager for the VCSA will also fail over as part of this HA feature. Also note that this feature is only available in Enterprise+.
Simplified HA Admission Control
vSphere 6.5 sees some improvements to HA admission control. As with many of the vSphere 6.5 enhancements, the aim here is to simplify or streamline the configuration process.
The various options have now been hidden under a general pulldown menu, combined with the Host Failures Cluster Tolerates number, which now acts as input to whatever mode you select. In some ways this is now more like the vSAN Failures To Tolerate setting. You can, of course, still twiggle the knobs if you so wish.
In addition, the HA config will give you a heads-up if it expects your chosen reservation will potentially impact performance during HA restarts. You are now also able to guard against this by reserving a resource percentage that HA must guarantee during HA restarts. These options give you a lot more flexibility.
Admission control now also listens to the new levels of HA restart priority, where it might not restart the lowest levels if they would violate the constraints. Together, these 2 options give you great new flexibility in controlling HA restarts and the resources they take (or would take).
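As a concrete example of the simplified model, here is roughly how a reservation percentage can be derived from the Host Failures Cluster Tolerates number, assuming equally sized hosts (a sketch of the idea, not VMware's exact code):

```python
# Percentage-based admission control derived from host failures to tolerate:
# reserve the share of the cluster that the failed hosts would have provided.
def reserved_percentage(total_hosts, host_failures_to_tolerate):
    return 100.0 * host_failures_to_tolerate / total_hosts

for hosts in (4, 8, 16):
    print(f"{hosts} hosts, tolerate 1 -> reserve {reserved_percentage(hosts, 1):.2f}%")
# 4 hosts -> 25.00%, 8 -> 12.50%, 16 -> 6.25%. Add or remove hosts and the
# percentage follows automatically, instead of going stale as a hand-set value.
```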
vSphere HA Restart Priorities
At long last, vSphere now supports more than 3 priority levels. This adds a lot more flexibility to your HA design. In our own designs, we already assigned infrastructure components to the previous 'high' level, customer production workloads to 'medium' and everything else to 'low'. What I was missing at the time was the ability to differentiate between the infra components. For example, I would want Active Directory to start -before- many other infra services that rely on AD authentication. Syslogging is another service you want to get back up as soon as possible. And of course vCenter should ideally come back before many other VMware products that rely on it. This also allows you to make some smart sequencing decisions in regard to NSX components: I would restart the NSX Controllers and the DLR and tenant Edge routers first, for example. I am sure you can think of your own favorite examples.
As mentioned previously, these new expanded restart levels go hand-in-hand with the new admission control options.
vSphere HA Orchestrated Restart
This is another option that I have wanted to see for a very long time. I have seen many an HA failover in my time, and the most time is always spent afterwards by the application owners, putting the pieces back together again because things came up in the wrong order.
vSphere Orchestrated Restart allows you to create VM dependency rules, that will allow a HA Failover to restart the VMs in the order that best serves the application. This is similar to the rule sets we know from SRM.
Naturally you will need to engage your application teams to determine these rules. I do wonder about the limits here. In some of the environments we manage, there could potentially be hundreds of these kinds of rules. But you don’t want to make it too hard for HA to calculate all this, right?
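Dependency rules like these essentially describe a directed graph, and deriving a valid restart order from them is a topological sort. A tiny sketch with made-up VM names:

```python
# VM dependency rules -> restart order, via a topological sort.
from graphlib import TopologicalSorter  # Python 3.9+

# "X: {Y, ...}" means X must be restarted after everything in its set.
deps = {
    "app-server": {"database", "active-directory"},
    "database":   {"active-directory"},
    "web-tier":   {"app-server"},
}
print(list(TopologicalSorter(deps).static_order()))
# e.g. ['active-directory', 'database', 'app-server', 'web-tier']
```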
This is a 'new' feature insofar as it is a new, deeper level of integration natively in vCenter, and can leverage the new 'quarantine mode' for ESXi hosts. Similar behavior has been a feature of the Dell Management Plug-in for vCenter for years, for example, where a 'maintenance mode' action was triggered as a script action from a vCenter alert. By leveraging 'quarantine mode', new modes of conduct are enabled for dealing with partially failed hosts, for example proactively migrating off VMs based on specific failure rules, instead of an all-or-nothing approach.
For years we have only ever had 2 possible host states: Maintenance and.. well, not in maintenance 🙂
Quarantine Mode is the new middle ground. It can be leveraged tightly with the new Proactive HA feature mentioned above and integrates with DRS, but is above all just a useful mode to employ operationally.
The most important thing to bear in mind is that quarantine mode does not by default guarantee that VMs cannot or will not land on the host. An ESXi host in quarantine can and will still be used to satisfy VM demand where needed; think of reservations and HA failover. DRS, however, will try to avoid placing VMs on this host if possible.
Operationally, this is very similar to what we already do in many 'soft' failure scenarios for hosts: we put DRS in semi-automatic mode and slowly start to evacuate the host, usually ending up putting it in maintenance mode at the end of the day.
DRS Policy Enhancements
Again more streamlining. For us vSphere admins with a case of OCD, the new ‘even distribution’ model is quite relaxing. VMware describes this, endearingly, as the ‘peanut butter’ model. Personally I will refer to it as the Nutella model, because Nutella is delicious!
This of course refers to the ‘even spread’ of VMs across all hosts in your cluster.
This, and the other options added to DRS, are interesting from both a performance and a risk point of view. You avoid the 'all your eggs in one basket' issue, for example. Naturally, the CPU over-commitment setting is especially interesting in VDI environments, or any other deployment that would benefit from consistently good CPU response.
DRS will now also attempt to balance load based on the network saturation level of a host, in addition to CPU and RAM. However, it will prioritize CPU and RAM above all else; the network-aware balancing is on a best-effort basis, so no guarantees.