Migrated to #vSphere 6.5 into an unsupported #SSO topology? – This is how we got out of it! #vExpert
We recently migrated a number of vCenter installs from 5.5 on Windows to the 6.5U1 VCSA.
These installs were in pairs: 1 vCenter per datacenter/MER, each with its own separate SSO install (on the same server).
At the time, we did not believe that these SSO installations had any relationship with each other. We thought this because the vCenters were not in linked mode with each other.
Unless you specifically check the SSO config, you cannot easily notice whether they are, in fact, both in the same SSO domain. It turned out, of course, that this was the case.
When you have 2 embedded SSO installs on 2 Windows servers, and they are both in the same domain, this is a supported topology in vSphere 5.5, but crucially, it is not supported in vSphere 6.5U1! In other words, if you have 2 VCSAs, their embedded SSOs may not point to each other in 6.5. Instead, you are to use external, dedicated PSC appliances in this case.
You can find an overview of the supported topologies here: https://kb.vmware.com/s/article/2147672
Ironically enough, embedded linked SSO instances are once again supported... in 6.7. For more information on that, see this excellent post: https://virtualtassie.com/2018/vcenter-6-7-embedded-linked-mode/
What happened with us?
We ourselves did not have the wherewithal to manually check the SSO config prior to migration. We simply assumed that because we didn't see Linked Mode, the installs were separate on all levels. If we had been more careful, we could have used the ssolscli.cmd tool to check the current SSO config. This is detailed on William Lam's blog here
When we migrated the first vCenter, the pre-migration checks done by the assistant did not alert us to the fact that SSO was linked to another SSO. There is nothing in the Migration Assistant or the Migration Tool that shows you the topology it has found in place. There appears to be no check against supported or unsupported configurations.
At some point, the Migration Tool makes an interesting decision. Having found SSO in a multi-site, single-domain setup, it seemingly decided to turn on Enhanced Linked Mode between the 2 vCenters that were both members of the domain, even though there had been no linked mode between them previously.
Once we had migrated the first vCenter (vCenter1), the new 6.5 Web Client UI started complaining that it was missing the other vCenter (vCenter2). Once the other vCenter had also been migrated, they both, very neatly, appeared to be in a Linked Mode relationship with each other.
This was unexpected behavior, and various folks at VMware have expressed their surprise to me that the migration tool did this.
How did we get out of it and return to a supported config?
We raised an SR with GSS, but it quickly became clear they would not be able to help us, as the topology was unsupported (and they did not have a procedure in place to fix this).
The best they could do was refer us to a post by William Lam that describes how to split up SSO when you have external PSCs. We had a look at this procedure, but it did not work for us, as we could not get the cmsso-util tool to work. It would report: 'Could not find a host id which maps Hostname to in Component Manager Failed!!!'
The procedure is also different because we had embedded PSCs. That means the step would involve taking the complete VCSA off the network, one after the other.
We needed to accomplish 2 things: A. Break the SSO link so that we have 2 separate, independent SSO instances again. B. Break the linked mode between the vCenters. All this while making sure that the vCenters were still locally pointing to their own embedded SSO.
So we ended up executing the following steps.
The following steps are NOT supported by VMware. You do this at your own risk. Make sure you have backups and/or snapshots in place. Each vCenter will have to be off the network briefly, so take that into account. Also, you will be left with some loose ends that you may not like; see the bottom of this post.
1. Double-check the current SSO config using the dir-cli command. You can clearly see that 2 separate SSO nodes are listed. This view should be identical on both VCSAs, as they both have the same view of the SSO domain (aka the 'federation').
Another way to check this is with the vdcrepadmin tool. It will show you the SSO replication partners. (also see: https://kb.vmware.com/s/article/2127057 )
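To make the replication check less of an eyeball exercise, here is a minimal shell sketch that counts the partners in captured showpartners output. It is not part of the procedure described in this post. It assumes the tool prints one `ldap://<hostname>` line per replication partner, as the KB article shows, and the function name `count_partners` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count replication partners in vdcrepadmin output.
# Assumption: each partner appears on its own line as ldap://<hostname>.
count_partners() {
    # Reads the captured output on stdin and prints the number of partner lines.
    # grep -c exits non-zero when the count is 0, so mask that exit status.
    grep -c 'ldap://' || true
}
```

You could pipe the output of `/usr/lib/vmware-vmdir/bin/vdcrepadmin -f showpartners -h localhost -u administrator -w '<password>'` (syntax as given in the KB) into `count_partners`. With two linked nodes you would expect 1 on each VCSA, and 0 once the federation has been split.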
2. So this is where it gets interesting. You will now take one of your 2 vCenters completely off the network. The idea is to make one PSC node unable to talk to the other, so that we can force-break the federation between the two SSO nodes. I suppose you could also accomplish this with a clever firewall rule. But for all practical purposes, it's probably easier to just disconnect the entire VCSA from the network, briefly. That is what we did.
- So you take vCenter2 off the network, then force-break SSO on vCenter1
- Then reconnect vCenter2, take vCenter1 off the network, and force-break SSO on vCenter2
Get it? 🙂
Double-check that you really can't reach the vCenter when you take it offline.
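One way to make that check repeatable is a small shell guard, run from the vCenter that stays online. This is a minimal sketch and not part of the procedure described in this post: the function name `peer_is_unreachable` is made up, it assumes bash and the coreutils `timeout` command are available, and probing LDAP port 389 is an assumption about which port matters for the SSO directory.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if a TCP connect to host:port FAILS.
# Assumptions: bash (for /dev/tcp) and coreutils `timeout` are installed;
# port 389 (LDAP) is a sensible default to probe for the SSO directory.
peer_is_unreachable() {
    local host="$1" port="${2:-389}"
    # Try to open a TCP connection, giving up after 3 seconds.
    if timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
        return 1   # connection succeeded: the peer is still reachable
    fi
    return 0       # refused, unresolvable, or timed out: treat as unreachable
}
```

For example, `peer_is_unreachable <vcenter2 name> && echo "vCenter2 is isolated"` prints the message only when the connection attempt fails. A single port is not proof of full isolation, so treat a success here as a sanity check rather than a guarantee.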
3. Now, on the vCenter that is still online, we shall break the SSO federation using the vdcleavefed command.
If it is useful to you, the manual for vdcleavefed is here
Another example of where this command has been used by others for similar work is here and here
As with some of those other people, the reason we are using vdcleavefed and not cmsso-util is that we, too, would get the error "Could not find a host id which maps Hostname to in Component Manager Failed!!!"
On vCenter1, we will now force-break its relationship with vCenter2
/usr/lib/vmware-vmdir/bin/vdcleavefed -h <vcenter2 name> -u administrator
Run like this, the command will explicitly ask for the password. The user is the default SSO admin user, administrator@vsphere.local
When we tried passing the password on the command line, for some reason it didn't like it.
4. Now take vCenter1 off the network, and reconnect vCenter2
Now you do the same procedure on vCenter2. Again, make sure that it can't talk to vCenter1 before you run the command.
5. After you have run the command on vCenter2, you can connect vCenter1 back to the network. Now you can use the dir-cli command to check the relationship again. You should now see only 1 entry on each VCSA: itself. Or rather, its own local PSC/SSO instance.
6. Now that SSO has been severed in twain, we need to cut up the Linked Mode association between the 2 vCenters. Or rather, on each vCenter, we shall unregister the other. We do this in the Lookup Service registrations on each VCSA.
For this we shall use the Lookup Service tool (lstool).
We use the following command to export the current vCenter registrations to a file (or just the console output, as you prefer). We need this to determine which service ID the local Lookup Service has assigned to the linked vCenter.
This ID is unique and local to every Lookup Service instance. So again, we will need to do this procedure separately on both vCenters. The VCSAs can remain on the network for this.
/usr/lib/vmidentity/tools/scripts/lstool.py list --url http://localhost:7080/lookupservice/sdk --type vcenterserver > /tmp/inventory_service_01.txt
The output of this file will look as follows:
On the vcenter2, you will find a similar entry, for vcenter1:
Make a note of the Service ID of the entry that belongs to the other vCenter.
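If the export is long, a small filter can help pick out the right entry. The following is a minimal sketch under a loud assumption: that every record in the listing has a `Service ID:` line that comes before its endpoint `URL:` lines. Check that against your own /tmp/inventory_service_01.txt before trusting it. The function name `service_ids_for_host` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the Service ID(s) whose endpoint URL points at a
# given host. Assumption: each record has a "Service ID:" line that precedes
# its "URL:" lines. Verify this layout against your own export first.
service_ids_for_host() {
    # $1 = file produced by the list command, $2 = hostname of the OTHER vCenter
    awk -v host="$2" '
        BEGIN { needle = "//" tolower(host) }
        /Service ID:/ { id = $0; sub(/^.*Service ID:[ \t]*/, "", id); next }
        /URL:/ {
            line = tolower($0)
            pos = index(line, needle)
            if (pos == 0 || id == "") next
            # The character after the hostname must end it (/, : or end of
            # line), so "vcenter2" does not also match "vcenter20".
            tail = substr(line, pos + length(needle), 1)
            if ((tail == "/" || tail == ":" || tail == "") && !(id in seen)) {
                seen[id] = 1
                print id
            }
        }
    ' "$1"
}
```

Calling `service_ids_for_host /tmp/inventory_service_01.txt <vcenter2 name>` on vCenter1 should then print only the ID(s) registered for vCenter2. Compare the result by eye with the file before using it in the next step.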
7. Then we shall use the following command to unregister this service from the Lookup Service.
--==Be very careful here that you don't choose the incorrect service ID. You don't want to accidentally unregister the LOCAL vCenter.==--
/usr/lib/vmidentity/tools/scripts/lstool.py unregister --url http://localhost:7080/lookupservice/sdk --id 374426C9-5B71-4140-AD75-5DFAD128BC27 --user "administrator@vsphere.local" --password '<password in single quotes>' --no-check-cert
Note: password here should be put inside single quotes '_'
The output of the lstool.py command won't help you much here. All it prints to the screen is the restart of the Inventory Service, so you can't tell from the output whether the unregister was successful. To determine this, run the 'list' command again (step 6).
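That re-check can be reduced to a yes/no answer with a few lines of shell. This is a minimal sketch with a made-up function name, `id_is_gone`; it assumes you have exported a fresh listing to a new file with the same list command as in step 6.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a Service ID no longer appears in a listing.
# Assumption: $1 is a FRESH export made after the unregister command ran.
id_is_gone() {
    # $1 = fresh listing file, $2 = Service ID that was unregistered
    [ -r "$1" ] || return 2          # unreadable file: report an error, not success
    if grep -Fqi -- "$2" "$1"; then
        return 1                     # the ID is still registered
    fi
    return 0                         # the ID was not found
}
```

For example, `id_is_gone /tmp/inventory_service_02.txt 374426C9-5B71-4140-AD75-5DFAD128BC27 && echo "unregistered"`, where the second file name is only an example.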
8. Reboot both VCSAs
You should now find that both vCenters no longer have linked mode registrations to their counterparts. Linked mode is now broken. On the linked mode tab, the other vCenter has vanished, and they no longer appear in each other's inventory tree.
9. And that is it!
But this procedure will leave you with at least 1 issue: invalid License Manager asset references.
This is more an inconvenience than anything else. We chose not to fix this at this time, that is why I don't currently have a procedure to fix it. Perhaps someone else made one? But I may dedicate a future post to this.
Don't get into this situation in the first place. Make sure you validate and check all aspects of your 5.5 environment before upgrading, and this includes checking the SSO topology.
Having said that, I think it is a shame that the 6.5 Migration Tool and Migration Assistant do no pre-check on the current topology and no validation of whether it's a supported upgrade path. Again, check this yourself against this page: https://kb.vmware.com/s/article/2147672
Having discussed this with several VMware folks, they seem surprised by this default behavior of the migration tool. Some questions are being asked internally, and perhaps more insight will emerge soon. If it does, I will update this post.