On ESX 5.5U3, I recently ran into an annoying issue with HA. vSphere had recently been updated, but the hosts had not yet all received the latest version of the FDM (fault domain manager, aka ‘HA’) agent.
During some routine maintenance work, a particular host was taken in and out of maintenance mode a few times. Eventually it no longer properly completed HA configuration. Checking the host status in the UI, it appeared to get stuck in the install phase of the latest FDM agent.
Checking the FDM installer log (/var/run/log/fdm-installer.log), I found the following:
fdm-installer:  2016-08-25 11:16:13: Logging to /var/run/log/fdm-installer.log
fdm-installer:  2016-08-25 11:16:13: extracting vpx-upgrade-installer/VMware-fdm-eesx-2-linux-4180647.tar
 2016-08-25 11:16:13: exec rm -f /tmp/vmware-root/ha-agentmgr/upgrade
 2016-08-25 11:16:13: status = 0
 2016-08-25 11:16:13: exec cd /tmp/vmware-root/ha-agentmgr/vpx-upgrade-installer
 2016-08-25 11:16:13: status = 0
fdm-installer:  2016-08-25 11:16:13: Installing the VIB
fdm-installer:  2016-08-25 11:16:18: Result of esxcli software vib install -v=/tmp/vmware-root/ha-agentm
fdm-installer: Error in running rm /tardisks/vmware_f.v00:
fdm-installer: Return code: 1
fdm-installer: Output: rm: can't remove '/tardisks/vmware_f.v00': No such file or directory
fdm-installer: It is not safe to continue. Please reboot the host immediately to discard the unfinished update.
fdm-installer: Please refer to the log file for more details.
fdm-installer:  2016-08-25 11:16:18: There is a problem in installing fdm vib. Remove the vib…
 2016-08-25 11:16:18: exec esxcli software vib remove -n=vmware-fdm.vib
No VIB matching VIB search specification 'vmware-fdm.vib'.
Please refer to the log file for more details.
 2016-08-25 11:16:19: status = 1
fdm-installer:  2016-08-25 11:16:19: Unable to install HA bundle because esxcli install return 1
This was decidedly odd. I checked the /tardisks mount and could indeed not find any vmware_f.v00 file. The installer was trying to ‘remove’ (unmount, as it turns out) a file that did not exist, and this was breaking the uninstall process.
This page was useful in understanding the sequence of events: http://vm-facts.com/main/2016/01/23/vmware-ha-upgrade-agent-issue-troubleshooting/
I can only speculate as to what happened: at some point during the cycles in and out of maintenance mode, the FDM uninstall apparently failed to complete properly, leaving the host image list in a strange, invalid state.
Querying the host in this state, it listed the old FDM agent as still installed:
# esxcli software vib list | grep -i fdm
vmware-fdm 5.5.0-3252642 VMware VMwareCertified 2016-02-03
Yet a force uninstall of the VIB would fail with the same error.
fdm-uninstaller:  2016-08-24 11:42:30: exec /sbin/esxcli software vib remove -n=vmware-fdm
Message: Operation finished successfully.
Reboot Required: false
VIBs Removed: VMware_bootbank_vmware-fdm_5.5.0-3252642
fdm-uninstaller:  2016-08-24 11:43:58: status = 1
fdm-uninstaller:  2016-08-24 11:43:58: exec /sbin/chkconfig --del vmware-fdm
Together with VMware support, we tried various tricks, including copying a fresh imgdb.tgz from a different host to /bootbank , and running the latest installer and uninstaller of the FDM agent manually.
By the way, the source that vCenter uses for the FDM agent installer and uninstaller is (on a Windows vCenter) Program Files\VMware\Infrastructure\VirtualCenter Server\upgrade
If you wish to run these files directly on an ESX host, simply copy them to /tmp on the host and chmod them to 777. They are then executable.
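As a sketch of that manual sequence (the script names here are placeholders; use the actual file names found in the vCenter upgrade directory for your build):

```shell
# On the vCenter server, copy the installer/uninstaller scripts to the host.
# Script names below are hypothetical; take the real ones from
# "Program Files\VMware\Infrastructure\VirtualCenter Server\upgrade".
scp VMware-fdm-installer.sh root@esx-host:/tmp/

# On the ESXi host, make the script executable and run it
chmod 777 /tmp/VMware-fdm-installer.sh
/tmp/VMware-fdm-installer.sh
```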
But in all cases, the FDM installer will always first try to uninstall the previous version, which always includes trying to unmount /tardisks/vmware_f.v00
Now /tardisks is a bit of a strange bird, and deserves some explanation. This VMware research paper turned out to be an excellent resource for understanding what /tardisks actually is and does: https://labs.vmware.com/vmtj/visorfs-a-special-purpose-file-system-for-efficient-handling-of-system-images
In short, it is a directory that hosts mounted TAR files, which are loaded at boot time from /bootbank (or /altbootbank). These TAR files are mounted as live filesystems using what VMware calls VisorFS, which makes them behave as part of the regular file system. This has various administrative and management advantages, as the paper linked above explains.
It is therefore not possible to simply copy a missing file to /tardisks in order to force the FDM uninstaller to properly complete.
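To illustrate (a sketch; the exact error text may vary by build), writes into /tardisks fail because the directory is a VisorFS view of the mounted tardisks, not ordinary writable storage:

```shell
# /tardisks is a VisorFS namespace backed by mounted tardisks;
# you cannot create arbitrary files in it
touch /tardisks/vmware_f.v00
# -> fails; the only way to make a file appear here is to mount
#    the corresponding tardisk from /bootbank
```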
You can list which TAR filesystems ESX has mounted by running the command esxcli system visorfs tardisk list. This list will match the file listing of /tardisks.
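For example, on the ESXi shell the two views can be compared side by side:

```shell
# Tardisks as VisorFS sees them (name, size, system flag, source)
esxcli system visorfs tardisk list

# The same set of modules, exposed as files
ls /tardisks
```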
Of note: when you re-install FDM, the ‘system’ flag will be set to false just after the install, until you reboot. After a reboot, it will be set to true like all other modules.
On a normal host, you will find the FDM VIB listed here.
In our case, this entry was missing, even though the Vib list command showed it as installed.
So it seemed to me that if ESX needed to mount these TAR files at boot time, there was probably a command it used to do so. In any case, it seemed likely that such a command existed, if only for troubleshooting purposes. I wondered whether, if I could mount this TAR manually, the uninstaller might proceed normally.
A few minutes of google-fu later, I stumbled on this:
Creating and Mounting VIB files in ESXi
Now the VMware engineer noted that the vmkramdisk command had been deprecated since 4.1, but to our mutual surprise (and delight) it was still there in 5.5, and still did its job.
We manually mounted /bootbank/vmware_f.v00 using the command vmkramdisk /bootbank/vmware_f.v00
Immediately, vmware_f.v00 appeared under /tardisks and in the output of esxcli system visorfs tardisk list
And as predicted, the installer passed through the uninstall this time without a hitch, and then installed the new version of the HA agent. We rebooted the host just to be sure it would properly load the new VIB each time. It did, and the host initiated HA in the cluster without any issues thereafter.