Archive for the ‘Troubleshooting’ Category

Solaris 11 on ESX – Serialized Disk IO bug causes extreme performance degradation #vexpert

Wednesday, March 29th, 2017

In this post, I discuss a newly found performance bug in Solaris 11 that has, ever since Solaris 11 came out in 2011, severely hampered ESX VM disk I/O performance when using the LSI Logic SAS controller. I show how we identified the issue, what tools were used, and what the bug actually is.

In Short:

A bug in the 'mpt_sas' disk controller driver in Solaris 11, which drives the VMware virtual machine's 'LSI Logic SAS' emulated controller, was causing disk I/O to be handled only up to 3 I/Os at a time.

This causes severe disk I/O performance degradation on all versions of Solaris 11 up to the patched version. It was observed on Solaris 11 VMs on vSphere 5.5u2, but has not been tested on any other vSphere version.

The issue was identified by myself and Valentin Bondzio of VMware GSS, together with our customer, and eventually Oracle. Tools used: iostat, esxtop, and vscsiStats (a detection sketch follows after the links below).

The issue was patched in patch# 25485763 for Solaris 11.3.17.5.0, and in Solaris 12.

Bug report (Bug 24764515: Tagged command queuing disabled for SCSI-2 and SPC targets), mirrored from the Oracle-internal report: https://pastebin.com/DhAgVp7s

KB article (Solaris 11 guest on VMware ESXi submit only one disk I/O at a time, Doc ID 2238101.1), mirrored from the Oracle-internal article: https://pastebin.com/hwhwiLRM
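For reference, this is roughly how the serialization was made visible from both sides. A minimal sketch, not a verbatim transcript: the VM's world ID is a placeholder, and the interpretation comments are mine.

————————
# --- Inside the Solaris 11 guest: watch the device queues.
# With the bug, 'actv' (commands active on the device) stays pinned at just
# a few even under heavy parallel load: tagged command queuing is not used.
iostat -xnz 1

# --- On the ESXi host: histogram of outstanding I/Os per virtual disk.
vscsiStats -l                                   # find the VM's worldGroupID
vscsiStats -s -w <worldGroupID>                 # start collecting
vscsiStats -p outstandingIOs -w <worldGroupID>  # print the histogram
vscsiStats -x -w <worldGroupID>                 # stop collecting
————————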

————————


Using vmkramdisk to fix rare HA-FDM Agent tardisks vmware_f.v00 uninstall issue

Tuesday, September 6th, 2016

On ESX 5.5U3, I recently ran into an annoying issue with HA. vSphere had recently been updated, but not all hosts had yet received the very latest version of the FDM (Fault Domain Manager, aka 'HA') agent.
During some routine maintenance work, a particular host was taken in and out of maintenance mode a few times. Eventually, it was observed to no longer properly complete its HA configuration. Checking the host status in the UI, it would seemingly get stuck in the install phase of the latest FDM agent.

Checking the FDM installer log ( /var/run/log/fdm-installer.log ) , I found the following:

—————————————————————-
fdm-installer: [40283] 2016-08-25 11:16:13: Logging to /var/run/log/fdm-installer.log
fdm-installer: [40283] 2016-08-25 11:16:13: extracting vpx-upgrade-installer/VMware-fdm-eesx-2-linux-4180647.tar
[40283] 2016-08-25 11:16:13: exec rm -f /tmp/vmware-root/ha-agentmgr/upgrade
[40283] 2016-08-25 11:16:13: status = 0
[40283] 2016-08-25 11:16:13: exec cd /tmp/vmware-root/ha-agentmgr/vpx-upgrade-installer
[40283] 2016-08-25 11:16:13: status = 0
fdm-installer: [40283] 2016-08-25 11:16:13: Installing the VIB
fdm-installer: [40283] 2016-08-25 11:16:18: Result of esxcli software vib install -v=/tmp/vmware-root/ha-agentm
fdm-installer: Error in running rm /tardisks/vmware_f.v00:
fdm-installer: Return code: 1
fdm-installer: Output: rm: can't remove '/tardisks/vmware_f.v00': No such file or directory
fdm-installer:
fdm-installer: It is not safe to continue. Please reboot the host immediately to discard the unfinished update.
fdm-installer: Please refer to the log file for more details.
fdm-installer: [40283] 2016-08-25 11:16:18: There is a problem in installing fdm vib. Remove the vib…
[40283] 2016-08-25 11:16:18: exec esxcli software vib remove -n=vmware-fdm.vib
[NoMatchError]
No VIB matching VIB search specification ‘vmware-fdm.vib’.
Please refer to the log file for more details.
[40283] 2016-08-25 11:16:19: status = 1
fdm-installer: [40283] 2016-08-25 11:16:19: Unable to install HA bundle because esxcli install return 1

—————————————————————-

This was decidedly odd. I checked the /tardisks mount and could indeed not find any vmware_f.v00 file. The installer was trying to 'remove' (unmount, as it turns out) a file that did not exist, and this was breaking the uninstall process.

This page was useful in understanding the sequence of events: http://vm-facts.com/main/2016/01/23/vmware-ha-upgrade-agent-issue-troubleshooting/

I can only speculate as to what happened: at some point in the sequence of taking the host in and out of maintenance mode, the FDM uninstall somehow failed to complete properly and left the host image list in a strange, invalid state.

Querying the host in this state, it listed the old FDM agent as still installed:

————-
# esxcli software vib list | grep -i fdm
vmware-fdm                     5.5.0-3252642                       VMware  VMwareCertified   2016-02-03
————-

Yet a forced uninstall of the VIB would fail in the same way:

————————
fdm-uninstaller: [] 2016-08-24 11:42:30: exec /sbin/esxcli software vib remove -n=vmware-fdm
Removal Result
Message: Operation finished successfully.
Reboot Required: false
VIBs Installed:
VIBs Removed: VMware_bootbank_vmware-fdm_5.5.0-3252642
VIBs Skipped:
fdm-uninstaller: [] 2016-08-24 11:43:58: status = 1
fdm-uninstaller: [] 2016-08-24 11:43:58: exec /sbin/chkconfig --del vmware-fdm
———————-

Together with VMware support, we tried various tricks, including copying a fresh imgdb.tgz from a different host to /bootbank, and running the latest installer and uninstaller of the FDM agent manually.
By the way, the source that vCenter uses for the FDM agent installer and uninstaller is (on Windows) "Program Files\VMware\Infrastructure\VirtualCenter Server\upgrade".

If you wish to try running these files directly on an ESX host, simply copy them to /tmp on the host and chmod them to 777; they are then executable.
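For example, something like this (a sketch; <fdm-installer> stands in for the actual filename in your upgrade directory):

————————
# Copy the FDM installer or uninstaller script from the vCenter 'upgrade'
# directory to the host, make it executable, and run it:
scp <fdm-installer> root@<esx-host>:/tmp/
chmod 777 /tmp/<fdm-installer>
/tmp/<fdm-installer>
————————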

But in all cases, the FDM installer will always first try to uninstall the previous version, which always includes trying to unmount /tardisks/vmware_f.v00.

Now, /tardisks is a bit of a strange bird and deserves some explanation. This VMware research paper turned out to be an excellent document for understanding what /tardisks actually is and does: https://labs.vmware.com/vmtj/visorfs-a-special-purpose-file-system-for-efficient-handling-of-system-images

In short, it is a directory that hosts mounted TAR files, which are loaded at boot time from /bootbank (or /altbootbank). These TAR files are mounted as live filesystems using what VMware calls VisorFS, which makes them behave as part of the regular file system. This has various administrative and management advantages, as the paper linked above explains.
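As an aside, you can see which archives a host will mount as tardisks at its next boot by checking the bootloader configuration (assuming the standard bootbank layout):

————————
# Each entry on the 'modules' line of the bootloader config becomes a
# tardisk at boot:
grep ^modules /bootbank/boot.cfg
————————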

It is therefore not possible to simply copy a missing file to /tardisks in order to force the FDM uninstaller to properly complete.

You can list which TAR filesystems ESX has mounted by running the command esxcli system visorfs tardisk list. This list will be the same as the file list of /tardisks.

Of note: when you re-install FDM, just after the install, the ‘system’ flag will be set to false, until you reboot. After a reboot, it will be set to true like all other modules.

On a normal host, you will find the FDM VIB listed here.

In our case, this entry was missing, even though the VIB list command showed it as installed.

So it seemed to me that if ESX needed to mount these TAR files at boot time, there was probably a command it used to do this.
In any case, I found it likely that such a command should exist, if only for troubleshooting purposes.
I figured that if I could mount this TAR manually, the uninstaller might proceed normally.
A few minutes of google-fu later, I stumbled on this:
Creating and Mounting VIB files in ESXi

Now the VMware engineer noted that the vmkramdisk command has been deprecated since 4.1, but to both our surprise (and delight) it was still there in 5.5, and still did its job.

We manually mounted /bootbank/vmware_f.v00 using the command vmkramdisk /bootbank/vmware_f.v00

Immediately, vmware_f.v00 showed up under /tardisks and in the output of esxcli system visorfs tardisk list.
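Put together, the whole workaround was roughly this:

————————
# Mount the missing archive by hand (vmkramdisk is deprecated, but still
# present and functional on 5.5):
vmkramdisk /bootbank/vmware_f.v00

# Verify it now shows up as a mounted tardisk:
ls /tardisks
esxcli system visorfs tardisk list

# With the tardisk back in place, the FDM uninstall can finally succeed:
esxcli software vib remove -n vmware-fdm
————————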

And as predicted, the installer passed through the uninstall this time without a hitch, and then installed the new version of the HA agent. We rebooted the host just to be sure it would properly load the new VIB each time. It did, and HA initialized in the cluster without any issues thereafter.


Hidden in the release notes: VMware Engineering PR Numbers

Thursday, February 4th, 2016

We all know about the VMware case numbers. Each SR you open gets a nice number.

Internally, VMware has a problem database. Newly found bugs end up in there. And if you spend a lot of time with VMware support, you will end up hearing a lot about these internal PRs (problem reports).

Here is a cool fact you may not know: hidden in the HTML source of the public release notes that VMware produces are the actual PR numbers associated with each issue described as fixed (or not fixed).

Take the NSX 6.2.0 release notes for example: https://www.vmware.com/support/nsx/doc/releasenotes_nsx_vsphere_620.html

View the page source, and if you scroll down to the fixed issues, you will find DOCNOTE entries in the markup. It's those DOCNOTE numbers that are the actual PR numbers. Sometimes the public KB number is listed alongside them too, but there are far more internal PR numbers than there are public KB equivalents.
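If you want to pull these out in bulk, something along these lines should work; the exact markup pattern is an assumption on my part, so eyeball the raw HTML first:

————————
# Extract the hidden DOCNOTE/PR markers from the release notes' HTML source
# (the grep pattern is a guess at the markup; adjust to what you actually see):
curl -s https://www.vmware.com/support/nsx/doc/releasenotes_nsx_vsphere_620.html \
  | grep -oE 'DOCNOTE ?[0-9]+' | sort -u
————————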

So how can this help you?

Well, for one thing, you can start asking intelligent questions of VMware support, like: 'Has something like my issue been reported before in the PR database?' (prompting the engineer to go look for it, which they don't always do of their own accord 😉).
Or you can use it as a validation check. If your issue is scheduled to be fixed in an upcoming patch, ask the support engineer for the associated PR numbers! That way, you can verify for yourself in the release notes whether the fix was included.
The process of getting a new patch or update through QA is quite involved, and sometimes fixes fall by the wayside. This is not immediately known to everyone inside VMware, so it's always worth checking yourself; trust, but verify.


Lenovo T500 won't PXE boot

Friday, May 7th, 2010

One of our batch of Lenovo T500 laptops refuses to boot from our Windows Deployment PXE server,
so it's not easy to get our default company image onto it. It's hard to explain to Lenovo support what is wrong, because it's so very specific: all other network functions are just fine. My colleague ran a trace on the server side, and the laptop doesn't even seem to contact it.

If I had to guess, I would say it's a network driver issue and this laptop has a very specific revision of the network board... or something.

It seems I am not the only one running into PXE boot issues with a Lenovo.

I don't really have that much time to dig deeply into it. I would like to see what the conversation is between our WDS server, the DHCP server, and the laptop; something weird is going on there.
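If I do find the time, a capture along these lines on the WDS server (or a mirror port) should at least show whether the laptop's requests arrive at all; the interface name is a placeholder:

————————
# PXE traffic lives on DHCP (ports 67/68), TFTP (69) and the WDS PXE
# provider (4011); -n skips name resolution, -i picks the capture interface:
tcpdump -n -i eth0 port 67 or port 68 or port 69 or port 4011
————————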

“Multiple Connections” error when importing a Physical machine into VMware

Wednesday, November 4th, 2009


As many before us, we ran into the following error in Virtual Center, when we tried to P-to-V a server:

“multiple connections to a server or shared resource by the same user”

This is not an uncommon error, and you might recognise it from other scenarios that involve remotely connecting to a Windows host. 
I found quite a few posts that mentioned this problem, even a mention in the release notes of VMware Converter itself, and a VMware knowledge base article, and they go something like this:

An Inter-Process Communication (IPC) named-pipe network connection is already open from the local Windows Redirector to the remote host with different credentials than you are specifying in Converter. The Windows Redirector does not allow IPC connections with multiple different credentials from the same user session. This restriction also applies to mapped network drives as they are also a type of named-pipe connection.

To ensure the Converter agent connection succeeds, perform the following actions on the computer running Converter:

  1. Close any application views or Windows Explorer windows showing files, ActiveX components, or Microsoft Management Console (MMC) snap-ins from the server you are trying to convert.
  2. Open a command prompt. For more information, see Opening a command or shell prompt (1003892).
  3. Type net use \\<remote_hostname>\* /delete and press Enter.

    Note: This disconnects any mapped drives to the remote host.

  4. Check My Computer for any mapped network drives to the remote host and disconnect them.
  5. Log off the server running Converter and log on again. This disconnects any open IPC named-pipe connections established by remaining applications.
  6. If the problem persists, restart the server running Converter.
  7. If the problem still persists, and you are using the VirtualCenter Converter plug-in, restart the VirtualCenter server.

So we tried all of the above, but to no avail. Try as we might, from both our admin workstations as well as on the Virtual Center server itself, we could not get it to run.

In the end, my colleague tried to run the task using the IP address of the server, instead of its hostname!

That did the trick! But don't ask me why! I suspect it has something to do with the named pipe actually being named differently when you do this.
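For what it's worth, here is how you could check for (and clear) the conflicting session yourself before falling back to the IP-address trick. A hedged sketch: SOURCEHOST is a placeholder, and the explanation in the comments is my own suspicion, not confirmed by VMware:

————————
rem List current SMB sessions; look for an entry to the source host:
net use

rem Drop the conflicting IPC$ session by name. Connecting by IP afterwards
rem presumably works because the Redirector keys sessions on the server
rem name, so the IP address counts as a different server:
net use \\SOURCEHOST\IPC$ /delete
————————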