Solaris 11 on ESX – Serialized Disk IO bug causes extreme performance degradation #vexpert

In this post, I discuss a newly discovered performance bug in Solaris 11 that, ever since Solaris 11 came out in 2011, has severely hampered ESX VM disk I/O performance when using the LSI Logic SAS controller. I show how we identified the issue, which tools we used, and what the bug actually is.

In Short:

A bug in the 'mpt_sas' disk controller driver in Solaris 11, the driver used for VMware's emulated 'LSI Logic SAS' virtual controller, caused disk I/O to be handled only up to 3 I/Os at a time.

This causes severe disk I/O performance degradation on all versions of Solaris 11 up to the patched version. It was observed on Solaris 11 VMs on vSphere 5.5u2, but has not been tested on any other vSphere version.

The issue was identified by myself and Valentin Bondzio of VMware GSS, together with our customer, and eventually Oracle. Tools used: iostat, esxtop, vscsiStats

The issue was patched in patch# 25485763 for Solaris 11.3.17.5.0, and in Solaris 12.

Bug Report (Bug 24764515 : Tagged command queuing disabled for SCSI-2 and SPC targets): https://pastebin.com/DhAgVp7s (the original bug entry is Oracle-internal)

KB Article (Solaris 11 guest on VMware ESXI submit only one disk I/O at a time, Doc ID 2238101.1): https://pastebin.com/hwhwiLRM (the original article sits behind the Oracle support portal)

------------------------

The full story below:

Very rarely do most IT professionals get to discover truly new bugs. Rarer still is doing so in an operating system they are extremely unfamiliar with. Because surely, as the thinking goes, someone else must have found 'this' before, right? Considering the complexity and sheer diversity of the IT landscape before us today, I have learnt to no longer make that assumption. Still, I had never imagined finding a new and unique bug, with such impactful consequences, in an operating system as obscure (at least, for me) as Solaris.

This issue was identified about a year ago (April 2016) when troubleshooting an unrelated ESX performance issue on Solaris.

We were running Oracle Database performance tests on top of Solaris 11.2, trying to push CPU performance as hard as we could. This test would also hit the underlying storage very hard. But for some reason, we found CPU utilization flattening out well before 100%.

So we immediately suspected a storage bottleneck.

We decided to move the VM to the fastest SSD tier we had.

We then ran the test again, and this time I monitored the underlying storage performance very closely and ran into the same issue.

In Solaris 11, the iostat command showed extremely high wsvc_t times. From the man page:

wsvc_t
average service time in wait queue, in milliseconds

svc_t
average response time of transactions, in milliseconds

The svc_t output reports the overall response time, rather than the service time, of a device. The overall time includes the time that transactions are in queue and the time that transactions are being serviced. The time spent in queue is shown with the -x option in the wsvc_t output column. The time spent servicing transactions is the true service time. Service time is also shown with the -x option and appears in the asvc_t output column of the same report.
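For reference, this is roughly the pattern we were seeing in the guest. The numbers below are purely illustrative, not the customer's actual output, but they show the tell-tale combination of a huge wsvc_t with a modest asvc_t and an almost empty active queue:

# Extended device statistics, 5-second intervals (values are illustrative only).
$ iostat -xn 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  150.0   10.2 1200.5   80.3 45.0  1.0  280.4    0.9  99  15 c0t0d0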

However, the underlying hardware showed no physical I/O being bottlenecked anywhere. I checked this against the ESX hypervisor and, to be extra sure, against the underlying physical storage, in our customer's case EMC VPLEX on top of EMC VNX. But neither ESX nor the physical hardware showed signs of even receiving any significant storage load.

 

So here 'iostat' in the guest OS and esxtop on ESX were giving contradictory results. It was clear from esxtop that the I/Os were being bottlenecked somewhere above the hypervisor layer. That left several possible suspects in the guest OS: filesystem, caching effects, controller driver, etc.
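For the ESX side of that comparison, the quickest sanity check is the interactive esxtop disk views. Roughly what we looked at (the counter names are standard esxtop fields; the reading of "low" is ours):

# On the ESXi host:
esxtop
#  press 'u' for the storage device view (or 'd' for the storage adapter view)
#  watch DAVG/cmd (device latency), KAVG/cmd (kernel latency), QAVG/cmd
#  (VMkernel queueing) and the ACTV / QUED queue-depth columns.
# In our case these all stayed low while the guest reported huge wait times,
# so the bottleneck had to sit above the hypervisor.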

But how were things being bottlenecked? This is where vscsiStats really helped.

The outstanding i/o histogram (-p option) very clearly showed what the problem was.
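For anyone wanting to reproduce this, a minimal vscsiStats run looks roughly like the sketch below; the world group ID 12345 is just a placeholder for your own VM's ID:

# On the ESXi host: find the VM's world group ID, start collection,
# run the workload in the guest, then print the outstanding-I/O histogram.
vscsiStats -l                          # list VMs and their world group IDs
vscsiStats -s -w 12345                 # start collecting for that VM
# ... run the disk test inside the guest ...
vscsiStats -p outstandingIOs -w 12345  # histogram of outstanding I/Os
vscsiStats -x                          # stop collection when done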

An example healthy VM (Red Hat or Solaris 10):

A Solaris 11 VM:

The disk I/O was being offered to the virtual disk device in a synchronous fashion: just 1 I/O at a time. This was how it was being received from the guest OS, proving the issue was somewhere inside Solaris 11.

We confirmed that in all our disk tests, using Oracle ASM, the i/o was indeed being provided asynchronously to the Solaris 11 guest OS. This ruled out an application issue.
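One simple way to sanity-check that from inside the guest, assuming you can attach to one of the Oracle writer processes (the PID below is a placeholder), is to count its system calls for a while and confirm that the async kernel I/O calls dominate:

# Attach to a DB writer process and count its system calls; interrupt with
# Ctrl-C after ~30 seconds to get the summary table.
truss -c -p 4242
# With async I/O working, the summary is dominated by kaio calls rather
# than synchronous pread/pwrite.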

It was now clear we had to open a case with Oracle for a possible Solaris 11 issue.

As Oracle has a bit of a... shall we say... sketchy support policy around VMware VMs, we were not sure how much headway we could make. I made sure we built a rock-solid case first using the investigations above. Then I made a nice and clear PDF document that summarized the key points we were trying to make. The screenshots in this post are from that document. I cannot stress enough how helpful a reference case in an immutable document format can be in this kind of support process. You can always refer back to it.

The case we opened was 'Severity 3 SR 3-13298414671 : Disk IO from a Solaris VM in VMware ESXI turns out to be synchronous even though it is asynchronous'.
It took a while (several engineers, and a conference call together with VMware GSS) for the case to land with the correct support team inside Oracle: the 'storage drivers group'.

We ended up with Remi Colinet, who was excellent, extremely knowledgeable, and fast to identify where the issue was.
First they confirmed that both Solaris 10 and 11 were indeed using the same controller driver, 'mpt_sas', and not, for example, the older 'mpt' driver, and that the configuration settings were, generally, the same. We were asked to provide GUDS results and even a 'savecore -L'.
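Gathering that data from the guest side looks roughly like this; a small sketch, and the exact output will differ per VM:

# Confirm which driver is bound to the virtual LSI Logic SAS controller.
prtconf -D | grep -i mpt     # shows mpt_sas on both our Solaris 10 and 11 VMs
# Take a live kernel dump for Oracle support (saved under /var/crash by default).
savecore -L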

From the live dump of the Solaris kernel, they identified that, for some reason, the mpt_sas driver was disabling all native tagged command queuing functions for the controller:

mpt

   ushort_t m_props_update = 0
   ushort_t m_notag = 0xffff
   ushort_t m_nowide = 0xe080

sd

CAT(vmcore.0/11X)> sdump 0xffffc100081310c0 sd_lun | egrep 'throttle|queue|arq'
  short un_throttle = 3 <============= this is the value visible in the iostat
  short un_saved_throttle = 3
  short un_busy_throttle = 0
  short un_min_throttle = 8
  timeout_id_t un_reset_throttle_timeid = NULL
  unsigned :1 un_f_arq_enabled = 1
  unsigned :1 un_f_opt_queueing = 1 <=============== queueing only at the HBA level
  unsigned :1 un_f_use_adaptive_throttle = 0
CAT(vmcore.0/11X)>

While on Solaris 10, the same driver, with the exact same emulated controller, was setting:

mpt
 
   ushort_t m_props_update = 0
   ushort_t m_notag = 0xffc0
   ushort_t m_nowide = 0xffc0
sd
 
CAT(vmcore.0/10X)> sdump 0xffffffff853ed000 sd_lun | egrep 'throttle|queue|arq'
  short un_throttle = 0x100 <============ 256 commands supported
  short un_saved_throttle = 0x100
  short un_busy_throttle = 0
  short un_min_throttle = 8
  timeout_id_t un_reset_throttle_timeid = NULL
  unsigned int :1 un_f_arq_enabled = 1
  unsigned int :1 un_f_opt_queueing = 0 <============== no HBA cmd queueing
  unsigned int :1 un_f_use_adaptive_throttle = 0
CAT(vmcore.0/10X)>

You can find the reference to these options in the mpt driver source here:

	/*
	 * This u_short is a bit map for targets who don't appear
	 * to be able to support tagged commands.
	 */
	ushort_t	m_notag;

In other words, the sd (SCSI disk) driver was making wrong assumptions about the SCSI devices. It was ignoring some capabilities (like TCQ) of the disks exposed by the VMware platform.
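To make that concrete: m_notag is a per-target bitmap, one bit per SCSI target ID, where a set bit means "this target cannot do tagged queuing". Decoding the two values from the dumps above (the decoding itself is my own illustration):

# 0xffff -> 1111111111111111 : Solaris 11, all 16 target IDs flagged as unable
#                              to do tagged command queuing.
# 0xffc0 -> 1111111111000000 : Solaris 10, targets 0-5 (the range where the
#                              virtual disks sit) still allowed to queue.
echo 'obase=2; ibase=16; FFFF; FFC0' | bc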

It was clear this was a bug introduced somewhere in Solaris 11, but quite an easy one to fix. We had an IDR (Interim Diagnostic and Relief patch) within several weeks, and in March 2017 the official patch with the fix came out (see top of post). It has since been backported to all versions of Solaris 11 and is included in Solaris 12.
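If you want to check whether a given Solaris 11 VM already carries the fix, the installed SRU level is the thing to look at. Something along these lines should do it; the exact formatting of the version string is from memory and may differ per SRU:

# The 'entire' package pins the SRU level of the image.
pkg info entire | grep -i version
#   Version: 0.5.11 (Oracle Solaris 11.3.17.5.0)   <- 11.3 SRU 17.5 or later has the fix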

Here is a very healthy looking Solaris 11.3 result:

This was an extremely interesting case to come across. But what really surprises me most is how no one ever ran into this before. Solaris 11 has been out since 2011. I know that Solaris is rare these days, and considering Oracle's support policy, perhaps even rarer on VMware vSphere. But anyone running any kind of disk-I/O-intensive load on Solaris 11 on ESX should have run into this. Perhaps they have, but have just never been able to exactly pinpoint it.

It is a shame that the Oracle knowledge base is closed off and not indexed by Google. This prevents people from finding fixes to issues like this. Instead, for the past 5 years, many people may have mistakenly thought that Solaris 11 on ESX simply sucked for no apparent reason.

Maybe this is one of the many reasons people have abandoned Solaris as an OS.

What impressed me most was that, once we got the right people together, things moved very quickly. Valentin and Remi both showed excellent professionalism and deep knowledge. Perhaps it is no surprise then that Valentin Bondzio is a contributing author to Frank Denneman and Niels Hagoort's upcoming book: “vSphere 6.5 Host Resources Deep Dive”.

And the Solaris and Oracle admins of our customer are extremely impressed with the newfound performance of their VMs, now that this buggy brake has been taken off.
In one case, where we really pushed the platform with 8 vCPUs and extremely high thread concurrency, where at first a mere 1,000 (read) IOPS were possible, we now measured up to 140,000 (read) IOPS during a single test.

Now I must show off some of those sexy numbers of course. Consider this a celebration of success 😉

This is a nice showcase for the EMC VPLEX also, btw.

 

