Troubleshooting VMware Storage Latency Problems – iSCSI/FC Storage

Has a storage admin ever told you they’re sure that VMware is causing your storage performance issues? Are you the storage admin, having trouble figuring out whether your storage really is running slow for your VMware environment? Both of these scenarios are REALLY common. Here we’ll examine how to use esxtop to see how each part of the VMware storage stack is performing.

Storage latency in your stack can easily be examined with esxtop, a real-time command line tool available on every ESXi host. At first glance esxtop can be really scary, with lots of numbers and columns, but by digging in a bit and focusing on just one section at a time it’s easy to begin picking things up.

Log in to an ESXi host that is running the VM with “unacceptable” performance and execute the following to get going:

~ # esxtop

This will launch esxtop on your happy little command line. If all goes well, you’ll get a screen that looks like this:

What esxtop should look like.

WARNING – APPLE USER TANGENT:

If you happen to be using an Apple product to connect to your ESXi host, you might see something like this:

Default Apple ESX top.

No need to fear; there are some simple terminal settings that you need to adjust for the application to display correctly. Blogging giant and great guy, Jason Nash, can help you solve this issue quickly in his great blog post over here.

TANGENT OVER

Let’s navigate over to the help section for esxtop. Press ‘h’ on your keyboard to pull up the help screen for the application:

esxtop has a handy help screen!

You’ll notice a bunch of options for customizing the display. There is also a really handy option to change the number of seconds between each refresh of the data (by default esxtop updates every 5 seconds). To look at the performance of the host’s disk subsystem, press ‘d’ for the disk adapter view (I’ll list a few other useful keystrokes right after the next screenshot). That should pull up a screen that looks like this:

"disk adapter" esxtop screen

In my lab I kicked off a migration from an NFS-based datastore to the iSCSI datastore mounted on this host. The software iSCSI adapter in my lab happens to be vmhba40, and you’ll notice that vmhba40 gives us some nice stats to analyze.

So what does each of these numbers represent? Let’s dig into each of them:

NPTH – The number of paths to the device. In an enterprise environment each device should have at least two.

SIDEBAR… Just because the number here says two, it’s important to understand that the underlying infrastructure still may not be optimal. In my lab I’m actually using one physical switch with two separate IPs for the iSCSI datastore. While the number from esxtop looks okay, the actual infrastructure has single points of failure. Be sure that you know how many switches your environment has, and how many physical HBAs in each host are connecting to them. Bottom line… don’t assume things are okay just because esxtop tells you they are here. A quick way to double-check what’s behind the path count is shown below.
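If you want to verify what is actually behind that NPTH number, the esxcli storage namespace (available on ESXi 5.x and later) will list every path the host sees. The naa identifier below is a placeholder; substitute a device ID from your own environment:

~ # esxcli storage core path list
~ # esxcli storage core path list -d naa.xxxxxxxxxxxxxxxx

The first command dumps every path on the host, and the second filters to a single device so you can confirm it really has the number of (physically separate) paths you expect.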

CMDS/s – Total number of commands per second. While these are mostly I/O commands, SCSI reservations and metadata operations also count toward this total.

READS/s – Read commands (read IOPS) issued by the host per second

WRITES/s – Write commands (write IOPS) issued by the host per second

MBREAD/s – Megabytes read by the host per second – general host read bandwidth

MBWRTN/s – Megabytes written by the host per second – general host write bandwidth

DAVG/cmd – Device average per command, in milliseconds. This is the time it takes for a command to be passed to and serviced by your storage device. This is the number to discuss with the storage team, since it represents time that a storage command spends outside of the ESXi environment.

KAVG/cmd – Kernel average per command, in milliseconds. This is the time a command spends in the VMkernel storage stack before it is handed to the HBA driver on the ESXi host. Because this work is internal to the host, it should stay well below 1ms.

GAVG/cmd – Guest average – the total response time, in milliseconds, that guest VMs running on the host experience (roughly DAVG + KAVG). Times consistently above 20ms will likely cause end-user complaints. Be sure that you’re looking at your storage when you start averaging over 10ms, and for database hosts it’s best to keep this average below 5ms.

QAVG/cmd – Queue average – the time, in milliseconds, that storage requests spend waiting in queues inside the ESXi storage stack (this time is included in KAVG). Anything consistently over 1ms should be investigated, with an eye toward ensuring queue depths are set at the proper levels.
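One more trick before we put it all together: if you would rather capture these counters than watch them scroll by, esxtop has a batch mode that writes everything to CSV. A minimal sketch (the 2 second delay, 30 iterations, and output path are just examples):

~ # esxtop -b -d 2 -n 30 > /tmp/esxtop-capture.csv

The resulting file is a perfmon-style CSV, so it can be pulled into Windows Performance Monitor or a spreadsheet for offline analysis. If you save your preferred field layout with ‘W’ first, batch mode should honor it, which keeps the file size down.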

A really nice figure was put together by VMware to show how all of these areas relate to each other:

Pretty visual graphic of how it all fits together.

Putting it all together:

To give an example with a real workload, I kicked off a Storage vMotion writing to an iSCSI datastore in my home lab with esxtop running. That process generated the following esxtop output:

"disk adapter" esxtop screen

Breaking down the IO being displayed by section, we can determine that the IO is largely writes (WRITES/s) and that my host is writing approximately 33MB/s of data to my iSCSI storage array (MBWRTN/s). During this process, VM guests on my host were seeing an average response of 9.43ms for their normal storage operations (GAVG/cmd). QAVG and KAVG represent the VMware kernel overhead during these operations, and both are right where they should be, showing little to no contention from the kernel or device queues. In general, the performance the VM guests saw during the Storage vMotion operation was acceptable.
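As a side note, if you grabbed a batch-mode CSV during the Storage vMotion (per the earlier sketch), a quick way to see which columns belong to the adapter you care about is to split the header line and filter on the adapter name. The vmhba40 name is from my lab and the capture path is from the example above, so swap in your own:

~ # head -1 /tmp/esxtop-capture.csv | tr ',' '\n' | grep vmhba40

From there you can pull just those columns into a spreadsheet instead of wading through the full capture.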

I hope that this post helps to demystify some of the internals of storage performance in vSphere environments. Questions? Feel free to ask below!

 
