Identifying VSAN Bottlenecks Using VSAN Observer

VSAN Observer is a Virtual SAN monitoring tool included in the Ruby vSphere Console. It provides three monitoring modes for a Virtual SAN cluster: live (online) monitoring, offline monitoring, and a raw stats bundle.

To get VSAN Observer running, see the VMware KB article at http://kb.vmware.com/kb/2064240. In this article we will identify and troubleshoot a particular performance issue.

In the image below we are showing the VSAN Client tab on the VSAN Observer web page. This tab shows statistics as seen by the client (the VM) when accessing VSAN storage. In this case we are focused on the Latency graph. The latency seen by this VM fluctuates between 20 and 45 ms within a 5-minute period. High latency is generally the enemy of most workloads, so let’s dig a little deeper and see what’s occurring.

[Image: VSAN Client tab, Latency graph]

If we move along the VSAN Observer tabs to the VSAN Disks tab we see similar graphs. As we are focused on latency in this example, let’s drill down on the latency VSAN sees when accessing the physical disks. In the same 5-minute period as before, the latency graph is almost identical. This tells me that the latency is common to both layers, which rules out the network as the culprit.

[Image: VSAN Disks tab, Latency graph]
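The comparison above can be sketched numerically. The snippet below is only an illustration: it subtracts the latency seen at the VSAN Disks layer from the latency seen by the client for each sample. If the gap stays small, little latency is being added between the two layers, which is the reasoning used here to rule out the network. The function names, the sample values and the 2 ms tolerance are all made up for the example and are not read from the graphs.

```python
def layer_gap_ms(client_ms, disks_ms):
    """Per-sample latency added between the VSAN Disks layer and the client."""
    if len(client_ms) != len(disks_ms):
        raise ValueError("both series must cover the same samples")
    return [c - d for c, d in zip(client_ms, disks_ms)]


def network_ruled_out(client_ms, disks_ms, tolerance_ms=2.0):
    """True when the two series track each other within tolerance_ms."""
    gaps = layer_gap_ms(client_ms, disks_ms)
    return all(abs(g) <= tolerance_ms for g in gaps)


# Hypothetical samples (ms) for the same 5-minute window, NOT real data.
client = [22.0, 31.0, 45.0, 38.0, 27.0]
disks = [21.5, 30.0, 44.0, 37.5, 26.0]

print(layer_gap_ms(client, disks))
print(network_ruled_out(client, disks))
```

A large or growing gap would point the other way, toward something between the client and the disks.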

Other types of bottlenecks can produce latency symptoms in a VSAN cluster, but if we look at the graphs below we see that CPU utilization and the VSAN worldlets are well within an acceptable range. From this we can conclude that there are no CPU-related issues.

[Image: CPU Utilization and VSAN worldlets graphs]

Let’s go one step further and look at the physical disk layer. Here we see that the disk latency is fairly reasonable, at between 5 and 15 ms, and the Read Cache hit ratio is up around 90%, which tells me the cache is being utilized as it should be.

So let’s go back to the VSAN Disks tab for a deeper look at those graphs. One graph I didn’t touch on before was the Outstanding IO. Outstanding IO is IO that has yet to be completed. VSAN has a scheduler that runs at the VSAN Disks level to prioritize traffic classes and essentially minimize the outstanding IO and hence keep latency low. So why is it so high in this case?

[Image: VSAN Disks tab, Outstanding IO graph]

The devil here is in the detail of the SSD. The specifications of this SSD show that it is capable of 4k-6k write IOPS at 4 outstanding IO (OIO). In the graph above we are seeing 200 OIO.

This means that a batch of 4 IOs takes between 0.75 and 1 ms.

200 OIO => 50 batches of 4 IOs, which equates to 50 x 0.75-1 ms = roughly 37.5-50 ms

If you recall, at the beginning we witnessed somewhere between 20 and 45 ms of latency. It’s clear that, given the limitations of the SSD device we are looking at here, latency of this order is exactly what we should expect.
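The batch arithmetic is an application of Little's Law (outstanding IO = IOPS x latency). As a minimal sketch, the function below divides the observed OIO by the device's IOPS rating directly. It assumes the SSD sustains its rated 4k-6k write IOPS at the higher queue depth, which is the simplification the estimate relies on. The function name is made up for illustration. Dividing directly gives a slightly wider range than the rounded per-batch figures used in the text.

```python
def latency_ms(outstanding_io: int, iops: float) -> float:
    """Average latency in ms implied by Little's Law: latency = OIO / IOPS."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_io / iops * 1000.0


# SSD rating from the article: 4k-6k write IOPS. Observed queue depth: 200 OIO.
best_case = latency_ms(200, 6000)   # device running at the top of its rating
worst_case = latency_ms(200, 4000)  # device running at the bottom of its rating

print(f"Expected latency: {best_case:.1f}-{worst_case:.1f} ms")
# prints "Expected latency: 33.3-50.0 ms"
```

Either way, the observed 20-45 ms sits where the device's rating says it should at this queue depth.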

In this situation I have highlighted the relationship between OIO, latency and IOPS. It is actually possible to compute the latency if one knows the OIO of the test scenario and the IOPS of the device. Although this is one specific simulated problem, hopefully you can see how simple and useful VSAN Observer is for monitoring and troubleshooting performance issues should you need to.

 

Thank you to Christian Dickman (Tech Lead – VSAN) who allowed me to publish this article from his VMworld 2014 Presentation.

Look forward to a blog series on blogs.vmware.com coming soon with more scenarios like the above.

 

Using a Remote RVC Host

Occasionally I am asked whether there is a downside to running VSAN Observer on vCenter. Generally my recommendation is that if a customer is using it for short-term monitoring or troubleshooting in a small(ish) environment, it should be fine. For customers who are using it in larger clusters and more intensively, I’d recommend a different approach.

VSAN Observer is a tool used to analyse the performance of a VSAN environment. Customers and VMware Support can use this tool to gain a deeper insight into their environment. For instructions on running VSAN Observer, see http://www.punchingclouds.com/2013/09/03/vsphere-5-5-using-rvc-vsan-observer-pt2/

A vCenter running VSAN Observer holds all of the session history in its memory until the session either times out or is manually halted by the user with the <Ctrl>+<C> combination. At present the default interval at which VSAN Observer collects stats is 60 seconds, which is quite aggressive. By comparison, vCenter Operations collects stats every 5 minutes, so the use cases of the two are quite different. VSAN Observer is meant to be run for short periods, for example for on-the-spot troubleshooting or point-in-time monitoring.

In situations where VSAN Observer is used often for heavier work it might be suitable to configure another vCenter Server as a dedicated RVC instance. Doing so would remove any potential impact on your production vCenter Server.

Of course this depends greatly on the customer’s usage and ability to manage another vCenter instance. This vCenter instance should be a cut-down version, not integrated into the production environment with SSO and the like. Using the vCenter Virtual Appliance would be the simplest recommendation I’d give. Just install a new vCVA and configure it with a basic embedded config as per normal. In fact, it’s entirely possible to stop all the vCenter services on this server if really required.

 

NOTE: You can change the VSAN Observer collection interval and max runtime:

--interval or -i               Interval in seconds at which to collect stats (default: 60)

--max-runtime or -m            Maximum number of hours to collect stats (default: 2)
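Because the session history is held in memory, the collection interval and the maximum runtime together determine how many snapshots a full run accumulates. A rough sketch of that arithmetic follows. The function name is made up for illustration, and the size of each snapshot, which depends on the cluster, is deliberately not estimated.

```python
def snapshots_per_run(interval_s: int = 60, max_runtime_h: float = 2.0) -> int:
    """Number of stats snapshots collected if the run lasts the full max runtime."""
    if interval_s <= 0 or max_runtime_h < 0:
        raise ValueError("interval must be positive and runtime non-negative")
    return int(max_runtime_h * 3600) // interval_s


print(snapshots_per_run())                # defaults (60 s, 2 h) -> 120
print(snapshots_per_run(interval_s=300))  # 5-minute interval, 2 h -> 24
```

Lengthening the interval or shortening the runtime reduces the amount of history kept in memory in direct proportion.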

 

After installing the second vCenter, simply open an SSH connection to it and type rvc root@prodvc (where prodvc is your production vCenter Server).

 

If this is the first time you have connected you will be prompted to accept the warning:

Host to connect to (user@host): root@mysecondvc

The authenticity of host 'mysecondvc' can't be established.

Public key fingerprint is 6d62e48e1c8d80fc651359a356edc51b949097d68d1ca085fcd5415f8d46734e.

Are you sure you want to continue connecting (y/n)? y

Warning: Permanently added 'mysecondvc' (vim) to the list of known hosts

password:

 

Once the password is entered correctly you’ll be at the prompt and you are good to go. I should reiterate that only special cases would call for this type of configuration, but it is definitely possible.

 

RVC fun with VSAN

In the first part of my exploration into the Ruby vSphere Console I have stumbled across so many new and interesting ways to interact with VSAN.

For a brief introduction to the RVC I suggest you start back here http://www.punchingclouds.com/2013/08/30/vsphere-5-5-vsphere-ruby-console/.

I came across a little nugget I was not aware of: the ‘table’ command, for which I couldn’t find much documentation. Although fairly straightforward, the command is really quite powerful.

/localhost/DCName> table -h

usage: table [opts] obj...

Display a table with the selected fields

You may specify the fields to display using multiple -f options, or

separate them with ':'. The available fields for an object are

shown by the "fields" command.

obj: Path to a RVC::InventoryObject

--field, -f <s>:   Field to display

--sort, -s <s>:   Field to sort by

--reverse, -r:   Reverse sort order

--help, -h:   Show this message

/localhost/VSAN-DC> table -f name -f state.connection:num.vms:num.poweredonvms:cpuusage:memusage:uptime:build ~/computers/*/hosts/*

[Screenshot: output of the table command]

See, pretty cool hey!

I’ll have lots more stuff coming soon.