VSAN Observer is a Virtual SAN monitoring tool included in the Ruby vSphere Console (RVC). It provides three monitoring modes for a Virtual SAN cluster: Live (or Online) Monitoring, Offline Monitoring, and Raw Stats Bundle.
To get VSAN Observer up and running, see this KB article provided by VMware: http://kb.vmware.com/kb/2064240. In this article we will identify and troubleshoot a particular performance issue.
The image below shows the VSAN Client tab on the VSAN Observer webpage. This tab shows what the client (the VM) sees when accessing the VSAN storage. In this case we are focused on the Latency graph. Latency seen by this VM fluctuates between 20-45ms within a 5-minute period, as shown by the graph. High latency is generally the enemy of most workloads, so let’s dig a little deeper and see what’s occurring.
If we move along the VSAN Observer tabs to the VSAN Disks tab we see similar graphs. As we are focused on latency in this example, let’s drill down on the latency VSAN sees when accessing the physical disks. In the same 5-minute time period as before, the latency graph is almost identical. This tells me that the latency is common to both views: if the network were adding delay, the client would see noticeably more latency than the disk layer does. That rules out the network as the culprit.
Other types of bottlenecks might produce latency-type symptoms in a VSAN cluster, but if we look at the graphs below we see that CPU utilization and the VSAN worldlets are well within the acceptable range. From this we can conclude that there are no CPU-related issues.
Let’s go one step further and look at the physical disk layer. Here we see that the disk latency is fairly reasonable at 5-15ms and the Read Cache hit ratio is around 90%, which also tells me the cache is being utilized as it should be.
So let’s go back to the VSAN Disks tab for a deeper look at those graphs. One graph I didn’t touch on before was the Outstanding IO. Outstanding IO is IO that has yet to be completed. VSAN has a scheduler that runs at the VSAN Disks level to prioritize traffic classes and essentially minimize the outstanding IO and hence keep latency low. So why is it so high in this case?
The devil here is in the detail of the SSD. The specifications of this SSD showed that it was capable of 4k-6k Write IOPS at 4 Outstanding IO. In the graph above we are seeing 200 OIO.
This means that a batch of 4 IOs takes roughly 0.7-1ms (4 / 6,000 IOPS ≈ 0.67ms, 4 / 4,000 IOPS = 1ms).
200 OIO => 50 x 4 IO batches, which equates to 50 x 0.7-1ms = 35-50ms
If you recall, at the beginning we saw somewhere between 20-45ms of latency. Given the limitations of the SSD device we are looking at here, this performance problem is exactly what we should expect to see.
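The arithmetic above is an instance of Little's Law (average latency = outstanding IO / throughput). Here is a minimal Python sketch of that calculation, using the SSD figures quoted in this example. The function name is made up for illustration, and the sketch assumes the device keeps delivering its rated IOPS as the queue grows, which is the same simplification used in the back-of-envelope numbers above.

```python
def expected_latency_ms(outstanding_io: float, iops: float) -> float:
    """Little's Law: average latency = outstanding IO / throughput (IOPS).

    Returns the expected average latency in milliseconds.
    """
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_io * 1000.0 / iops


# Figures from this example: 200 OIO against an SSD rated at
# 4k-6k write IOPS (assumed to hold at this queue depth).
best_case = expected_latency_ms(200, 6000)   # device at its upper IOPS rating
worst_case = expected_latency_ms(200, 4000)  # device at its lower IOPS rating
print(f"Expected latency: {best_case:.1f}-{worst_case:.1f} ms")
```

The same function gives about 0.7-1ms for the 4 OIO the SSD was specified at, which is why the queue depth of 200 is what stands out here.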
In this situation I have highlighted the relationship between OIO, latency and IOPS. It is actually possible to compute the latency if one knows the OIO of the test scenario and the IOPS of the device. Although this is one specific simulated problem, hopefully you can see how simple and useful VSAN Observer is for monitoring and troubleshooting performance issues should you need to.
Thank you to Christian Dickman (Tech Lead – VSAN) who allowed me to publish this article from his VMworld 2014 Presentation.
Look forward to a blog series on blogs.vmware.com coming soon with more scenarios like the above.