Streamlining the Virtual SAN Configuration

For most VMware customers, the integration of vSphere with Virtual SAN not only provides a simple and easy way to deliver enterprise storage and data services to your VMs, but also simplifies operations in the datacenter.

As Virtual SAN has grown through the releases, there has been a need to accommodate further use cases for our customers: VDI, ROBO, Stretched Clusters, and so on. Virtual SAN 6.2 now has a new configuration wizard which provides a more streamlined approach for these more complex configurations.

A quick glance at the new Configuration Wizard shows how we give customers the ability to select the disk-claiming method, choose whether to enable deduplication and compression for the cluster, and specify which type of VSAN cluster deployment they are after.


In addition, the wizard also validates the network configuration of the Virtual SAN interfaces. Of course, each host in the cluster needs to have a VSAN-enabled VMkernel interface to be able to participate. You can see from the screenshot below that I have missed one host. I also get information such as which vmk interface, port group, and IP address are assigned.
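The check the wizard runs can be sketched in a few lines of Python — a purely illustrative model, with made-up host records rather than the real vSphere API:

```python
# Sketch of the wizard's vSAN network validation: every host must have
# at least one VMkernel interface with vSAN traffic enabled.
# Host/interface records below are illustrative, not the vSphere API.

def validate_vsan_network(hosts):
    """Return (hostname, problem) tuples; an empty list means validation passes."""
    problems = []
    for host in hosts:
        vsan_vmks = [vmk for vmk in host["vmkernel"]
                     if "vsan" in vmk["enabled_services"]]
        if not vsan_vmks:
            problems.append((host["name"], "no vSAN-enabled VMkernel interface"))
    return problems

hosts = [
    {"name": "esx-01", "vmkernel": [
        {"device": "vmk1", "ip": "10.0.0.11", "enabled_services": ["vsan"]}]},
    {"name": "esx-02", "vmkernel": [
        {"device": "vmk1", "ip": "10.0.0.12", "enabled_services": ["vmotion"]}]},
]
print(validate_vsan_network(hosts))  # esx-02 fails the check
```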


If I rectify the misconfigured host by enabling the VSAN traffic type on one of its vmk ports, validation now passes and I can move on.


The next step is to claim my disks. If I had chosen ‘Automatic’ back in the first step I could skip this; however, for All-Flash we do need to claim the different device tiers manually. Starting in 6.2 there is a faster, more simplified way of bulk-claiming disks for the VSAN cluster.


Alternatively, I can do this by grouping the disks by model or size.
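Conceptually, the bulk-claim grouping works like this quick sketch (the disk records and model names are invented for illustration):

```python
# Group candidate disks by an attribute (model or size) so that a whole
# group can be claimed for the cache or capacity tier in one action.
from collections import defaultdict

def group_disks(disks, key):
    groups = defaultdict(list)
    for disk in disks:
        groups[disk[key]].append(disk)
    return dict(groups)

disks = [
    {"name": "naa.1", "model": "INTEL S3700", "size_gb": 200},
    {"name": "naa.2", "model": "INTEL S3510", "size_gb": 800},
    {"name": "naa.3", "model": "INTEL S3510", "size_gb": 800},
]
by_model = group_disks(disks, "model")
# Claim the small write-intensive devices as cache, the rest as capacity.
cache = [d["name"] for d in by_model["INTEL S3700"]]
capacity = [d["name"] for d in by_model["INTEL S3510"]]
```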


Either way, once cache and capacity disks have been claimed we are now ready to complete the wizard.


Depending on the number of hosts in the cluster, this will take a minute or two to configure; once completed, you will have a fully running VSAN cluster.

For a more in-depth walkthrough of creating Stretched Clusters using this wizard, please see the article VSAN Stretched Clusters – Click Click done!

For a video on configuring VSAN using this wizard see 

VSAN Stretched Clusters – Click Click done!

Creating Stretched Clusters could not be easier with VSAN. In fact the process of configuring different types of VSAN clusters, with features like deduplication and compression enabled is now a snap in the new Configuration Wizard.

In this article I will spend a few minutes walking through the wizard to get my Stretched Cluster set up. But first, here is a quick description of my environment. Pretty standard: two datacenters in the metro area about 30 km apart, and a head office where I will run my witness VM.


So let’s get started. First you need to download and deploy the witness appliance. It’s a basic appliance configuration that will ask you what size appliance to create based on the number of VMs to be managed. Grab it from the vSphere download link on the website.


As it stands I have four hosts in the cluster but have yet to configure VSAN. Commencing the configuration wizard, we are asked whether we want to create a standard, two-node, or Stretched Cluster. As mentioned before, VSAN 6.2 streamlines the configuration of all of these deployments into one wizard.

Here I’m simply going to select Automatic claiming of my disks and select Stretched Cluster.


We now have validation of the VSAN networks for each host in the cluster and I can see clearly my VSAN VMkernel network has been correctly configured, as per any normal VSAN host.



First I need to place the hosts into the correct Fault Domains, which I can do with a single click. I have also renamed both of my Fault Domains to reflect the locations of the datacenters.


So now my Fault Domains mirror my physical deployment: I have two Dell hosts in the City datacenter and two in the Clayton datacenter.


Of course, with VSAN Stretched Clusters I require access to the Witness appliance which I downloaded and deployed earlier; this appliance actually runs back at my main site. In this step I simply need to select the Witness for the cluster. You can see that the Witness is essentially a virtual ESXi host; to make it easy to differentiate, it is shown with a blue icon.


Depending on my choices I may need to claim the storage on this Witness appliance as well. It is important to do this, as VSAN today always expects to find a similar configuration across data nodes and witness nodes; hence the appliance is created with a cache disk and a capacity disk.


Verify the settings and complete.


Now if we navigate over to the Stretched Cluster management view, we can verify that it is correctly enabled. We can also see that the preferred Fault Domain is the Clayton datacenter. This means that in the event of a split-brain scenario, the Clayton datacenter and the Witness will form a quorum and the City datacenter will be isolated. In this case HA will power on VMs in the Clayton datacenter. The preferred site is also denoted by the yellow star on the Clayton node.
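The split-brain behaviour can be modelled with a toy tie-break function — a deliberate simplification (real VSAN decides quorum per object via component votes), using the site names from my lab:

```python
# Toy model of the stretched-cluster tie-break: on an inter-site partition,
# the side holding both the witness and the preferred fault domain keeps
# quorum; the other side is isolated and its VMs are restarted by HA.
def surviving_partition(partitions, preferred="Clayton", witness="Witness"):
    for part in partitions:
        if preferred in part and witness in part:
            return part
    # If the preferred site itself is down, the remaining majority wins.
    return max(partitions, key=len)

# Inter-site link failure: witness sides with the preferred (Clayton) site.
split = [{"Clayton", "Witness"}, {"City"}]
print(surviving_partition(split))
```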

Today a Stretched Cluster can tolerate only one site failure; however, other mechanisms such as SRM and vSphere Replication can be used to provide further protection against failures if required.


Monitoring the health of a Virtual SAN cluster is paramount, and a Stretched Cluster is no exception. VSAN now includes additional Stretched Cluster health checks that help monitor the configuration effectively. Below is a capture of the checks now available to ensure you stay happy and healthy.

I have introduced some artificial latency to demonstrate what happens here. As you will note, the requirement for RTT latency between the two datacenters is documented as up to 5 ms.
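The 5 ms ceiling makes the health check easy to reason about — roughly something like this sketch (the thresholds and sample data here are illustrative, not VSAN internals):

```python
# Sketch of the stretched-cluster RTT health check: flag the inter-site
# link when any recent round-trip measurement exceeds the documented
# 5 ms maximum between the two data sites.
RTT_LIMIT_MS = 5.0

def check_site_latency(samples_ms, limit_ms=RTT_LIMIT_MS):
    worst = max(samples_ms)
    return {"worst_rtt_ms": worst, "healthy": worst <= limit_ms}

print(check_site_latency([1.2, 1.4, 1.1]))  # healthy
print(check_site_latency([1.2, 8.7, 1.1]))  # would raise the health alarm
```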


If I increase the latency, I see an error such as the one below. In addition, VSAN 6.2 now also raises Health Alarms on the Summary pages.


Alarms can be programmed to send email or SNMP alerts so that other people or systems are notified when things aren’t working correctly.



Other considerations for Stretched Clusters include different network topologies, host groups and rules, and HA settings. All are covered in the Stretched Cluster Guide below.


There is a wealth of information available for Stretched Clusters today:

VSAN 6.2 Stretched Cluster Guide

VSAN Stretched Cluster Bandwidth and Sizing Guide

VSAN Stretched Cluster Performance and Best Practices

VSAN Stretched Clusters and SRM

Designing a VSAN Stretched Cluster

VSAN Stretched Cluster demo

VSAN Stretched Cluster supported network topologies


Hopefully you can now see how simple and intuitive it is to create VSAN clusters with 6.2.

Virtual SAN 6.2 – What’s in a release?

The Virtual SAN 6.2 release was a milestone for the company. In this release we focused not only on bringing enterprise-class data efficiency features, but also on improving simplicity and the VSAN operations experience in the vSphere Web Client for customers.

I can tell you the amount of work that goes into such a release takes all hands to the pump.

Although we make every effort to ensure our partners and customers have access to the latest material developed for this release, it can sometimes be hard to keep up. I’ve collected a list of resources here that you are hopefully aware of.

New product enhancements include:

  • All-Flash Deduplication/Compression: Deduplication eliminates duplicate copies of repeating data within the same disk group. The feature is enabled or disabled for the whole cluster. Dedupe and compression happen during de-staging from the caching tier to the capacity tier (deduplication first, then compression). The main value-add is storage savings, which can range from 2x to 7x depending on the workload.
  • All-Flash Erasure Coding (RAID5/RAID6): Erasure coding (EC) is a method of data protection in which data is broken into fragments that are expanded and encoded with a configurable number of redundant pieces and stored across different nodes. This provides the ability to recover the original data even if some fragments are missing. Erasure coding provides a much more storage-efficient way of achieving FTT=1 and FTT=2 on VSAN. It is a per-VMDK/object setting (RAID5 is 3+1 and RAID6 is 4+2).
  • Performance monitoring: Comprehensive performance monitoring in vCenter UI with common metrics across all levels (cluster, individual physical SSD/HDD, virtual disks)
  • Capacity reporting: A VSAN specific capacity view to report cluster-wide space utilization as well as detailed breakdown by data/object types. When dedup/compression/EC is enabled, it shows normalized space utilization and savings of these features
  • Health Service enhancements: proactive rebalance from UI, event based VC alarming, etc.
  • SDK for Automation & Third-Party Integration: The VSAN management SDK extends the vSphere API to deploy, configure, manage, and monitor VSAN. It will be available through many language bindings (SOAP, .NET, Java, Perl, Python, and Ruby) and will include code samples.
  • Software Checksum: End-to-end checksums help provide data integrity by detecting and repairing data corruption caused by bit rot or other issues in the physical storage media. Checksums are enabled by default, but may be enabled or disabled on a per-VM/object basis via SPBM.
  • Quality of Service (QoS): Set IOPS Limits: This provides the ability to set the maximum IOPS a VMDK can take (This will be a hard limit). This is a setting per VMDK, through Storage Policy-Based Management (SPBM). Customers wanting to mix diverse workloads will be interested in being able to keep workloads from impacting each other and avoiding the noisy neighbor issue.
  • IPv6 Support: Support for pure IPv6 environments
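The erasure coding options above translate directly into capacity overhead, which is worth a quick back-of-the-envelope calculation (illustrative arithmetic only, not VSAN code):

```python
# Raw capacity consumed per usable byte for VSAN protection schemes.
def ec_overhead(data_fragments, parity_fragments):
    """Erasure coding: (data + parity) / data."""
    return (data_fragments + parity_fragments) / data_fragments

def mirror_overhead(ftt):
    """Mirroring (RAID1) stores FTT + 1 full copies."""
    return ftt + 1

# FTT=1: RAID1 consumes 2x raw capacity, RAID5 (3+1) about 1.33x.
# FTT=2: RAID1 consumes 3x, RAID6 (4+2) only 1.5x.
print(mirror_overhead(1), round(ec_overhead(3, 1), 2))  # 2 1.33
print(mirror_overhead(2), round(ec_overhead(4, 2), 2))  # 3 1.5
```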

Updated Collateral

There will be more updates in the coming weeks. Also some blogs to look at for more information

There are plenty of others. The best place to stay up to date is to bookmark our Technical Resources page or follow us on Twitter @vmwarevsan.

VSAN Power Outage

These days, customers like to discuss all the operational impacts and failure scenarios when deciding on Virtual SAN: what happens when a disk fails, or a controller, a node, or the network? In some instances customers have even asked what happens if all of the nodes fail at once, perhaps fearing data loss, corruption, or problems when power is restored. That is absolutely NOT what happens. Read on.

Every quarter there is a scheduled power outage in our building and as such we have to go through a process of powering down the lab and powering it back on.


Fortunately, we have a pretty good process for managing this by now, and it’s pretty simple. On occasions when I’ve been travelling I haven’t had the opportunity to shut the VSAN lab down gracefully, and to this day it has never caused me an issue.

My little lab is modest and runs on 3 Dell R720 servers and I just did an in-place upgrade using VMware Update Manager to vSphere 6.0U1a (VSAN 6.1) this week for all the latest goodies. I am not running an Active-Active type VSAN Stretched Cluster (more on that here) in my work lab so I need to deal with localised outages as efficiently as I can.

It was late afternoon on the Friday and I received a reminder of the impending power outage. I quickly took a screenshot of the current lab as it was running at that time and decided to ease my way into the weekend and let fate take over.


Lab Screenshot – Prior to shutdown (Friday 2pm)

Fast forward, Monday morning. I entered the server room to check my servers had powered on. I sat down at my desk and checked to ensure vCenter had come back up as it is configured to do. I logged into the vSphere Web Client to find the environment in perfect health and my virtual machines in a healthy state.


Lab Screenshot (Monday 10am)

I also used the VSAN Health Plugin to check the data and object health for any weirdness. Everything seemed compliant and healthy. Happy days! Just to be sure, you can see the timeline in the Events tab.


So with VSAN we are able to provide crash consistency for the virtual machines: all I/O that has reached the cache tier is held persistently until the cluster comes back to a normal health status, at which point any required de-staging can take place. I quickly logged in to a few Windows and Linux guest machines to see if they were all healthy and discovered no issues.

As you can see, VSAN worked as designed. As I stated before, this has happened on several occasions over the course of the last two years and I have yet to experience a problem on return; this time I thought I’d share it with you. It’s actually pretty boring, but I’d argue that’s the best way to deal with major power outages and/or hardware failures.

The best thing I could point you to for understanding failure impact and troubleshooting on VSAN is a very comprehensive document we published called the VSAN Troubleshooting Reference Guide found here VSAN Troubleshooting Reference Manual – VMware.


Using SMP-FT with Virtual SAN 6.1

One of the new capabilities in VSAN 6.1 is the ability to support virtual machines that use Fault Tolerance. You may recall that early in 2015, when we released vSphere 6.0, we improved the mechanism behind FT. The main drivers were efficiency and the ability to support virtual machines with more than a single vCPU. The new technology used by FT is called Fast Checkpointing and is basically a heavily modified version of xvMotion.

For a quick recap on Fault Tolerance:

  • Protect mission critical applications regardless of OS and application from a vSphere host failure. FT provides a ZERO RPO/RTO and does not require TCP connections to be reestablished with the guest operating system.
  • FT protects the virtual machine as a whole, so no specific operating system or application modifications are required.
  • The new version of Fault Tolerance greatly expands the use cases for FT to approximately 90% of workloads. This is based on customer usage data for virtual machine configurations: 90% of VMs deployed have 4 vCPUs and 64 GB of memory or less.
  • Configuration is simple. Point and click on the VM that you wish to protect and enable.
  • Support for up to 4 vCPUs and 64 GB Memory per VM
  • Maximum of either 8 FT-protected vCPUs or 4 FT-protected VMs per host
  • Redundant VMDKs
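The limits above amount to a simple admission check, sketched below (illustrative only — vSphere enforces these limits itself, and the function and its inputs are invented for the example):

```python
# Check the SMP-FT limits listed above: 4 vCPUs / 64 GB per VM, and at
# most 8 FT vCPUs or 4 FT-protected VMs per host.
def can_enable_ft(vm_vcpus, vm_mem_gb, host_ft_vcpus=0, host_ft_vms=0):
    if vm_vcpus > 4 or vm_mem_gb > 64:
        return False  # exceeds the per-VM SMP-FT limits
    if host_ft_vcpus + vm_vcpus > 8 or host_ft_vms + 1 > 4:
        return False  # exceeds the per-host FT limits
    return True

print(can_enable_ft(4, 64))                   # True: at the per-VM limit
print(can_enable_ft(2, 32, host_ft_vcpus=7))  # False: would exceed 8 FT vCPUs
```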


Once an FT failover happens (i.e. the secondary changes role to become the primary), HA requests hostd to start a new secondary VM on another host. HA then initiates a new FT migration on the primary VM to set up FT protection again. This is all done automatically, without a dependency on vCenter.


So how does this actually look from a storage perspective? The following graphic shows a VM called ‘photon-1’ which has a single 10 GB VMDK. The VM is configured with Failures To Tolerate = 1.


The relationship between the primary and secondary VMs is shown in the vsanDatastore view, where we can see the photon-1 and photon-1_1 VMs (photon-1_1 being the secondary).

During a test failover we can see that the primary has failed, the secondary is taking over to become the new primary, and a new secondary is being provisioned. In the image below we can see the protected status of the VM and no loss of ping during the failover process.


I have provided a video of the configuration process and failover test here:

From a VSAN perspective it is important that the VMDK and the namespace have FTT=1 configured in the VSAN policy. The reason is that when a VM loses its storage, its behaviour is not defined: it may continue running for a while and then crash the guest OS without the VM itself going down. In this case, FT will not take over. Having FTT=0 makes this scenario more likely to happen than when FTT is set to 1.

The other reason is that the FT tie-breaker file needs to be accessible to the secondary at the time of a failover. For FT VMs where the primary and secondary are on the same datastore, such as on VSAN, the FT tie-breaker file is placed in the primary VM’s namespace directory. If the primary VM’s host goes down and takes its storage along with it, the secondary VM will not be able to access the FT tie-breaker file and hence will not be able to go live. This holds true even if the tie-breaker file were created in the secondary VM’s namespace directory or in its own directory.

It’s also important to understand the difference between the two mechanisms here, VSAN and FT. VSAN’s availability model is one of eventual consistency and high availability, whereas FT is designed for immediate recovery from, or ‘fault tolerance’ of, a host-related outage. In the past, FT has provided protection against hardware failure but has left storage redundancy and availability up to the customer, which is not always provided for. The benefit of VSAN in this case is that it protects against both storage outages and host outages, providing a truly highly available and integrated solution.

Use Cases for FT

  • Any non-latency-sensitive applications supported on vSphere (up to 4 vCPUs and 64 GB memory)
  • Applications that cannot be protected using other methods
  • CPU intensive applications
  • Workloads that are tolerant of network latency

In conjunction with VMware HA, VMware Data Protection, vSphere Replication, Site Recovery Manager, and VSAN Stretched Clusters, Fault Tolerance is yet another HA/DR solution that can be utilised in vSphere/VSAN environments.

For more information on Fault Tolerance please read the FAQ here and the vSphere 6.0 Availability Guide