Using SMP-FT with Virtual SAN 6.1

One of the new capabilities in VSAN 6.1 is the ability to support virtual machines that use Fault Tolerance. You may recall that when we released vSphere 6.0 early in 2015 we improved the mechanism FT uses. The main drivers behind this were efficiency and the ability to support virtual machines with more than a single vCPU. The new technology used by FT is called Fast Checkpointing and is basically a heavily modified version of an xvMotion.

A quick recap on Fault Tolerance:

  • Protect mission-critical applications, regardless of OS and application, from a vSphere host failure. FT provides a ZERO RPO/RTO and does not require TCP connections to be reestablished with the guest operating system.
  • FT protects the virtual machine as a whole, so no operating system or application-specific modifications are needed.
  • The new version of Fault Tolerance greatly expands the use cases for FT to approximately 90% of workloads. This is based on customer usage data for virtual machine configurations: 90% of VMs deployed have 4 vCPUs and 64 GB of memory or less.
  • Configuration is simple. Point and click on the VM that you wish to protect and enable.
  • Support for up to 4 vCPUs and 64 GB Memory per VM
  • Maximum of either 8 vCPUs or 4 FT-protected VMs per host
  • Redundant VMDKs
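
To make the sizing limits above concrete, here is a minimal Python sketch of a placement check. It is not a VMware API: the names `VmSpec`, `vm_within_ft_limits` and `host_can_accept` are hypothetical, and it assumes the per-host rule means "whichever of the two limits is reached first".

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the recap list above.
MAX_VCPUS_PER_VM = 4
MAX_MEMORY_GB_PER_VM = 64
MAX_FT_VCPUS_PER_HOST = 8
MAX_FT_VMS_PER_HOST = 4


@dataclass(frozen=True)
class VmSpec:
    """Hypothetical description of an FT-protected VM (primary or secondary)."""
    name: str
    vcpus: int
    memory_gb: int


def vm_within_ft_limits(vm: VmSpec) -> bool:
    """True if the VM fits the per-VM FT limits (4 vCPUs, 64 GB)."""
    return vm.vcpus <= MAX_VCPUS_PER_VM and vm.memory_gb <= MAX_MEMORY_GB_PER_VM


def host_can_accept(existing: List[VmSpec], candidate: VmSpec) -> bool:
    """True if adding the candidate keeps the host within both per-host limits.

    Assumption: the host is full as soon as either limit (8 FT vCPUs or
    4 FT VMs) would be exceeded.
    """
    if not vm_within_ft_limits(candidate):
        return False
    total_vms = len(existing) + 1
    total_vcpus = sum(vm.vcpus for vm in existing) + candidate.vcpus
    return total_vms <= MAX_FT_VMS_PER_HOST and total_vcpus <= MAX_FT_VCPUS_PER_HOST
```

With this sketch, a host already running two 4 vCPU FT VMs rejects a third FT VM on vCPU count alone, even though it is below the 4 VM limit.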

Screen Shot 2015-09-19 at 9.41.47 am

Once an FT failover happens (i.e. the secondary changes role to become primary), HA requests hostd to start the secondary VM on a new host. HA then initiates a new FT migration on the primary VM to set up the FT protection again. This is all done automatically, without a dependence on vCenter.
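
The failover sequence described above can be sketched as a toy state model. This is only an illustration of the ordering of events, not how HA or hostd are implemented; `FtPair`, `handle_host_failure` and the host names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FtPair:
    """Toy model of an FT-protected VM: which host runs which role."""
    primary_host: str
    secondary_host: Optional[str]

    @property
    def protected(self) -> bool:
        # The VM is only FT-protected while a secondary exists.
        return self.secondary_host is not None


def handle_host_failure(pair: FtPair, failed_host: str, healthy_hosts: List[str]) -> List[str]:
    """Apply the failover steps and return a log of what happened."""
    events: List[str] = []
    if failed_host == pair.primary_host:
        if pair.secondary_host is None:
            raise RuntimeError("primary lost while unprotected; no FT failover possible")
        # Step 1: the secondary changes role and becomes the primary.
        pair.primary_host = pair.secondary_host
        pair.secondary_host = None
        events.append("secondary promoted to primary")
    elif failed_host == pair.secondary_host:
        pair.secondary_host = None
        events.append("secondary lost")
    else:
        return events  # failure does not affect this FT pair

    # Step 2: a new secondary is started on another healthy host,
    # which re-establishes FT protection.
    candidates = [h for h in healthy_hosts if h not in (pair.primary_host, failed_host)]
    if candidates:
        pair.secondary_host = candidates[0]
        events.append(f"new secondary started on {candidates[0]}")
    return events
```

For example, with a primary on a host called `esx1` and a secondary on `esx2`, a failure of `esx1` leaves `esx2` as the primary and places a new secondary on the next healthy host in the list.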

Screen Shot 2015-09-21 at 2.39.29 pm

So how does this actually look from a storage perspective? The following graphic shows a VM called ‘photon_1’ which has a single VMDK of 10 GB. The VM is configured for Failures To Tolerate = 1.

Screen Shot 2015-09-21 at 3.21.49 pm (2)

The relationship between the primary and secondary VM is shown in the vsanDatastore view, where we can see the photon-1 and photon-1_1 VMs (photon-1_1 being the secondary).

Screen Shot 2015-09-21 at 3.44.18 pm (2)

During a test failover we can see that the primary has failed, the secondary is taking over as the new primary, and a new secondary is being provisioned. The image below shows the protected status of the VM and no loss of ping during the failover process.

Screen Shot 2015-09-21 at 3.40.44 pm (2)

I have provided a video of the configuration process and failover test here:

From a VSAN perspective it is important that both the VMDK and the namespace have FTT=1 configured in the VSAN policy. The reason for this is that when a VM loses its storage, its behaviour is not defined. It may continue running for a while and then the guest OS may crash without the VM itself going down. In this case, FT will not take over. Having FTT=0 makes this scenario more likely to happen than when FTT is set to 1.

The other reason is that the FT tie-breaker file needs to be accessible to the secondary at the time of a failover. For FT VMs where the primary and secondary are on the same datastore, such as in VSAN, the FT tie-breaker file is placed in the primary VM’s namespace directory. If the primary VM’s host goes down and takes its storage along with it, the secondary VM will not be able to access the FT tie-breaker file and hence will not be able to go live. This holds true even if the tie-breaker file were created in the secondary VM’s namespace directory or in its own directory.
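
As a rough illustration of the policy requirement above, the following Python sketch flags any object of an FT VM whose policy is below FTT=1. It does not query VSAN: the object names and FTT values are made-up inputs, and `objects_below_required_ftt` is a hypothetical helper.

```python
from typing import Dict, List

# Per the text above, both the VM namespace (which holds the FT tie-breaker
# file) and the VMDK should be configured with FTT=1.
REQUIRED_FTT = 1


def objects_below_required_ftt(object_ftt: Dict[str, int],
                               required: int = REQUIRED_FTT) -> List[str]:
    """Return the names of objects whose FTT is lower than required."""
    return sorted(name for name, ftt in object_ftt.items() if ftt < required)


# Example input (illustrative values only): the namespace was left at FTT=0,
# so a host failure could take the tie-breaker file down with it.
at_risk = objects_below_required_ftt({"namespace": 0, "vmdk": 1})
```

An empty result means every listed object meets the policy; anything returned should be moved to an FTT=1 policy before relying on FT.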

It’s also important to understand the difference between the two mechanisms here, VSAN and FT. VSAN’s availability model is one of eventual consistency and high availability, whereas FT is designed for immediate recovery, or ‘Fault Tolerance’, from a host-related outage. In the past FT has provided protection against hardware failure but has left storage redundancy and availability up to the customer, and in some cases that was not provided for. The benefit of VSAN in this case is that it protects against a mixture of storage outages and host outages, providing a truly highly available and integrated solution.

Use Cases for FT

  • Any non-latency-sensitive applications supported on vSphere (up to 4 vCPUs and 64 GB memory)
  • Applications that cannot be protected using other methods
  • CPU intensive applications
  • Workloads that are tolerant of network latency

In conjunction with VMware HA, VMware Data Protection, VSAN Replication, Site Recovery Manager and VSAN Stretched Clusters, Fault Tolerance is yet another DA/DR solution that can be utilised in vSphere/VSAN environments.

For more information on Fault Tolerance please read the FAQ here and the vSphere 6.0 Availability Guide.