Why VSAN and ROBO Are a Perfect Match

In my many conversations with customers, I hear the same feedback over and over:

  • Customers spend too much money on ROBO solutions.
  • Customers do not have infrastructure staff on site.
  • Customers don't run enough compute to justify large amounts of infrastructure.
  • Customers typically scale storage capacity on site, rather than compute and performance.

In VSAN 6.1 we are removing the requirement for 3 physical VSAN nodes. By placing the witness VM offsite in the customer's central datacenter, or even in vCloud Air, the customer can deploy only 2 physical nodes on site.

Removing the 3 physical node requirement provides a lower cost option, while still allowing the site to be managed from a remote vCenter.

Some of the benefits I can see with this model for such use cases are:

  • Capex and Opex reduction per site
  • Granularly scalable for capacity, compute or performance in the same model.
  • Automated failover on site.
  • Single upgrade/maintenance procedure
  • Single support call
  • Single architecture for the Datacenter and Remote Sites (including 3+ nodes for larger sites)
  • Integration with DR, Backup, Cloud and Management framework.


Apart from the physical design, VSAN ROBO licensing now provides a per-VM model (purchased in 25 VM packs). This per-VM model suits customers running fewer than 25 VMs per site and is generally a more cost effective way of achieving licensing nirvana than the per-CPU model in smaller sites. The beauty of the per-VM model is that a customer can break up the 25 VM pack amongst many sites. Many customers have 5-10 VMs per site, which means a single pack could potentially stretch across 5 (or even more) sites. Note that you cannot license more than 25 VMs per site in this model; if you have more than 25 VMs per site then per-CPU will be more cost effective anyway.

To get started you’ll need:

  • 2 low cost economy Ready Nodes (the 2 or 4 Series might be ideal and can be a very low cost option)
  • VSAN + vSphere ROBO SKU (25 VM pack), split across sites if required.
  • A witness VM deployed in the DC (one required per site).
  • vCenter in the main Datacenter, or in the remote site if required.

Here is a look at the HP Ready Node HY-2 (Hybrid). You can check out the offerings from other vendors at http://www.vmware.com/resources/compatibility/search.php?deviceCategory=vsan

[Screenshot: HP Ready Node HY-2 (Hybrid) specification]

Other optional add-ons

  • vDPA and VSAN Replication (5 min RPO) included
  • DR or Backup could be to a separate VSAN Cluster in the DC (avoid prod infra for DR)
  • vCloud Air DRaaS
  • vRealize Operations Management Pack for Storage Devices (Manage VSAN with vROPS)

I think this is a breakthrough for many customers looking for a robust, low cost model for their remote sites: it changes the economics of remote-site infrastructure. The cost savings also grow linearly as sites are deployed, which for many customers can add up to hundreds of thousands of dollars in Capex savings. The additional benefits of using VSAN, such as simplified configuration, lower operational overhead and a scalable architecture, all come into play as well, which is the icing on the cake for many customers.

Happy Days!

Setting VSAN.CLOMRepairDelay Cluster-Wide with PowerCLI

Virtual SAN advanced setting VSAN.ClomRepairDelay specifies the amount of time VSAN waits before rebuilding a disk object after a host is either in a failed state (absent failures) or in Maintenance Mode. By default, the repair delay value is set to 60 minutes; this means that in the event of a host failure, VSAN waits 60 minutes before rebuilding any disk objects located on that particular host.


Currently, the advanced setting VSAN.ClomRepairDelay has to be set on a per-host basis. This can be painful when there are a lot of hosts in the cluster.

There is a KB article with instructions on how to do this on a per-host basis. I thought it would be a useful reminder to show customers how to leverage PowerCLI to change this setting for all hosts in the cluster at once. Again, this gets more valuable the more hosts you have in the cluster.

First you’ll need to connect to the vCenter Server using Connect-VIServer and enter your credentials.
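For example (vcenter01.lab.local is just a placeholder here, substitute your own vCenter and credentials):

PS C:\> Connect-VIServer -Server vcenter01.lab.local -Credential (Get-Credential)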

Use the following one-liner to check the current setting first:

PS C:\> Get-VMHost | Get-AdvancedSetting -Name "VSAN.ClomRepairDelay"

Name                   Value   Type     Description
----                   -----   ----     -----------
VSAN.ClomRepairDelay   60      VMHost

Then pipe that into Set-AdvancedSetting to change the value to whatever you like (30 in this case):

PS C:\> Get-AdvancedSetting -Entity (Get-VMHost) -Name "VSAN.ClomRepairDelay" | Set-AdvancedSetting -Value 30

Perform operation?

Modifying advanced setting 'VSAN.ClomRepairDelay'.
[Y] Yes  [A] Yes to All  [N] No  [L] No to All  [S] Suspend  [?] Help (default is "Y"): a
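If you would rather not answer the confirmation prompt, Set-AdvancedSetting also accepts -Confirm:$false to suppress it:

PS C:\> Get-AdvancedSetting -Entity (Get-VMHost) -Name "VSAN.ClomRepairDelay" | Set-AdvancedSetting -Value 30 -Confirm:$false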

Now keep in mind that for the setting to actually take effect you will be required to restart the clomd service on every host in the cluster by issuing /etc/init.d/clomd restart.

I wrote a quick script to do this. It requires plink.exe as it needs to SSH to every host to restart the service:

[Screenshot: setclomdelay.ps1]
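For reference, here is a minimal sketch of the same idea (not the actual script), assuming SSH is enabled on every host, plink.exe sits in the current directory and all hosts share the same root password:

$delay   = 30
$rootPwd = Read-Host "ESXi root password"

Get-VMHost | ForEach-Object {
    # Change the advanced setting on this host
    Get-AdvancedSetting -Entity $_ -Name "VSAN.ClomRepairDelay" |
        Set-AdvancedSetting -Value $delay -Confirm:$false
    # Restart clomd over SSH via plink.exe so the new value takes effect
    & .\plink.exe -ssh "root@$($_.Name)" -pw $rootPwd "/etc/init.d/clomd restart"
}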

The script is provided with no support, guarantees or warranty. You should test it in a non-prod environment if you want to use it. Grab the script here: setclomdelay.ps1

NOTE: There are future improvements I could make to the script, but it works.

Now, it's important to understand the motivation for this. If you truly have a use case for lowering the default delay to 45 minutes, 30 minutes or some other value, then fine, although I personally don't recommend it. Where I have seen this used is in POC or lab environments where you want to validate the behaviour. In a production environment? No.

If the goal is simply to rebuild data faster without waiting for the timeout, then we have you covered.

The VSAN Health Services plugin has a function that allows you to do exactly this: it's called Immediate Object Repair. I blogged about it here http://vsanteam.info/immediate-object-repair-in-vsan/ and there is also a video on YouTube which shows the process in action: https://www.youtube.com/watch?v=uV2MIsqZzzk&list=PLjwkgfjHppDtONKrts8wrmZpdf35VCD7y&index=4

So if your goal is to rebuild data faster when you realise you don’t want to wait for 60 minutes, don’t go changing the ClomRepairDelay, just use the Health Services Plugin.

How many Fault Domains?

Some weeks ago I was involved in a discussion with a customer about their design around Fault Domains. An interesting scenario was proposed and I was asked for my opinion and validation of the approach. Before I talk about the solution, let me recap what Fault Domains in VSAN are.

The default storage policy in VSAN maintains 2 equal copies of a data object plus 1 witness (metadata) object; we call this Failures to Tolerate = 1 (FTT=1). Traditionally, these objects are automatically distributed across separate hosts in the cluster, allowing for the loss of 1 host, disk or other component in the cluster. Such a loss only results in the VM's compliance being negatively reported until the rebuild operation has completed, and involves no VM downtime at all.

In VSAN 6.0 we introduced Fault Domains, which allow a customer to place hosts in a designated Fault Domain. If you are configuring Fault Domains, VSAN requires at least 3 of them, with a minimum of 1 host per Fault Domain. In the primary example a Fault Domain is a rack, hence why the feature is sometimes referred to as 'Rack Awareness'. In this scenario, when VSAN distributes the copies and witness described above it honours the Fault Domain configuration, meaning that no more than a single component making up the VM's data will reside in a single rack. A VM with the default FTT=1 will have its data components placed on hosts in 3 different Fault Domains. This allows an entire rack of infrastructure to be lost without any data availability impact to the VM. The image on the left represents data placement without Fault Domains configured; the image on the right represents data placement with Fault Domains configured.

[Screenshots: data placement without Fault Domains (left) and with Fault Domains (right)]

Configuring Fault Domains is done with a few clicks of the mouse in the vSphere Web Client.

[Screenshot: Fault Domain configuration in the vSphere Web Client]
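You can also check the Fault Domain assignment of each host from PowerCLI. The snippet below is only a sketch: it assumes the esxcli vsan faultdomain namespace is present in your ESXi build, so verify that in your own environment before relying on it:

PS C:\> Get-VMHost | ForEach-Object {
    $esxcli = Get-EsxCli -VMHost $_
    # Equivalent of running "esxcli vsan faultdomain get" on the host
    $_.Name
    $esxcli.vsan.faultdomain.get()
}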

In the scenario proposed, the customer had 4 racks' worth of infrastructure and had purchased only the first 12 nodes of a larger deployment to commence their project. The cluster was eventually going to be 24 nodes in total, but we had 12 on hand to commence the deployment. We had two choices: 3 Fault Domains with 4 hosts in each, or 4 Fault Domains with 3 hosts in each, represented by Option A and Option B below.

[Screenshots: Option A, 3 Fault Domains with 4 hosts each (left); Option B, 4 Fault Domains with 3 hosts each (right)]

Both of these designs are perfectly valid, and in some smaller customer implementations 3 racks may not even be available, in which case you may not configure Fault Domains at all. However, in this customer's scenario, where rack space was at a premium and the cluster was building out to a much larger scale, there was one primary benefit to the 4 Fault Domain approach.

For a moment let's look at Option A with 3 Fault Domains. Picture a full production environment which might place a total of 400 data components on the hosts in each rack (100 per host). Now picture the impact of the loss of Rack A. Not only have we lost 4 compute nodes, we have also temporarily lost redundancy for 400 data components belonging to the affected VMs.

Now let's look at Option B with 4 Fault Domains. Assuming the same total number of data components in the cluster, we can spread the components further across the cluster, meaning only 300 data components per rack while keeping the same number (1,200) across the entire cluster. Now think about the impact of a rack failure: it would only affect redundancy for 300 components and their VMs, rather than 400. So what, you say? Well, let's put that in the context of rebuilds, rebuild times and IO. We now have 3 remaining Fault Domains available, so we can immediately rebuild the failed components and maintain Fault Domain compliance, and the time taken to rebuild the data components affected by the rack failure onto other hosts is shorter, as there is less data (GB) to read and write. This is often a big positive for many customers. Time-to-recovery is something many customers look at, and VSAN's distributed recovery model is a distinct advantage.
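To make the back-of-the-envelope maths explicit, here is the same comparison in a few lines of PowerShell (the 1,200 components and the 50 GB average component size are purely illustrative figures):

$totalComponents = 1200
$avgComponentGB  = 50   # hypothetical average component size
foreach ($faultDomains in 3, 4) {
    $perRack   = $totalComponents / $faultDomains
    $rebuildGB = $perRack * $avgComponentGB
    "{0} Fault Domains: {1} components per rack, ~{2:N0} GB to re-protect after a rack failure" -f $faultDomains, $perRack, $rebuildGB
}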


In addition, think of Fault Domains in the context of Maintenance Mode. If I am required to do rack maintenance, or just maintenance on the hosts in a single rack, I can maintain data availability during that period. In this case, the hosts in Fault Domain 4 could be used to maintain full data availability/compliance during the maintenance window.

In this case there was enough benefit for the customer to implement 4 Fault Domains rather than 3. It provided a more desirable outcome in the event of a rack failure or maintenance scenario. Now, you could potentially extrapolate this out to 5, 6 or more Fault Domains, however there will come a point where adding racks for the sake of it is impractical. Moreover, you will want to design your Fault Domains in the context of the FTT policy you are likely to configure; as a rule of thumb, FTT=n needs at least 2n+1 Fault Domains. For example, if you plan to implement FTT=2, which means 3 equal data copies plus 2 witnesses, then you would need to distribute these across at least 5 Fault Domains.

[Diagram from the VSAN Design and Sizing Guide]

In general, if you are planning for rack failures or rack maintenance in your datacenter and intend to implement Fault Domains, be sure to think about host and data distribution in your own environment and plan/design accordingly.

 

VSAN in 3 Minutes – Youtube Series

A while ago we started a campaign to produce short, sharp video demonstrations of VMware Virtual SAN. We chose 3 minutes for two reasons: firstly, it's about the right length for people's attention span, and secondly, it's a decent amount of time in which to demonstrate a single product feature or function.

The goal here is enablement, and demonstrating some of VSAN's capabilities in the simplest possible way.

Hope you enjoy them. They are updated periodically so be sure to check back.

https://www.youtube.com/playlist?list=PLjwkgfjHppDtONKrts8wrmZpdf35VCD7y

 

Ready Set Go! – Welcome to the new VSAN Ready Nodes

Today the VMware Virtual SAN team launched the new Ready Node initiative.

The VSAN Ready Nodes are our primary go-to-market from a hardware enablement perspective for VSAN, and the model provides many benefits to customers. The Ready Node model is a turnkey solution for Virtual SAN deployment: validated and certified configurations, jointly recommended by VMware and the server OEMs, that provide both a simplified single SKU and flexibility of choice in hardware vendor and hardware configuration. In fact, many customers use the Ready Nodes today as the basis for their hardware configuration, whilst retaining the flexibility to add more capacity, memory and so on.

VMware works very closely with the server OEMs to identify the top selling server platforms suitable for HCI, and we jointly work to certify these platforms for Virtual SAN.

As of today, we have re-branded the Ready Nodes from Low, Medium and High to 2 Series, 4 Series, 6 Series and 8 Series; the 6 Series and 8 Series are the All Flash models. In addition, we are introducing a new Low Cost Economy model (HY-4 Series). The reason for this is that there was a significant gap between the Low and Medium nodes, so we now have a new Ready Node as part of the HY-4 Series which sits in the sweet spot for customers looking for a better deal on their hardware for the most popular configuration in the field.

We also launched the new Ready Node led VCG. This is a new and improved addition to the VCG which allows the customer to select from dropdown boxes whether they require a Ready Node, Hybrid or All Flash, as well as the supported release, server OEM, generation, capacity and so on.

[Screenshot: Ready Node selection criteria on the VCG]

The customer can then search the nodes in each series that are suitable for their requirements.

[Screenshot: Ready Node search results on the VCG]

Notice the blue hyperlinks in the details, where a customer can still click on the exact disks or controller for more information on the specifics of the certified components, such as endurance and performance class and firmware versions.

It should be noted that customers still have the flexibility to 'Build Your Own'. VMware strongly recommends using certified Ready Nodes that are validated to provide predictable performance and scalability for your Virtual SAN deployment; however, if you would still like to build your own Virtual SAN from certified components you are able to (the link is near the bottom of the VCG site).

The link to the VSAN VCG remains the same: http://www.vmware.com/resources/compatibility/search.php?deviceCategory=vsan or http://vmwa.re/vsanhcl for those who can't remember the full URL.

Hopefully this gives customers a more flexible and easier way to choose their nodes. Happy hunting.