These days, when I speak with customers, they like to discuss all the operational impacts and failure scenarios when deciding on Virtual SAN: what happens when a disk fails, or a controller, a node, a network and so on. In some instances customers have even asked what happens if all of their nodes fail at once, perhaps fearing data loss, corruption or problems when power is restored. Those fears are unfounded; this is absolutely NOT what happens. Read on.
Fortunately, we have a pretty good process for managing this by now and it’s pretty simple. On occasions when I’ve been travelling I haven’t had the opportunity to shut the VSAN lab down gracefully, and to this day it has never caused me an issue.
My little lab is modest and runs on 3 Dell R720 servers. This week I did an in-place upgrade to vSphere 6.0U1a (VSAN 6.1) using VMware Update Manager to get all the latest goodies. I am not running an Active-Active VSAN Stretched Cluster (more on that here) in my work lab, so I need to deal with localised outages as efficiently as I can.
It was late afternoon on the Friday and I received a reminder of the impending power outage. I quickly took a screenshot of the current lab as it was running at that time and decided to ease my way into the weekend and let fate take over.
Lab Screenshot – Prior to shutdown (Friday 2pm)
Fast forward, Monday morning. I entered the server room to check my servers had powered on. I sat down at my desk and checked to ensure vCenter had come back up as it is configured to do. I logged into the vSphere Web Client to find the environment in perfect health and my virtual machines in a healthy state.
Lab Screenshot (Monday 10am)
I also used the VSAN Health Plugin to check the data and object health for any weirdness. Everything seemed compliant and healthy. Happy Days! Just to be sure, you can see the timeline in the Events tab.
So with VSAN we are able to provide crash consistency for the virtual machines: all IO that has reached the cache tier is held persistently until the cluster returns to a normal health status, at which point de-staging may be required. I quickly logged in to a few Windows and Linux guest machines to check that they were healthy and found no issues.
As you can see, VSAN worked as designed. As I stated before, this has happened on several occasions over the last 2 years and I am yet to experience a problem on return; this time I thought I’d share it with you. It’s actually pretty boring, but I’d argue that’s the best way to deal with major power outages and/or hardware failures.
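The idea that acknowledged writes sit safely in a persistent cache and are de-staged once things are healthy again can be illustrated with a toy model. To be clear, this is not how VSAN is implemented; it is a minimal Python sketch in which a plain file stands in for the cache device, a dictionary stands in for the capacity tier, and the names ToyWriteBuffer and simulate_power_loss_and_recovery are made up for this example.

```python
import json
import os
import tempfile


class ToyWriteBuffer:
    """Toy persistent write buffer (illustration only, not VSAN's design)."""

    def __init__(self, log_path):
        self.log_path = log_path

    def write(self, block_id, data):
        # Append the write to the log and force it to stable storage
        # before acknowledging it to the caller.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"block": block_id, "data": data}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        return True  # acknowledged

    def destage(self, capacity):
        # Replay logged writes in order into the capacity tier,
        # then clear the log. Returns the number of writes replayed.
        if not os.path.exists(self.log_path):
            return 0
        replayed = 0
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                entry = json.loads(line)
                capacity[entry["block"]] = entry["data"]
                replayed += 1
        os.remove(self.log_path)
        return replayed


def simulate_power_loss_and_recovery():
    log_path = os.path.join(tempfile.mkdtemp(), "write_buffer.log")

    buffer = ToyWriteBuffer(log_path)
    buffer.write(1, "a")
    buffer.write(2, "b")
    buffer.write(1, "c")  # later write to the same block
    del buffer            # "power loss": all in-memory state is gone

    capacity = {}                         # hypothetical capacity tier
    recovered = ToyWriteBuffer(log_path)  # "power restored"
    replayed = recovered.destage(capacity)
    return replayed, capacity
```

Calling simulate_power_loss_and_recovery() replays all three acknowledged writes after the simulated outage, and block 1 ends up holding the most recent value because the log is replayed in order. Nothing is lost because nothing was acknowledged before it was on stable storage.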
The best thing I could point you to for understanding failure impact and troubleshooting on VSAN is a very comprehensive document we published, found here: VSAN Troubleshooting Reference Manual – VMware.