Storage virtualization has been around for many years and keeps gaining traction. VMware’s vSAN is no exception, with more than 14,000 customers worldwide. By leveraging local storage to create a shared datastore, it removes the complexity of designing and managing standalone storage systems made up of storage arrays and SAN fabrics. It is also much easier to set up and configure, while delivering great performance in both hybrid and all-flash deployments thanks to its caching tier.
However, one of the traps of such an environment is that administrators may be tempted to treat a vSAN-enabled node like any other traditional vSphere host. Extra care should be taken with hyper-converged nodes: each one is now a member of the shared storage accessible by all the nodes in the cluster, and its purpose is also to store the vSAN objects that are spread across the participating nodes. When a member host of a vSAN cluster enters maintenance mode, the cluster capacity is automatically reduced, because that host no longer contributes capacity to the cluster.
This is why, unlike a standard vSphere host, which you can place into maintenance mode without giving it much consideration, you will need to choose one of three data evacuation modes when dealing with a vSAN-enabled node. It is important to understand what each one does and in which cases you might want one or another.
Evacuate all data to other hosts
This option is straightforward in how it works and largely self-explanatory.
- It is the recommended mode when you want to take a host out of the cluster for an extended period of time or permanently decommission it
- vSAN evacuates all data to other hosts in the cluster, maintains or fixes availability compliance for the affected components, and protects data as long as sufficient resources exist in the cluster. The host cannot enter maintenance mode until all data has been evacuated
- It is the most time-consuming and resource-intensive mode in terms of storage I/O and network activity
- All VMs are still in compliance with their storage policy after the host enters maintenance mode (i.e. VMs A and B in exhibit 1)
Ensure data accessibility from other hosts
This option is the default because it matches most scenarios.
- It is recommended when you want to take a host out of the cluster temporarily, for reasons such as host patching, DIMM replacement, or a standard reboot, and you plan on getting the host back into the cluster fairly quickly. It is not appropriate if you want to remove the host permanently, or if it is going to be down for several days, as your now-non-compliant workloads are vulnerable to a host failure
- Only partial data evacuation is required: just enough to allow running virtual machines to retain access to their storage while the host is down
- Some virtual machines may no longer be compliant with their storage policy after the evacuation (i.e. VM B in exhibit 2). How many depends on the size of the failure domains, the number of hosts in the cluster, the distribution of vSAN objects, and so on. All objects with Primary Level of Failures to Tolerate (PFTT) = 0 will still be compliant (i.e. VM A in exhibit 2)
- The maintenance mode window will tell you how much data needs to be moved and how many objects will become non-compliant. Should data need to be moved, it will also tell you whether there is enough space on the remaining hosts to receive it
- By default, vSAN will wait 60 minutes from the moment the host enters maintenance mode before starting a resynchronization task to bring the VMs back to compliance with their storage policies. This setting can be changed if you plan for more than 60 minutes of downtime, for example; however, it is best to change it back to the default 60 minutes when you’re done. Also note that some improvements were made in the latest release, vSAN 6.7 Update 1, which are detailed further down in this blog
No data evacuation
Again, this option is self-explanatory, but it should be used with great caution.
- vSAN does not evacuate any data from the host
- All objects with PFTT = 1 or higher will remain accessible (i.e. VM B in exhibit 3)
- All objects with PFTT = 0 that reside locally on this host will become inaccessible (i.e. VM A in exhibit 3). This is where you should be careful, as it might affect running virtual machines. The maintenance mode UI will warn you beforehand about how many objects will become inaccessible
- This is the fastest method of placing a host in maintenance mode
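For reference, the evacuation mode can also be picked from the command line when you put the host in maintenance mode with esxcli. A sketch, run directly on the host (the mode names below follow the esxcli vocabulary; check `esxcli system maintenanceMode set --help` on your build before relying on them):

```shell
# Enter maintenance mode and choose the vSAN data evacuation mode.
# -m accepts: evacuateAllData | ensureObjectAccessibility | noAction
esxcli system maintenanceMode set -e true -m ensureObjectAccessibility

# Exit maintenance mode once the work is done
esxcli system maintenanceMode set -e false
```

`ensureObjectAccessibility` maps to the default “Ensure data accessibility from other hosts” option, `evacuateAllData` to a full evacuation, and `noAction` to no data evacuation.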
vSAN 6.7 Update 1 – Maintenance mode improvements
I thought I would add a quick chapter about this, as the latest release brought a few nice improvements to maintenance mode in vSAN. You can find more information about them, as well as the known issues, in the vSAN 6.7 Update 1 release notes.
EMM Pre-Check Simulation
vSAN runs a full simulation of the data movement as part of the validation process for entering maintenance mode. The simulation tells you whether the EMM (Enter Maintenance Mode) task will succeed or fail, meaning you won’t find out it failed only after moving 2 TB of data and putting strain on the other nodes.
Object Repair Timer
By default, vSAN starts the resynchronization of all the objects after the host has been in maintenance mode for 60 minutes; if your maintenance takes 62 minutes, you’re off to a rebuild of all the non-compliant objects. This one is my personal favourite, because I never know whether the maintenance I am doing on a host will take less than 60 minutes. I don’t want to trigger unnecessary rebuilds, so I like to be on the safe side and usually change the rebuild timer to something like 400 minutes (and of course change it back to 60 when I’m done). The way to do it is via esxcli, so I have a small PowerCLI function to change the rebuild timer quickly, but it is not always intuitive for folks who don’t use PowerCLI that much. And let’s be honest, you don’t want to SSH into each host every time you want to change it…
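For environments prior to 6.7 Update 1, the timer in question is the VSAN.ClomRepairDelay advanced setting, set per host. A sketch of the esxcli approach described above (the 400-minute value matches my habit, not a recommendation; verify the option path on your build first):

```shell
# Bump the repair delay from the default 60 minutes to 400 minutes
esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 400

# Confirm the new value
esxcli system settings advanced list -o /VSAN/ClomRepairDelay

# On older builds, restarting clomd was needed for the change to apply
/etc/init.d/clomd restart
```

The PowerCLI equivalent is a `Get-AdvancedSetting -Name "VSAN.ClomRepairDelay"` piped into `Set-AdvancedSetting`, looped over every host in the cluster, which is exactly the kind of small function I keep around.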
In vSAN 6.7 Update 1, you can change this setting in the UI under vSAN > Services > Advanced Options. As you can see, the setting is cluster-wide, which is great, as it is applied to all the hosts for you.
That said, I would like the option to temporarily change the rebuild delay in the Enter Maintenance Mode UI itself; maybe in Update 2?
Canceling Maintenance Mode
vSAN 6.7 Update 1 improves the efficiency of canceling an EMM task. In previous releases, canceling could lead to unexpected resync behavior, and the pending resyncs of the EMM task would not get cancelled.
vSAN clusters have a lot more going on than “normal” ones, and it can be overwhelming at first to learn how to tame the beast. As the great Duncan Epping keeps repeating, it cannot be stressed enough that vSAN hosts must be approached differently from regular vSphere hosts. You are not managing dumb compute providers anymore; you are managing a shared storage solution, which brings its own challenges into the pot. While placing a vSphere host in maintenance mode is a matter of migrating a few VMs with vMotion, and perhaps Storage vMotion if you have local datastores, vSAN nodes require you to decide what to do with the data and to take the duration of the maintenance window into account.
However complicated it may look, VMware has improved the platform a great deal since its early versions, making it easier and safer to use.