There are many great advantages that are offered by virtualizing your infrastructure and running virtual resources to serve out business-critical workloads. In the case of VMware vSphere, it provides many notable features and capabilities that provide high-availability in the environment as well as automated workload scheduling to ensure the most efficient use of hardware and resources in your vSphere environment.
In this post, we will talk about two of the core cluster-level features of vSphere in the enterprise – vSphere HA and DRS. You have most likely seen both of these referenced along with running vSphere in the enterprise.
What is vSphere HA and DRS? What do they do?
How do you benefit by running both in your vSphere environment?
Let’s take a look at a basic introduction to HA and DRS in VMware vSphere and see how they compare and the benefits of using them.
VMware vSphere Clusters
One of the obvious advantages and best practices when utilizing VMware vSphere to run business-critical workloads is to run a vSphere Cluster.
What is a vSphere Cluster?
A vSphere cluster is a configuration of more than one VMware ESXi server aggregated together as a pool of resources contributed to the vSphere cluster. Resources such as CPU compute, memory, and in the case of software-defined storage like vSAN, storage, are contributed by each ESXi host.
Why is running your business-critical workloads on top of a vSphere Cluster important?
When you think about the advantages provided by running a hypervisor, it allows more than one server to run on top of a single set of physical hardware. Virtualizing workloads in this way provides many efficiency benefits in orders of magnitude when compared to running a single server on a single set of physical hardware.
However, this can also become the Achilles heel of a virtualized solution, since the impact of a hardware failure can affect many more business-critical services and applications. You can imagine that if you only have a single VMware ESXi host running many VMs, the impact of losing that single ESXi host would be immense.
This is where running multiple VMware ESXi hosts in a vSphere Cluster really shines.
However, you may ask yourself, how does simply running multiple hosts in a cluster enhance your high-availability? How does a host in the vSphere Cluster “know” if another host has failed? Is there a special mechanism that is used to take care of managing the high-availability of workloads running on a vSphere Cluster? Yes, there is. Let’s see.
What is HA in VMware?
VMware realized the need to have a mechanism to be able to provide protection against a failed ESXi host in the vSphere Cluster. With this need, VMware High-Availability (HA) was born.
VMware vSphere HA delivers the following benefits:
VMware vSphere HA is cost-effective and allows automated restarts of VMs and vSphere hosts when there is a server outage or an operating system failure detected in the vSphere environment
Monitors all VMware vSphere hosts & VMs in the vSphere Cluster
Delivers high-availability to most applications running in virtual machines regardless of the operating system and applications.
The beauty of VMware’s vSphere HA solution that is implemented via the VMware Cluster is the simplicity for which it can be configured. With a few clicks through a wizard-driven interface, high-availability can be configured. How does this compare with traditional “clustering” technologies?
Windows Server Failover Clustering Comparison
Windows Server Failover Clustering (WSFC) has become the clustering technology that most think of when they have clustering technology in mind. The problem seen with WSFC is that it requires a lot of specialized expertise to run WSFC services correctly, especially when it comes to upgrades, patching, and general operational tasks.
Contrasting vSphere HA with WSFC, the operational overhead is minimal by comparison with WSFC. There is little chance that HA can be configured incorrectly as it is either enabled on a cluster or not. With WSFC, there are many considerations that need to be made when configuring WSFC to avoid both configuration and implementation mistakes. Think about the following:
- Failover clustering requires applications that support clustering (SQL, etc)
- Failover clustering requires quorum is configured correctly
- Not supported by many legacy operating systems and applications
- Requires complexity of cluster network names, resources, and networking
Windows Server Failover Clustering is advertised to provide near zero-downtime at the application level. However, when you add in the expertise required for a properly functioning HA solution, along with the proper implementation of WSFC, the risks can begin to outweigh the benefits of using WSFC for high-availability of applications and services. This is especially true when for most organizations who may not truly need a “zero downtime” solution. Additionally, your application has to be designed to take advantage of WSFC and work properly with WSFC technology.
While vSphere HA does require a restart of the virtual machines on a healthy host when a failover occurs, it requires no installation of additional software inside the guest virtual machines, no complex configurations of additional clustering technologies, and applications or OS’s do not have to be designed to work with particular clustering technology.
Legacy operating systems and applications generally have limited abilities when it comes to supported technologies to provide high-availability. So, there literally may be no native options to provide failover functionality in the case of hardware failures.
The vSphere HA high-availability mechanism works and is simple to implement, configure, and manage. Additionally, this is a technology that is well tested in thousands of VMware customer environments, so it has a stable and long history of successful deployments.
General Overview of vSphere HA Behavior
By using the benefits provided to the ESXi hosts in a vSphere Cluster, in its most basic form, vSphere HA implements a monitoring mechanism between the hosts in the vSphere Cluster. The monitoring mechanism provides a way to determine if any host in the vSphere Cluster has failed.
In the infographic below, a two-node vSphere Cluster has experienced a failure of one of the ESXi hosts in the vSphere Cluster. The vSphere Cluster has vSphere HA enabled at the cluster level.
After vSphere HA recognizes that a host in the vSphere Cluster has failed, the HA process moves the registration of VMs from the failed host over to a healthy host.
After the VMs are registered on a healthy host, vSphere HA restarts all the VMs of the failed host on a healthy ESXi host in the cluster where the VMs were reregistered. The only downtime incurred is with the restart of the VMs on a healthy host in the vSphere Cluster.
VSphere HA Technical Overview
Prerequisites for vSphere HA
You may wonder what underlying prerequisites may be required in order for vSphere HA to work. To you simply need a VMware Cluster to enable HA? Unlike Windows Server Failover Clustering, there are only a few requirements that need to be in place for HA to work.
- At least two ESXi hosts
- At least 4 GB of memory configured on each host
- vCenter Server
- vSphere Standard License
- Shared storage for VMs
- Pingable gateway or another reliable network node
If you notice, there is no quorum component required, no complex network naming involved, and no other special cluster resources that need to be in place.
VMware vSphere HA Master vs Subordinate Hosts
When you enable vSphere HA on a cluster, a particular host in the vSphere Cluster is designated as the master of vSphere HA. The remaining ESXi hosts in the vSphere Cluster are configured as subordinates in the vSphere HA configuration.
What role does the vSphere HA ESXi host that is designated as the master play? The vSphere HA master node:
- Monitors the state of the slave subordinate hosts – If the subordinate host fails or is unreachable, the master host identifies which VMs need to be restarted
- Monitor the power state of all VMs that are protected. If a VM fails, the master vSphere HA node ensures the VM is restarted. The vSphere HA master decides where the VM restart takes place (which ESXi host).
- Keeps track of all the cluster hosts and VMs that are protected by vSphere HA
- Is designated as the mediator between the vSphere Cluster and vCenter Server. The HA master reports the cluster health to vCenter and provides the management interface to the cluster for vCenter Server
- Can run VMs themselves and monitor the status of VMs
- Stores protected VMs in cluster datastores
vSphere HA Subordinate Hosts:
- Run virtual machines locally
- Monitor the runtime states of the VMs in the vSphere Cluster
- Report state updates to the vSphere HA master
Master Host Election and Master Failure
How is the vSphere HA master host selected? When vSphere HA is enabled for a cluster, all active hosts (no maintenance mode, etc) participate in electing the master host. If the elected master host fails, a new election takes place where a new master HA host is elected to fulfill that role.
VMware vSphere HA Cluster Failure Types
In a vSphere HA enabled cluster, there are three types of failures that can happen to trigger a vSphere HA failover event. Those host failure types are:
- Failure – A failure is intuitively what you think. A host has stopped working in some form or fashion due to hardware or other issues.
- Isolation – The isolation of a host generally happens due to a network event that isolates a particular host from the other hosts in the vSphere HA cluster.
- Partition – A partition event is characterized by a subordinate host losing network connectivity to the master host of the vSphere HA cluster.
Heartbeating, Failure Detection, and Failure Actions
How does the master node determine if there is a failure of a particular host?
There are several different mechanisms the master node uses to determine if a host has failed:
- The master node exchanges network heartbeats with the other hosts in the cluster every second.
- After the network heartbeat has failed, the master host checks for host liveness check.
- The host liveness check determines if the subordinate host is exchanging heartbeats with one of the datastores. Then it sends ICMP pings to its management IP addresses
- If direct communication with the HA agent of a subordinate host from the master host is not possible and the ICMP pings to the management address fail, the host is viewed as failed and VMs are restarted on a different host.
- If it is found that the subordinate host is exchanging heartbeats with the datastore, the master host assumes the host is in a network partition or is network isolated. In this case, the master simply monitors the host and VMs
- Network isolation is the event where a subordinate host is running, but can no longer be seen from an HA management agent perspective on the management network. If a host stops seeing this traffic, it attempts to ping the cluster isolation addresses. If this ping fails the host declares it is isolated from the network
- In this case, the master node monitors the VMs that are running on the isolated host. If the VMs power off on the isolated host, the master node restarts the VMs on another host
As mentioned above, one of the metrics used to determine failure detection is datastore heartbeating. What is this exactly? VMware vCenter selects a preferred set of datastores for heartbeating. Then, vSphere HA creates a directory at the root of each datastore that is used for both datastore heartbeating and for keeping up with the list of protected VMs. This directory is named .vSphere-HA.
There is an important note to remember regarding vSAN datastores. A vSAN datastore cannot be used for datastore heartbeating. If you only have a vSAN datastore available, there can be no heartbeat datastores used.
- VM and Application Monitoring
Another extremely powerful feature of vSphere HA is the ability to monitor individual virtual machines via VMware Tools and restart any virtual machines that fail to respond to VMware Tools heartbeats. Application Monitoring can restart a VM if the heartbeats for an application that is running are not received.
- VM Monitoring – With VM Monitoring, the VM Monitoring service uses VMware Tools to determine if each VM is running by checking for both heartbeats and disk I/O generated by VMware Tools. In the event these checks fail, the VM Monitoring service determines most likely the guest operating system has failed and the VM is restarted. The additional disk I/O check helps to avoid any unnecessary VM resets if VMs or applications are still functioning properly.
Application Monitoring – The application monitoring function is enabled by obtaining the appropriate SDK from a third-party software vendor that allows setting up customized heartbeats for the applications to be monitored by the vSphere HA process. Much like the VM Monitoring process, if application heartbeats stop being received, the VM is reset.
Both of these monitoring functions can be further configured with monitoring sensitivity and also maximum per-VM resets to help to avoid resetting VMs repeatedly for software or false positive errors.
VMware vSphere HA is a great way to ensure that your vSphere Cluster provides very resilient high-availability to protect against general host failures of ESXi hosts in your vSphere Cluster.
What about ensuring efficient use of resources in your vSphere Cluster? Let’s take a look at the next vSphere Cluster provision to help ensure efficient use of your vSphere Cluster resources and capacity.
What is DRS in VMware?
VMware Distributed Resource Scheduler (DRS) is a really powerful feature when running vSphere Clusters. It provides scheduling and load balancing across a vSphere Cluster. VMware DRS is the feature found in vSphere Clusters that ensures that virtual machines running inside your vSphere environment are provided with the resources they need to run effectively and efficiently.
VMs are generally subject to DRS early in life as from their first power on in a DRS-enabled cluster, DRS places the VMs on the best host configured to provide the required resources to the VM as soon as they are powered on. Additionally, DRS strives to keep vSphere clusters balanced from a resource usage perspective.
Even if a vSphere Cluster is balanced at a certain point in time, VMs may get moved around or change in such a way that an imbalance of cluster resources can creep back into the environment. When clusters become imbalanced, it can be detrimental to the overall performance of virtual machines running in a vSphere Cluster.
By default, DRS runs automatically on a vSphere cluster every five minutes to determine the balance of a vSphere Cluster and see if any changes need to be made to make more effective use of the resources.
VMware DRS Requirements
To take advantage of VMware DRS, there are several requirements that need to be met to ensure taking advantage of the Distributed Resource Scheduler functionality. These include:
- A cluster of ESXi hosts
- vCenter Server
- Enterprise Plus License
- vMotion is required for automatic load balancing
Read More: How to Configure a vSphere DRS Cluster
VMware DRS Actions
When VMware DRS runs on a vSphere Cluster every five minutes, it determines if there are any imbalances that exist in the cluster. If so, a vMotion will be performed to move designated VMs from one ESXi host to another.
How exactly does DRS determine if virtual machines are better suited on one ESXi host or another?
DRS runs a special algorithm to determine the right ESXi host that should house a particular VM. When a VM is powered on, this algorithm takes into consideration the resource distribution across the vSphere Cluster after it ensures there are no constraint violations if a particular VM is placed on a particular ESXi host.
Additionally, the demand of the VM itself is taken into consideration so the VM will hopefully never be starved for resources when it is powered on. What is included in VM demand? A VM’s demand includes the amount of resources needed to run.
- For CPU demand, this is calculated based on the amount of CPU the VM is currently consuming
- For memory, demand is calculated based on the formula: VM memory demand = Function(Active memory used, Swapped, Shared) + 25% (idle consumed memory). This shows the DRS memory balance is based mainly on a VM’s active memory usage while considering a small amount of its idle consumed memory as a cushion for any increase in workload.
DRS Automation Levels
One of the interesting features of DRS is the DRS automation levels. While DRS continues to scan the vSphere Cluster and provide recommendations every 5 minutes, you can determine whether or not DRS is able to enact its recommendations automatically or only suggest changes that should be made. DRS has three DRS automation levels. These include:
- Fully automated – In the fully automated approach, DRS applies both the initial placement and load balancing recommendations automatically
- Partially Automated – With partial automation, DRS applies recommendations only for initial placement of VMs
- Manual – In manual mode, you must apply the recommendations for both initial placement and load balancing recommendations
DRS Migration Thresholds
DRS includes another very useful setting to control the amount of imbalance that will be tolerated before DRS recommendations will be made. There are five DRS migration thresholds to control the amount of imbalance tolerated.
The range is 1 (most conservative) to 5 (most aggressive).
With more aggressive settings, DRS tolerates less imbalance in a cluster. The more conservative, the more DRS tolerates imbalance.
VMware DRS VM/Host Rules
There is an extremely useful feature found when using VMware DRS to control the placement of VMs in your vSphere DRS-enabled clusters. The VM/Host Rules allow you to run specific VMs on specific ESXi host. You can think of this as affinity rules in a way.
The VM/Host rules allow you to:
- Keep virtual machines together
- Separate virtual machines
- Tie Virtual Machines to specific hosts
- Tie Virtual Machines to virtual machines
Shown below is an example of creating a VM/Host rule for virtual machines and ESXi hosts.
What type of use case exists for these VM/Host rules? One of the classic use cases that exist is with domain controllers. Generally speaking, if you are running all of your domain controllers in a virtualized environment such as a vSphere Cluster, you want to make sure you have your domain controller virtual machines separated from one another inside the cluster. In this way, if you have an ESXi host go down along with one of your domain controllers, you still have a domain controller that is subject to a Separate Virtual Machines rule that is keeping it off the same host as another DC.
VM Overrides for DRS
The vSphere Cluster provides great granularity for operations affecting individual VMs inside the vSphere Cluster. You can create VM Overrides to override global settings set at the cluster level for HA and DRS to define more specific settings for each individual VM.
CPU and Memory Utilization Summary
DRS provides a great high-level view of the CPU utilization summary of the CPU resources of ESXi hosts in the vSphere Cluster. Navigate to
The same high-level overview can be viewed for memory consumption as well. Navigate to
The Best of Both Worlds
Is VMware vSphere HA and VMware DRS competing technologies?
No, they are not. In fact, it is highly recommended to use both vSphere HA and VMware DRS together to combine automatic failover with load balancing features and functionality. This results in a much more resilient and more balanced vSphere environment.
If failure of an ESXi host occurs, vSphere HA will restart the VMs on the remaining healthy hosts in a vSphere Cluster. So, the first priority, of course, is the availability of virtual machine resources. VMware DRS will then run and determine if any imbalance exists between the ESXi hosts running the workloads and will make recommendations to resolve any imbalances in the cluster based on the configured migration threshold. Based on the automation level, these recommendations will either be automatically actioned or only recommended if not fully automated.
Final Thoughts on VMware vSphere HA and DRS
Running both VMware vSphere HA and DRS are highly recommended in a production vSphere Cluster. Using both technologies helps to make your workloads highly-available and ensures they continuously have the resources required based on the CPU/memory demands of the VM.
Understanding how both mechanisms work helps you as a vSphere administrator leverage both technologies in the best way possible and in accord with best practices. Among the benefits that both technologies bring to the table, each feature is extremely easy to enable and configure. With a few simple clicks in the properties of your vSphere Clusters, you can quickly start benefiting from these available cluster-level features.