High availability is a term that is thrown around every time someone mentions the need for reduced downtime. Before we look into what it is, we must know what it is not: It is not a technology.
It is the end benefit.
Businesses sometimes run an application on servers that can never go offline. The application must run throughout the year with a total downtime of less than about 5 minutes.
In such cases, the application is said to have an uptime rating of five nines (99.999%), which works out to roughly 5.26 minutes of downtime per year. Such a server is categorised as highly available. The uptime rating differs across businesses: each organization fixes a high availability metric according to its own availability requirement.
So, the objective becomes evident. Keep the servers running as long as possible with as little downtime as possible.
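To put an uptime rating into numbers, the yearly downtime budget can be worked out directly from the percentage. Here is a minimal Python sketch of that arithmetic. The function name `allowed_downtime_minutes` is made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year, 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


# Compare three common ratings: three, four and five nines.
for rating in (99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(rating)
    print(f"{rating}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Each extra nine cuts the budget by a factor of ten, which is why five nines leaves only a little over five minutes for the whole year.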
What is a Hyper-V Failover Cluster?
Since avoiding downtime is the primary target, waiting to recover the server after a disaster is out of the question. There needs to be a plan in place that allows immediate failover of the critical server for business continuity. This is done by linking a set of servers (also called nodes) with one another so that, if any cluster node shuts down, all the cluster resources running on that node are migrated to one of the other nodes with minimal downtime.
Microsoft does not have servers that are specifically designed for clusters, so you have to follow the best practices to avoid costly mistakes during the crucial moments of a failover. They are basic, though.
- Before jumping into networking requirements, confirm whether your server supports the cluster role. Windows Server 2008, 2008 R2, 2012, 2012 R2 and 2016 support it, and only x64-based processors work
- On top of that, follow generic tips such as using processors from the same manufacturer. For best performance, take it up a notch and use nodes whose processors belong to the same processor family
- To avoid network congestion, isolate the networks used for different operations. Having distinct IPs for hosts, storage and communication between cluster services is said to be the gold standard of configurations
- A Hyper-V cluster can include a maximum of 64 nodes (host machines), with about 4000 virtual machines permitted per cluster and 1024 virtual machines per node
How do Hyper-V Clusters work?
On a high level, there are three simple steps.
- Enable Failover Cluster Role
- Create a Failover Cluster
- Join the nodes
But if you do a little more than dip your toe, you’ll find there are a few more essential characters involved.
Microsoft Cluster Services
First, the Cluster Service is enabled on each node that is part of the failover cluster. The Microsoft Cluster Service (MSCS) is the most critical element of the entire operation, as it performs the majority of the operations we generally associate with a failover.
Some of the responsibilities of MSCS are:
- Communication with the other nodes
- Monitoring and notifying of a failure
- Control of the Cluster Objects
A Cluster Object can be any element of a cluster: a network, storage, a resource or an interface. The Cluster Service also performs the failover operations while maintaining a consistent image of the cluster across all the nodes. But the Cluster Service alone does not carry out the entire operation. Every node that is part of the cluster has to have a Resource DLL.
When an operation is to be performed on a resource, the Cluster Service sends a request to the Resource Monitor of the node. The Resource Monitor uses the registration information to select the DLL appropriate for the resource type. The request is then passed to the matching function in that DLL, which handles the details of the task for that resource.
The Resource DLL contains the functions that the Resource Hosting Subsystem (RHS) can perform. The RHS monitors the health of all the resources that are part of the cluster. During a server failure, the RHS reads the DLLs to find the functions available for bringing the resources online on the destination nodes.
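The lookup described above, where a request is routed to the right function for a resource type, can be pictured with a small Python sketch. This is only an illustration of the dispatch idea, not how the Cluster Service is implemented. The registry contents, the operation names and `resource_monitor_dispatch` are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: resource type -> functions exposed by its "resource DLL".
ResourceFunctions = Dict[str, Callable[[str], str]]

REGISTRY: Dict[str, ResourceFunctions] = {
    "Virtual Machine": {
        "online": lambda name: f"starting VM {name}",
        "offline": lambda name: f"stopping VM {name}",
    },
    "Physical Disk": {
        "online": lambda name: f"mounting disk {name}",
        "offline": lambda name: f"dismounting disk {name}",
    },
}


def resource_monitor_dispatch(resource_type: str, operation: str, resource_name: str) -> str:
    """Find the handler registered for the resource type and run the operation."""
    functions = REGISTRY.get(resource_type)
    if functions is None:
        raise LookupError(f"no resource DLL registered for type {resource_type!r}")
    handler = functions.get(operation)
    if handler is None:
        raise LookupError(f"{resource_type!r} does not support {operation!r}")
    return handler(resource_name)


print(resource_monitor_dispatch("Virtual Machine", "online", "vm01"))
```

The point of the indirection is that the caller never needs to know how a disk or a VM is brought online. It only names the resource type and the operation.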
Before any of this happens, the cluster is generally designed so that the cluster resources can work without further requirements after failover. To achieve this, when the group is created along with the cluster resource, all the resources that the target depends upon are also selected as Dependent Resources. Similarly, Preferred Nodes are prioritized, defining which nodes the target resources will run on when failover begins.
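To see why dependencies and preferred nodes matter, here is a rough Python sketch of the two decisions involved: in what order to bring resources online, and which node to pick. The resource names, node names and both helper functions are invented for illustration. Falling back to any online node when no preferred node is available is an assumption of this sketch, not documented cluster behaviour.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Optional, Set


def online_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which each resource comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


def pick_target_node(preferred_nodes: List[str], online_nodes: Set[str]) -> Optional[str]:
    """Return the first preferred node that is online, else any online node, else None."""
    for node in preferred_nodes:
        if node in online_nodes:
            return node
    # Assumption: with no preferred node available, any surviving node will do.
    return min(online_nodes) if online_nodes else None


# Hypothetical group: the VM needs its disk and network name, which needs an IP address.
depends_on = {
    "VM": {"Disk", "Network Name"},
    "Network Name": {"IP Address"},
    "Disk": set(),
    "IP Address": set(),
}
order = online_order(depends_on)
target = pick_target_node(["node1", "node2", "node3"], {"node2", "node3"})
print("bring online in this order:", order)
print("fail over to:", target)
```

In this example node1 is down, so the VM lands on node2, the next preferred node, and its disk and network resources are started before the VM itself.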
There is one final character that is integral to how the failover begins in the first place: VMCLUSRES.DLL
Remember when you enabled the Hyper-V role on your server? Behind the scenes, the MSCS installed a resource DLL called VMCLUSRES.DLL between the RHS and the VM to interact with the VM resources running in the failover cluster. Its role is very simple. It tracks whether the critical VM is running by executing a function called VM IsAlive. This function is executed at an interval and the status is sent to the MSCS.
The entire process goes something like this:
Initially, the MSCS analyses the VM’s properties to determine the interval for the VM IsAlive function. Once the interval is set, VMCLUSRES.DLL executes the VM IsAlive function each time the interval expires. After collecting the status of all the VMs, the resource DLL reports them to the RHS, which forwards them to the MSCS. If the VM is online, nothing changes and the process begins again when the next interval expires. But if the VM is reported to have failed, the failover cluster restarts the VM on another node in the cluster.
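The check-and-failover cycle can be simulated in a few lines of Python. This is a simplified sketch under stated assumptions: the `is_alive` callback stands in for the health check, the node list is hypothetical, and the wait between intervals is left out so the example runs instantly.

```python
from typing import Callable, List


def run_health_checks(
    vm_name: str,
    nodes: List[str],
    is_alive: Callable[[str, str], bool],
    cycles: int,
) -> List[str]:
    """Simulate `cycles` check intervals; return the node hosting the VM after each."""
    current = nodes[0]
    history = []
    for _ in range(cycles):
        if not is_alive(vm_name, current):
            # The VM was reported as failed: restart it on the next node in the list.
            current = nodes[(nodes.index(current) + 1) % len(nodes)]
        history.append(current)
    return history


# Hypothetical scenario: node1 has failed, node2 is healthy.
failed_nodes = {"node1"}
history = run_health_checks(
    "vm01",
    ["node1", "node2"],
    lambda vm, node: node not in failed_nodes,
    cycles=3,
)
print(history)
```

The first check finds the VM dead on node1 and moves it to node2. Every later check succeeds, so the VM stays where it is.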
A failover cluster can handle or support only a limited number of failures. The user defines this limit through the Quorum configuration settings.
Failover Cluster Quorum
With this configuration, users can define the number of failures that the cluster can handle within a certain period of time. If this threshold is crossed, the failed node remains in the failed state and the cluster stops working.
The most common problem that the Quorum solves is the split-brain situation. In a two-node cluster with a network problem, communication between the nodes goes down. Something has to prevent each node from taking ownership of the disks on its side and operating independently. Quorum does this.
This is done with a simple voting algorithm. Each node has a vote, and the vote counts as long as the node is online. As long as the number of votes is more than half of the number of nodes in the cluster, the cluster keeps working. If it drops below that threshold, the cluster stops.
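The majority rule is easy to express in code. The following Python sketch assumes one vote per voter and uses a made-up function name, `has_quorum`. The three-vote case assumes an extra tie-breaking vote, which is not part of the plain node count described above.

```python
def has_quorum(online_votes: int, total_votes: int) -> bool:
    """Return True only if the online votes are more than half of all votes."""
    if total_votes <= 0 or not 0 <= online_votes <= total_votes:
        raise ValueError("votes must satisfy 0 <= online_votes <= total_votes, total_votes > 0")
    return 2 * online_votes > total_votes


# Two-node split-brain: each side sees only its own vote, so neither may continue.
print("one of two votes:", has_quorum(1, 2))

# Assumption: a third, tie-breaking vote exists, so the side holding two votes continues.
print("two of three votes:", has_quorum(2, 3))
print("one of three votes:", has_quorum(1, 3))
```

Because at most one partition can ever hold more than half of the votes, two isolated halves can never both believe they own the disks.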
Everything discussed so far concerns high availability for the Hyper-V server. But what happens if it is the storage (instead of the VM) that goes down? Cluster Shared Volume makes the difference here.
Cluster Shared Volume
Within the same group, each node can be given access to a disk or a pool of disks as if it were a logical disk on the server. Every node in the cluster can connect to the storage volume simultaneously. This common storage location for the VM disk and machine configuration can be passed to another node in the event of a failure, without manually mounting a volume or copying files.
The possibility of data conflicts or corruption arising from simultaneous connections to the same volume is handled by the Coordinator Node and the Quorum Disk (similar to the quorum discussed above).
The Coordinator Node, on the other hand, is just a cluster node that is responsible for coordinating file access. Although this responsibility is not exercised often, it does allow faster copying of VHD files to the LUN compared with manual copying.
Vembu BDR Suite can back up your cluster shared volumes so that your business-critical data stays available. Support for Hyper-V Cluster will be available in the upcoming release.