Failover Clustering in Windows Server

Microsoft’s Hyper-V is a highly available, resilient hypervisor that lets enterprise datacenters run production workloads with little downtime and a great deal of flexibility. That availability and flexibility come from the underlying clustering technology: Hyper-V clusters are built on top of Windows Server Failover Clustering, which provides the clustering and migration capabilities that keep production workloads resilient, mobile, and highly available. It also brings other benefits, such as simpler host maintenance and protection against data corruption when cluster nodes lose contact with one another.

In this post, we will take a look at Hyper-V failover cluster basics and discuss the Failover Clustering concepts you need to understand to design a well-thought-out Hyper-V cluster configuration.

Additionally, we will look at the benefits of Windows Server Failover Clustering as they relate to Hyper-V environments.

What is Windows Server Failover Clustering?

Windows Server Failover Clustering is the mechanism that makes Windows roles, features, and applications highly available across multiple Windows Server hosts.

Why is making roles, features, and other applications highly available across multiple hosts important?

Clustering helps ensure that workloads are resilient in the event of a hardware failure. This matters especially for virtualized workloads, where many virtual machines often run on a single host. If that host fails, it is not a single workload that goes down: with a dense configuration, many production workloads can be taken offline at once.

In the context of Hyper-V, a Windows Server Failover Cluster brings multiple physical Hyper-V hosts together into a “cluster” of hosts. This aggregates the hosts’ CPU and memory resources and attaches them to shared storage, which in turn makes it easy to migrate virtual machines between the Hyper-V hosts in the cluster. The shared storage can be a traditional SAN or, starting with Windows Server 2016, Storage Spaces Direct.

Because every host can reach the same shared storage, a virtual machine can be restarted on a different host in the Windows Server Failover Cluster if the physical host it was running on fails. This allows business-critical workloads to be brought back up very quickly even if a host in the cluster has failed.
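The restart behavior described above can be sketched as a toy model. This is only an illustration: the host and VM names are hypothetical, and the "fewest VMs" rule is an assumed stand-in for the CPU and memory-aware placement a real cluster performs.

```python
from typing import Dict, List


def restart_after_failure(
    placement: Dict[str, List[str]], failed: str
) -> Dict[str, List[str]]:
    """Return a new placement with the failed host's VMs restarted elsewhere.

    Each orphaned VM goes to the surviving host that currently runs the
    fewest VMs (ties broken by host name). This is an assumed, simplified
    placement rule, not the cluster's real one.
    """
    survivors = {h: list(vms) for h, vms in placement.items() if h != failed}
    if not survivors:
        raise RuntimeError("no surviving hosts to restart VMs on")
    for vm in placement.get(failed, []):
        target = min(survivors, key=lambda h: (len(survivors[h]), h))
        survivors[target].append(vm)
    return survivors


# Hypothetical three-host cluster; every host sees the same shared storage,
# so any survivor can start any VM.
cluster = {
    "hv01": ["sql01", "web01"],
    "hv02": ["web02"],
    "hv03": ["dc01", "file01"],
}
after = restart_after_failure(cluster, "hv01")
for host, vms in after.items():
    print(host, vms)
```

The point of the sketch is that nothing about the virtual machines' disks has to move: only the ownership of the running workload changes.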

Windows Server Failover Clustering has other benefits for Hyper-V workloads that are important to consider. In addition to keeping virtual machines highly available when hosts fail, it supports planned maintenance such as patching Hyper-V hosts. Administrators can migrate the virtual machines off a host, apply patches, and then move the virtual machines back onto it. Cluster-Aware Updating can automate this process.

Windows Server Failover Clustering also protects against corruption if the cluster hosts become separated from one another in the classic “split-brain” scenario. If two hosts attempt to write data to the same virtual disk, corruption can occur.

Windows Server Failover Clusters have a mechanism called quorum that prevents separated Hyper-V hosts in the cluster from inadvertently corrupting data. Windows Server 2016 introduced a new type of quorum witness, the cloud witness, which can be used alongside the longstanding quorum mechanisms.

Windows Server Failover Clustering Basics

Now that we know what Windows Server Failover Clustering is and why it is important, let’s look at the basics to understand a bit more deeply how Failover Clustering in Windows Server works.

Windows Server Failover Clustering is a feature rather than a role, because it simply helps Windows Servers accomplish their primary role.

It is included in the Standard edition of Windows Server as well as the Datacenter edition, and there is no difference between the two editions in Failover Clustering features and functionality. A Windows Server Failover Cluster is composed of two or more nodes that offer resources to the cluster as a whole. Windows Server 2016 Failover Clusters allow a maximum of 64 nodes per cluster and can run a total of 8,000 virtual machines per cluster. Although this post focuses on Hyper-V, Windows Server Failover Clusters can host many different types of services, including file servers, print servers, DHCP, Exchange, and SQL Server, to name a few.

As already mentioned, one of the primary benefits of Windows Server Failover Clusters is the ability to prevent corruption when cluster nodes become isolated from the rest of the cluster. Cluster nodes communicate over the cluster network to determine whether the rest of the cluster is reachable. The cluster then performs a voting process of sorts to determine which nodes hold the node majority, that is, which nodes can reach the majority of the cluster resources.

Quorum is the mechanism that determines which cluster nodes hold the majority and therefore have the winning vote when it comes to assuming ownership of resources, such as the virtual machine data in a Hyper-V cluster.

This becomes glaringly important in a cluster with an even number of nodes, such as a four-node cluster. If a network split leaves two nodes on each side, each able to see only its neighbor, neither side has a majority. Starting with Windows Server 2012, each node has a vote in the quorum voting process by default. A disk or file share witness provides a tie-breaking vote: one side of the partitioned cluster claims the witness and, with it, the majority. The hosts that claim the witness lock it (for a disk witness, by placing a SCSI reservation on the disk), which prevents the other side from obtaining the majority quorum vote. With an odd number of nodes, one side of a cluster partitioned in two will always have a majority, so a disk or file share witness is not needed.
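The vote arithmetic above can be checked with a few lines of Python. This is a simplified sketch under stated assumptions: every node has exactly one vote, the witness has one vote, and the cluster splits into exactly two sides. The function names are hypothetical, and features that adjust votes at runtime are ignored.

```python
from typing import Dict, Optional


def has_quorum(votes_held: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return votes_held > total_votes // 2


def partition_outcome(
    side_a_nodes: int, side_b_nodes: int, witness_owner: Optional[str] = None
) -> Dict[str, bool]:
    """Report which side of a two-way split keeps quorum.

    witness_owner is "a" or "b" for the side that claimed the witness,
    or None when no witness is configured.
    """
    if witness_owner not in (None, "a", "b"):
        raise ValueError("witness_owner must be 'a', 'b', or None")
    total = side_a_nodes + side_b_nodes + (1 if witness_owner else 0)
    votes_a = side_a_nodes + (1 if witness_owner == "a" else 0)
    votes_b = side_b_nodes + (1 if witness_owner == "b" else 0)
    return {"a": has_quorum(votes_a, total), "b": has_quorum(votes_b, total)}


# Four nodes split 2/2 with no witness: neither side has a majority.
print(partition_outcome(2, 2))
# Same split, but side "a" claimed the witness: 3 of 5 votes wins.
print(partition_outcome(2, 2, witness_owner="a"))
# Five nodes split 3/2: the larger side wins without any witness.
print(partition_outcome(3, 2))
```

Note that at most one side can ever return True, which is exactly the property that prevents two partitions from writing to the same virtual disk.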

Quorum was enhanced in Windows Server 2016 with the addition of the cloud witness, which uses an Azure storage account and its reachability as the witness vote. A “0-byte” blob file is created in the Azure storage account for each cluster that uses the account.

Windows Server Failover Clusters Hyper-V Specific Considerations

Using Windows Server Failover Clusters to host the Hyper-V role opens up many powerful options for running production, business-critical virtual machines. There are a few technologies to be aware of that are particularly relevant to Hyper-V workloads:

Cluster Shared Volumes
ReFS
Storage Spaces Direct

Cluster Shared Volumes

Cluster Shared Volumes, or CSVs, benefit Hyper-V virtual machines by allowing more than one Hyper-V host to have read/write access to the volume or LUN where virtual machines are stored. In legacy versions of Hyper-V, before CSVs were implemented, only one Windows Server Failover Cluster host could have read/write access to a specific volume at a time. This complicated high availability and other mechanisms that are crucial to running business-critical virtual machines on a Windows Server Failover Cluster.

Cluster Shared Volumes solved this problem by allowing multiple nodes in a failover cluster to have simultaneous read/write access to the same LUN provisioned with NTFS. Because every Hyper-V host is already connected to the storage LUNs, a surviving host only needs to take over the compute and memory for a virtual machine when a node in the Windows Server Failover Cluster fails, which it can do quickly.

ReFS

ReFS is short for “Resilient File System”. It is the newest file system from Microsoft, and many speculate that it will replace NTFS. ReFS touts many advantages for Hyper-V environments. It is resilient by design: errors are corrected on the fly, so there is no need for chkdsk.

However, one of the most powerful features of ReFS for Hyper-V is its block cloning technology. With block cloning, the file system merely changes metadata rather than moving actual blocks. Operations that are I/O intensive on NTFS, such as zeroing out a disk or creating and merging checkpoints, are therefore almost instantaneous on ReFS.
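To make the "metadata instead of blocks" idea concrete, here is a toy in-memory model of block cloning with reference counts and copy-on-write. It is a conceptual sketch only: the `ToyVolume` class and the file names are hypothetical, and it does not describe how ReFS is implemented internally.

```python
from typing import Dict, List


class ToyVolume:
    """Toy volume where files are lists of references to shared blocks."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}     # block id -> data
        self.refs: Dict[int, int] = {}         # block id -> reference count
        self.files: Dict[str, List[int]] = {}  # file name -> block ids
        self.bytes_written = 0                 # data bytes physically written
        self._next_id = 0

    def _alloc(self, data: bytes) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.refs[block_id] = 1
        self.bytes_written += len(data)
        return block_id

    def create(self, name: str, chunks: List[bytes]) -> None:
        self.files[name] = [self._alloc(chunk) for chunk in chunks]

    def clone(self, src: str, dst: str) -> None:
        """Clone by copying block references only; no data is written."""
        block_ids = list(self.files[src])
        for block_id in block_ids:
            self.refs[block_id] += 1
        self.files[dst] = block_ids

    def write(self, name: str, index: int, data: bytes) -> None:
        """Overwrite one block, copying first if the block is shared."""
        old = self.files[name][index]
        if self.refs[old] > 1:
            self.refs[old] -= 1
            self.files[name][index] = self._alloc(data)
        else:
            self.blocks[old] = data
            self.bytes_written += len(data)

    def read(self, name: str) -> bytes:
        return b"".join(self.blocks[b] for b in self.files[name])


vol = ToyVolume()
vol.create("base.vhdx", [b"AAAA", b"BBBB"])
written_before = vol.bytes_written
vol.clone("base.vhdx", "clone.vhdx")
print(vol.bytes_written == written_before)  # the clone wrote no data blocks
vol.write("clone.vhdx", 0, b"CCCC")
print(vol.read("base.vhdx"), vol.read("clone.vhdx"))
```

The clone costs only a list copy and a few counter updates, regardless of file size, which is why checkpoint-style operations built on this idea complete so quickly.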

However, ReFS should not be used with SAN/NFS configurations. In such a configuration the storage operates in I/O redirected mode, where all I/O is sent to the coordinator node, which can lead to severe performance issues. ReFS is recommended with Storage Spaces Direct, which, by using RDMA network adapters, does not see the performance hit that SAN/NFS configurations do.

Storage Spaces Direct

Storage Spaces Direct is Microsoft’s software-defined storage solution. It creates shared storage from locally attached drives on the Windows Server Failover Cluster nodes. It was introduced with Windows Server 2016 and supports two deployment configurations:

Converged
Hyper-converged

With Storage Spaces Direct, you can use caching, storage tiers, and erasure coding to create hardware-abstracted storage constructs that run Hyper-V virtual machines with scale and performance, more cheaply and efficiently than traditional SAN storage.
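For readers new to erasure coding, the following sketch shows the simplest possible form of it: a single XOR parity block that can rebuild any one lost block in a stripe. It is meant only to illustrate the principle. It is not the scheme Storage Spaces Direct uses, the function names are hypothetical, and all blocks are assumed to be the same length.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks: List[bytes]) -> List[bytes]:
    """Return the stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild a stripe in which at most one block (None) is missing."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can rebuild only one lost block")
    repaired = list(stripe)
    if missing:
        present = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(present)
    return repaired


# Imagine each block living on a drive in a different cluster node.
data = [b"disk", b"node", b"data"]
stripe = encode(data)

damaged: List[Optional[bytes]] = list(stripe)
damaged[1] = None  # one drive or node is lost
print(recover(damaged)[:3] == data)
```

Compared with keeping full copies of every block, parity stores less redundant data for the same single-failure protection, which is the efficiency argument behind erasure coding in general.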

Concluding Thoughts

Windows Server Failover Clusters provide the underlying technology that allows the Hyper-V role to be hosted with high availability and redundancy. A basic point to understand with Windows Server Failover Clustering is quorum, which prevents corruption caused by partitioned clusters, the traditional “split-brain” scenario. Several technologies relate specifically to Hyper-V workloads, including Cluster Shared Volumes (CSVs), the ReFS file system, and Storage Spaces Direct, introduced with Windows Server 2016. Understanding the basics of Windows Server Failover Clusters with Hyper-V is crucial to architecting a performant and stable Hyper-V environment.

About the Author: Brandon Lee