When vSphere 6.5 was released, a number of new options were introduced in the DRS configuration page. These options are very interesting but easy to miss due to old habits when configuring a cluster. You will find them in Cluster > Configure > vSphere DRS.
VM Distribution

The purpose of DRS has always been to ensure that virtual machines receive the resources they are entitled to. Imagine a cluster where 15 VMs run on one ESXi host and none on the others. If those 15 VMs are getting the resources they need, DRS will not vMotion any of them to the available capacity in the cluster. Many customers were frustrated by this: one server could wear out more quickly than the others, and the failure domain would be larger (losing that one host would take down all 15 VMs). VMware therefore implemented the VM Distribution option, which makes DRS spread virtual machines evenly across the hosts. Note that DRS will always prioritize resource allocation over VM distribution.
This setting will be a great addition for customers juggling between the Migration Threshold and the size of their failure domains.
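To make the behavior concrete, here is a minimal sketch of what evening out VM counts looks like. This is an illustration of the concept only, not VMware's actual DRS algorithm; the host names and VM counts are invented for the example.

```python
# Naive even-count balancer: repeatedly move one VM from the busiest
# host to the emptiest host until counts differ by at most one.
# Illustration only -- not how DRS actually computes migrations.
def rebalance(vm_counts: dict[str, int]) -> list[tuple[str, str]]:
    """Return a list of (source, destination) moves that evens out VM counts."""
    moves = []
    counts = dict(vm_counts)
    while max(counts.values()) - min(counts.values()) > 1:
        src = max(counts, key=counts.get)  # most loaded host
        dst = min(counts, key=counts.get)  # least loaded host
        counts[src] -= 1
        counts[dst] += 1
        moves.append((src, dst))
    return moves

# The 15-VMs-on-one-host scenario from above, across three hosts:
print(rebalance({"esxi-01": 15, "esxi-02": 0, "esxi-03": 0}))
```

Running this on the scenario above produces ten moves, ending with five VMs per host.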
Memory Metric for Load Balancing
DRS has always used the often misunderstood Active Memory metric, which relies on a sampling mechanism to estimate the number of memory pages touched during a time interval (more info here). Using this metric is fine, especially if you over-commit memory. However, if like many customers you prefer to stay on the safe side and provision less memory than is available on the host, you are better off using consumed memory (roughly the allocated memory) for a better distribution of the provisioned memory across the hosts. It saves you from having a VM with a large amount of data cached in RAM, but not touched during the sampling interval, moved to a host with less memory available, resulting in memory contention. The "DRS active memory paradox", as I call it (I just came up with that).
Note that you should not enable this setting if you over-commit memory, as you may run into serious memory contention issues.
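The gap between the two metrics is easy to see with numbers. The VM names and figures below are invented purely to illustrate the paradox: a VM with a large warm cache looks almost idle by active memory but nearly fills the host by consumed memory.

```python
# Hypothetical per-VM memory figures (invented for illustration).
# "db-cache" keeps a large amount of data cached in RAM that is rarely
# touched within the sampling interval, so its active memory is tiny.
vms = {
    "db-cache": {"consumed_gb": 62, "active_gb": 4},
    "web-01":   {"consumed_gb": 8,  "active_gb": 6},
}

def host_load(vms: dict, metric: str) -> int:
    """Sum a given memory metric across all VMs on a host."""
    return sum(vm[metric] for vm in vms.values())

print(host_load(vms, "active_gb"))    # 10 -> host looks nearly idle to DRS
print(host_load(vms, "consumed_gb"))  # 70 -> host is in fact almost full
```

Balancing on active memory, DRS would happily place more VMs on this host; balancing on consumed memory, it would not.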
CPU Over-Commitment

When sizing hosts or troubleshooting CPU performance issues, one of the first things to look at is the total number of vCPUs (virtual cores) provisioned on a host compared to the number of physical cores available on it (Hyper-Threading excluded). For instance, if you have 10 VMs with 4 vCPUs each running on a 10-core host, the ratio is 4:1 (4 to 1), or 400%.
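The arithmetic above can be wrapped in a small helper. This is just the ratio calculation from the example, not anything vSphere exposes:

```python
def cpu_overcommit_ratio(vcpus_per_vm: list[int], physical_cores: int) -> float:
    """Ratio of provisioned vCPUs to physical cores, as a percentage."""
    return sum(vcpus_per_vm) / physical_cores * 100

# 10 VMs with 4 vCPUs each on a 10-core host (Hyper-Threading excluded):
print(cpu_overcommit_ratio([4] * 10, 10))  # 400.0 -> a 4:1 ratio
```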
The new CPU over-commitment setting lets you set a maximum value for this ratio at the cluster level, up to 500% (5:1). Once the threshold is reached, you can no longer power on virtual machines in the cluster. In a way, it acts as an addition to Admission Control.
I find the maximum value of 500% quite conservative, but it will very much depend on the type of workloads running in the cluster.
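The power-on check boils down to a simple comparison. A minimal sketch of the logic, assuming the cluster tracks its current vCPU total; this is not vSphere's implementation, just the idea behind it:

```python
def can_power_on(current_vcpus: int, new_vm_vcpus: int,
                 physical_cores: int, max_percent: int = 500) -> bool:
    """Refuse a power-on that would push the cluster past the configured ratio."""
    ratio = (current_vcpus + new_vm_vcpus) / physical_cores * 100
    return ratio <= max_percent

# 48 vCPUs already provisioned on 10 cores; adding a 4-vCPU VM hits 520%:
print(can_power_on(48, 4, 10))  # False -> power-on denied
# 40 vCPUs provisioned; an 8-vCPU VM lands exactly at 480%:
print(can_power_on(40, 8, 10))  # True -> power-on allowed
```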