Rebuild / resynchronization time is one of the things I have seen jump up and bite customers on the ankle over and over. When you distribute your storage across multiple hosts in a hyperconverged cluster, the arithmetic and physics of migrating or copying data between hosts play a critical role in the daily operations of that cluster. Putting a host in maintenance mode and recovering from a failure both create circumstances that may require the cluster to move or copy large amounts of data between hosts, and the length of time it takes to do so can have a material impact on the perceived efficiency or supportability of the solution.
The following recommendations are intended to increase both the performance and the resilience of the target vSAN cluster, either by decreasing resynchronization times (and, as a result, the time required to enter maintenance mode) or by directly reducing the amount of data that must be migrated when vSphere hosts enter maintenance mode. Any of the changes suggested below should have a positive effect on the target environment; however, testing is necessary to understand the exact magnitude of the impact.
See the following VMware KB article for official recommendations:
Heavy resync traffic may cause VM IO performance degradation (2150101)
Increasing Data Ingest Rate
One factor that can hamper rebuild or resynchronization is a bottleneck in how quickly a host can ingest data. Incoming data lands first on the vSAN caching tier, and vSAN governs ingest through its congestion thresholds. If resynchronization activity fills the cache tier to the point where it may begin to impact actual VM I/O, vSAN will throttle resynchronization traffic between hosts in order to preserve production workload I/O performance (see the article above for a brief discussion of this topic). This is handled slightly differently across vSAN versions, and with vSAN 6.6 the administrator may manually set a resynchronization throttle. Whatever version you are running, configuring your cluster to ingest data as quickly as possible will benefit both regular VM I/O and rebuild/resync times.
Changes to Increase Data Ingest Rates
Increase the Size and Speed of Caching Devices
When you select larger / faster SSDs for the caching device(s), vSAN can ingest more data before reaching congestion thresholds. The larger the caching device, the more data the disk group can ingest: a 200 GB caching drive will be able to ingest more than a 100 GB caching drive, a 400 GB drive more than a 200 GB drive, and so on. Note that vSAN allocates up to 600 GB of the caching device for ingesting writes, so if your caching drive is already 600 GB or larger, a bigger drive won’t bring any additional performance benefit (see vSAN Write Caching on StorageHub). However, larger drives will increase the lifespan of the caching device.
Increase Disk Groups per Host
Additional disk groups per host mean additional write cache and additional capacity drives per host, both of which will result in better resynchronization times.
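To make the arithmetic concrete, here is a minimal sketch that combines the two ideas above: the 600 GB write buffer ceiling per caching device and the number of disk groups per host. The function name and the sample configurations are hypothetical, and the model deliberately ignores drive speed and congestion behavior, so treat it as a back-of-the-envelope aid rather than a sizing tool.

```python
# Illustrative model only: per the article, vSAN uses at most 600 GB of each
# caching device for ingesting writes. Everything else here is an assumption.
WRITE_BUFFER_CAP_GB = 600


def host_write_buffer_gb(cache_device_gb: float, disk_groups: int) -> float:
    """Return the total usable write buffer on one host, in GB.

    Assumes one caching device per disk group and identical disk groups.
    """
    if cache_device_gb < 0 or disk_groups < 0:
        raise ValueError("sizes and counts must be non-negative")
    per_group = min(cache_device_gb, WRITE_BUFFER_CAP_GB)
    return disk_groups * per_group


if __name__ == "__main__":
    # Hypothetical host configurations: (cache device size in GB, disk groups)
    for cache_gb, groups in [(200, 1), (400, 2), (800, 2), (600, 3)]:
        total = host_write_buffer_gb(cache_gb, groups)
        print(f"{groups} x {cache_gb} GB cache -> {total:.0f} GB write buffer")
```

In this model, going from one 800 GB caching device to two 400 GB devices in separate disk groups raises the usable write buffer, while going from a 600 GB device to an 800 GB device does not.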
Increasing “Cache Drain” Performance
Once the vSAN host has captured the writes / data from the resynchronization on the caching device(s), that data must be moved to the capacity tier to make room for additional writes to the caching device as part of the job. In other words, that data must be ‘drained’ from the caching tier to the capacity tier. There are straightforward techniques to increase the drain rate.
Increase the Number and Speed of Capacity Drives in Each Disk Group
Increasing the number of capacity drives in the group provides more devices to assist in the drain operations, which will result in better resynchronization times. If only a single virtual machine object is being resynchronized, then only a single capacity drive is involved, and additional capacity drives in the group will not help. However, if multiple virtual machine objects are being migrated / resynchronized, then additional capacity drives may in fact help drain those additional objects from the caching tier.
Leverage a Stripe Width as Part of the Storage Policy
Striping objects as part of the storage policy will leverage additional devices per object which should result in better resynchronization times. See vSAN Stripe Width on VMware StorageHub for further discussion on this topic.
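The reasoning in the last two sections can be sketched as a simple upper bound on how many capacity drives can take part in draining the cache. This is a deliberately simplified model, not how vSAN actually places components: it assumes every resynchronizing object lands on this one disk group and that each stripe of an object sits on a different capacity drive. The function name is hypothetical.

```python
def max_active_capacity_drives(drives_in_group: int,
                               resyncing_objects: int,
                               stripe_width: int = 1) -> int:
    """Upper bound on capacity drives that can help drain the cache tier.

    Simplifying assumptions (not vSAN's real placement logic):
      * every resynchronizing object lands on this disk group
      * each stripe of an object is written to a distinct capacity drive
    """
    if min(drives_in_group, resyncing_objects) < 0 or stripe_width < 1:
        raise ValueError("counts must be non-negative and stripe width >= 1")
    return min(drives_in_group, resyncing_objects * stripe_width)


if __name__ == "__main__":
    # One object, no striping: extra capacity drives sit idle.
    print(max_active_capacity_drives(drives_in_group=5, resyncing_objects=1))
    # Same object with a stripe width of 3: more drives can participate.
    print(max_active_capacity_drives(5, 1, stripe_width=3))
    # Many objects: the number of drives in the group becomes the limit.
    print(max_active_capacity_drives(5, 12))
```

Under these assumptions, a single unstriped object keeps one drive busy no matter how many drives the group has, which is why either more concurrent objects or a wider stripe is needed before extra capacity drives pay off.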
Reduce the Amount of Data to Be Migrated
This may be cheating a little bit, but if the goal is to reduce the time it takes to go into maintenance mode, in addition to the techniques above, another option is to examine your approach to maintenance mode.
It is understandable that you may want to decrease the risk of data loss while in maintenance mode (MM)… After all, if you choose either “Ensure Accessibility” or “No data migration” then you may be choosing to deliberately take a copy of your data offline (see vSAN Maintenance Mode Options on VMware StorageHub). Therefore, many customers may choose “Full Data Migration” to be sure that there is no risk of data loss during maintenance operations.
However, selecting “Full Data Migration” forces the host to fully evacuate all data to other hosts in the cluster before it can successfully enter MM. If this is your choice, then the options discussed above should help increase the speed of data migration and therefore decrease the time required to enter MM.
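As a rough illustration of why full evacuation can take so long, the time to enter MM is bounded below by the amount of data on the host divided by the sustained resynchronization rate. The sketch below uses decimal units and entirely hypothetical figures for capacity and throughput; real rates depend on the cache, capacity, and network factors discussed above and should be measured in your own environment.

```python
def estimate_evacuation_hours(data_tb: float, resync_rate_mb_s: float) -> float:
    """Estimate hours needed to move `data_tb` terabytes off a host.

    Uses decimal units (1 TB = 1,000,000 MB) and assumes a constant
    sustained rate, which real clusters will not deliver exactly.
    """
    if data_tb < 0:
        raise ValueError("data_tb must be non-negative")
    if resync_rate_mb_s <= 0:
        raise ValueError("resync_rate_mb_s must be positive")
    seconds = (data_tb * 1_000_000) / resync_rate_mb_s
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical host holding 10 TB, at three assumed sustained rates.
    for rate in (250, 500, 1000):
        hours = estimate_evacuation_hours(10, rate)
        print(f"10 TB at {rate} MB/s -> about {hours:.1f} hours")
```

Doubling the sustained rate halves the estimate, and so does halving the amount of data that has to move, which is the motivation for the approach described next.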
It is possible, though, to have your cake and eat it too! Consider instead configuring the cluster to support a Failures to Tolerate (FTT) setting of 2…
5-Host cluster (FTT=2/RAID1)
Running a cluster with 5 hosts will give you the ability to utilize a policy allowing FTT=2 with RAID 1. This will result in 3 full copies of data for each virtual machine attached to the policy. Leveraging such a policy will allow you to put a host in maintenance mode while choosing “Ensure Accessibility” rather than “Full Data Migration.” Because two copies of the data remain online while the host is in maintenance mode, the data can still tolerate one additional failure. This effectively reduces the amount of data required to be migrated as a result of entering maintenance mode. Note that full evacuation with FTT=2 / RAID 1 would require a cluster of 6 hosts.
6-Host Cluster (FTT=2/RAID6)
Running a cluster with 6 hosts will give you the ability to utilize a policy allowing FTT=2 with RAID 6. This will result in 1 full copy of data (plus 2 parity segments for each stripe) for each virtual machine attached to the policy. Leveraging such a policy will allow you to put a host in maintenance mode while choosing “Ensure Accessibility” rather than “Full Data Migration.” This effectively reduces the amount of data required to be migrated as a result of entering maintenance mode. Also note that full evacuation with FTT=2 / RAID 6 would require a cluster of 7 hosts.
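The host counts in the two examples above follow a simple pattern, sketched below. RAID 1 needs 2 × FTT + 1 hosts (copies plus witnesses), RAID 6 with FTT=2 needs 6 hosts (4 data + 2 parity segments), and a full evacuation needs one more host than the policy minimum so the evacuated components have somewhere to go. The function names are hypothetical, and only the policy combinations mentioned in this article are covered.

```python
def min_hosts(ftt: int, raid: str) -> int:
    """Minimum hosts needed to satisfy a storage policy."""
    if raid == "RAID1" and ftt >= 1:
        return 2 * ftt + 1          # ftt+1 copies plus ftt witness components
    if raid == "RAID6" and ftt == 2:
        return 6                    # 4 data + 2 parity segments per stripe
    raise ValueError(f"unsupported combination in this sketch: FTT={ftt}, {raid}")


def hosts_for_full_evacuation(ftt: int, raid: str) -> int:
    """Hosts needed to fully evacuate one host and stay policy-compliant."""
    return min_hosts(ftt, raid) + 1


def capacity_multiplier(ftt: int, raid: str) -> float:
    """Raw capacity consumed per unit of VM data."""
    if raid == "RAID1" and ftt >= 1:
        return float(ftt + 1)       # one full copy per replica
    if raid == "RAID6" and ftt == 2:
        return 6 / 4                # 4 data + 2 parity for every 4 data
    raise ValueError(f"unsupported combination in this sketch: FTT={ftt}, {raid}")


if __name__ == "__main__":
    for ftt, raid in [(2, "RAID1"), (2, "RAID6")]:
        print(f"FTT={ftt}/{raid}: min hosts {min_hosts(ftt, raid)}, "
              f"full evacuation {hosts_for_full_evacuation(ftt, raid)}, "
              f"capacity x{capacity_multiplier(ftt, raid)}")
```

The capacity multiplier is included because it is the main trade-off between the two options: FTT=2 with RAID 1 consumes three times the VM data size, while FTT=2 with RAID 6 consumes one and a half times.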
Of course, you can continue to maintain policies with FTT=1 and choose “Ensure Accessibility” when putting your host in maintenance mode. This will still effectively reduce the amount of data required to be migrated as a result of entering maintenance mode. However, some data will have no redundancy while the host is offline, so VMware highly encourages the use of appropriate backup software to ensure the ability to recover in the event of an unrecoverable hardware failure.
As a follow-on discussion, Pete Koehler (VMware Storage and Availability Technical Marketing) delves more deeply into some of the factors around the cache tier that may impact performance, especially when considering the use of newer SSD technologies – https://blogs.vmware.com/virtualblocks/2019/10/01/write-buffer-sizing-vsan/.