DR sites commonly run on kit that has other primary uses, or on older, re-used ex-production hardware. Typically, your DR site is going to offer lower performance, and that matters most for storage. While most companies are willing to take a performance and productivity hit in DR, recovery time objectives (RTOs) are typically something they aren’t willing to miss – when every hour of downtime costs money, getting those systems back online is critical. When your storage offers only limited performance, managing the load an SRM failover generates is key to a successful recovery.
Stormy Weather
SRM generates a lot of simultaneous disk I/O in two ways early on in a recovery plan:
- Suspending non-critical VMs at the recovery site – if your DR site doubles as your Dev/Test environment and you need to suspend machines, they all get suspended at once and write their state information to disk. This loads up the storage right from the get-go.
- If you have SRM update the IP addresses of your servers, it does so by booting the servers, changing the IP addresses with VMware Tools and shutting them down. Despite appearances, it does this for all the servers simultaneously by design. Welcome to the boot storm!
While SRM provides a great amount of flexibility around the boot order of your servers using dependencies and priority groups, these are not obeyed for the IP change part of the workflow. Rest assured that once the IP customisation is complete, your servers will come up in the order you specify, so your database applications will work if you set the priorities correctly.
Sheltering your array from the storm
Unfortunately, if you have to suspend a lot of machines, you just have to take the hit. At this point in time, you can’t control how many machines get suspended at a time or what the timeout is. By default the timeout is 900 seconds (15 mins); after that the plan throws an error for any remaining servers and moves on, although those servers will continue to suspend in the background. On deployments with really slow storage, we typically add a prompt after the suspend step so the admin executing the plan can wait until all machines finish, or clear the prompt and move on more quickly if appropriate.
When it comes to the real boot storm, with all the IPs being customised at once, the default SRM setup on a slow storage array will see a couple of machines succeed at most before the rest fail, effectively ending your recovery. You’ll be left with all your machines recovered in DR, powered off and with the wrong IP addresses. Thankfully there are some steps we can take to mitigate the impact of the boot storm on our recovery times. In the advanced settings of SRM we can alter the following to tune how SRM reacts to delayed power and IP customisation operations:
- IP Customisation timeout – this can be increased to allow longer before a customisation is timed out and an error occurs, halting recovery of the affected VMs
- Power Off Timeout – This has the same effect for timeouts on power off operations
- Power On Timeout – As you’d guess, the same effect for timeouts on power on operations
The real key to avoiding the boot storm lives in the vmware-dr.xml file:
- MaxBootAndShutdownOpsPerCluster – set in the vmware-dr.xml config file, this caps the number of machines that can be booted or powered off simultaneously on each cluster
By limiting how many VMs can be powered on at once, and allowing longer timeouts, we can tune SRM to complete the IP customisation operations with less load on the storage and without timing out and failing to recover machines.
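To make that concrete, here is a minimal sketch of what the entry could look like inside vmware-dr.xml. The element name is assumed to match the setting name above, and the value of 5 is just an example starting point for slow storage – check the name and placement against the copy of the file shipped with your SRM version, take a backup before editing, and restart the SRM service afterwards so the change is picked up.

  <!-- Inside the existing config root element of vmware-dr.xml.
       Element name assumed from the setting above: caps simultaneous
       power-on/power-off operations per cluster so the IP customisation
       boot storm doesn't swamp slow DR storage. -->
  <MaxBootAndShutdownOpsPerCluster>5</MaxBootAndShutdownOpsPerCluster>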
Proof is in the pudding
In a recent deployment we found that with just a timeout increase, after an hour almost all operations would time out and the recoveries would fail, leaving us with a long manual recovery. Once we had the settings above tuned well, the entire IP customisation would complete inside an hour with no failures. If it ever comes time for a real failover, I know which option I’d prefer!