VMware Cloud (VMC) on AWS has grown in popularity over the past year, with many customers looking at it as a solution for their disaster recovery needs. With the addition of VMware Cloud Disaster Recovery (VCDR) to the DR portfolio, many customers are leveraging both VCDR and VMware Site Recovery (VSR) to protect their applications from on-premises to the cloud. But what about the customers that have fully migrated their datacenters into VMC on AWS? Now that they are fully in the cloud, is the job of disaster recovery and data protection over, or has it just changed? Unfortunately, many customers realize all too late that cloud-to-cloud DR is just as important as on-prem-to-cloud DR was prior to their migration. As security professionals aptly put it years ago, the cloud is just someone else’s datacenter, and those datacenters still have connectivity, power and natural disaster risks that the customer needs to plan for. AWS has put quite a bit of effort into making its global infrastructure and availability zones (AZs) as resilient as possible, but it still only guarantees an SLA of 99.9% for each AZ. Translated, this means each AZ could be offline for over 8 hours in a year and AWS would not be in violation of its SLA. We, as consumers of AWS resources, have to plan for that level of outage in our architecture designs.
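To make the SLA arithmetic concrete, here is a minimal Python sketch that converts an availability percentage into the downtime it permits. The measurement window is an assumption: the 8-hour figure corresponds to a full year (8,760 hours), and a 30-day month is shown for comparison. The function name is purely illustrative.

```python
def allowed_downtime_hours(sla_percent: float, window_hours: float) -> float:
    """Hours of downtime that still satisfy an availability SLA over a window."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return (1 - sla_percent / 100) * window_hours


HOURS_PER_YEAR = 365 * 24    # assumed window: one non-leap year
HOURS_PER_30_DAYS = 30 * 24  # assumed window: a 30-day month

yearly = allowed_downtime_hours(99.9, HOURS_PER_YEAR)
monthly = allowed_downtime_hours(99.9, HOURS_PER_30_DAYS)

print(f"99.9% over a year:  {yearly:.2f} hours")
print(f"99.9% over 30 days: {monthly * 60:.1f} minutes")
```

Under these assumptions, 99.9% works out to roughly 8.76 hours per year, or about 43 minutes in a 30-day month. Check the provider's SLA terms for the window that actually applies.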
This is one of the reasons VMware introduced stretched clusters to VMC on AWS, allowing an SDDC cluster to span two AZs with data synchronously replicated between them. This technology works well and allows for a zero recovery point objective (RPO) in the event of an AZ failure, but it comes at a higher cost: the number of hosts in a stretched cluster deployment is doubled, with six hosts being the minimum required for the feature.
A stretched cluster also does not protect against threats such as ransomware, since it offers no historical point-in-time recovery options. As a result, customers are often looking for multi-region cloud-to-cloud disaster recovery options that can provide a low RPO and RTO (recovery time objective). This is where VSR comes into play. VSR gives VMC on AWS customers the ability to fail over between SDDCs in separate regions with RPOs as low as five minutes. With the addition of multi-region SDDC groups, the replication traffic traverses the AWS backbone with bandwidth as high as 50 Gbps.
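As a rough sizing aid, the two figures above (a five-minute RPO and up to 50 Gbps) can be combined into a theoretical ceiling on how much changed data could cross the link within a single RPO interval. This is a back-of-the-envelope sketch only: it ignores protocol overhead, host and appliance limits, and competing traffic, so real replication throughput will be lower. The function name is hypothetical.

```python
def max_transfer_gb(bandwidth_gbps: float, rpo_minutes: float) -> float:
    """Upper bound on data (decimal gigabytes) movable in one RPO interval."""
    bits = bandwidth_gbps * 1e9 * rpo_minutes * 60  # link rate times seconds
    return bits / 8 / 1e9                           # bits -> bytes -> GB


# Ceiling for the article's figures: 50 Gbps link, 5-minute RPO.
ceiling_gb = max_transfer_gb(50, 5)
print(f"Theoretical ceiling per RPO interval: {ceiling_gb:.0f} GB")
```

If the aggregate change rate of the protected VMs approaches this ceiling, the five-minute RPO is unlikely to be sustainable in practice.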
The setup for this takes some configuration but is fairly simple. To start, once the SDDCs are deployed in two different AWS regions, we will need to set up some form of private communication, either a site-to-site VPN or AWS Direct Connect, and change the vCenter FQDN from resolving to the public IP to resolving to the private IP. Next, we can head over to the “Add-Ons” tab and select “Activate” under “VMware Site Recovery”. The console will present us with an option to either use the default vCenter extension ID for VSR or create a custom one. A custom extension ID is useful for customers running multiple instances of VSR in their on-prem environment, where each instance needs its own extension ID, but for this scenario, be sure to select the default. The VSR activation takes about 15 minutes, so go grab some coffee while we wait. Once that completes, we can click back to view all SDDCs and head over to SDDC Groups.
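Before activating VSR, it can be worth confirming that the vCenter FQDN now resolves to a private address from the network you are working on. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the `resolver` parameter exists only so the check can be exercised without live DNS.

```python
import ipaddress
import socket


def resolved_addresses(fqdn, resolver=socket.getaddrinfo):
    """Return the set of IP addresses that the FQDN resolves to."""
    results = resolver(fqdn, 443)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address. Drop any IPv6 scope suffix.
    return {info[4][0].split("%")[0] for info in results}


def resolves_privately(fqdn, resolver=socket.getaddrinfo):
    """True only if the FQDN resolves and every address is private."""
    addresses = resolved_addresses(fqdn, resolver)
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
```

Calling `resolves_privately()` with your SDDC's vCenter FQDN should return `True` once the resolution setting has been changed. Note that `is_private` is a coarse check that also accepts ranges such as loopback and link-local, so treat a `True` result as a sanity check rather than proof of correct routing.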
The SDDC Group is the feature that allows us to connect both SDDCs together via VMware Transit Connect and configures gateway firewall rules to link both vCenters together, similar to how we can link an on-prem vCenter with the SDDC vCenter using Hybrid Linked Mode. By leveraging SDDC groups, both SDDC vCenters can be managed under a single pane of glass in vCenter. Select both SDDCs and click “Create”. This process takes 10-15 minutes to complete. Once it is done, open the “vCenter Linking” tab and select “Link All vCenters”. Now that everything is linked together, we can head into each SDDC and set up the necessary firewall rules.
In the network and security tab for the SDDC, we will first want to select the segments page and add a unique compute segment for the workloads that will be running in the SDDC. These segments are automatically advertised between SDDCs, so it’s also important to remove the default segment that gets created with the SDDC. The NSX management and compute firewalls are not stateful, so we will need to create rules allowing traffic inbound to the VSR and vSphere Replication appliances as well as traffic outbound. By default, the SDDC is deployed with management gateway rules allowing vCenter and ESXi to communicate outbound, but we will also need to write a rule allowing our on-prem network to reach vCenter for management.
To start, we can select Inventory -> Groups -> Management Groups and build some groups for the remote site VSR and vSphere Replication appliances. Groups and rules for the remote site vCenter and ESXi were automatically created when we linked the vCenters in the SDDC Group. With the groups created, we can navigate to “Gateway Firewall -> Management Gateway” and start building the firewall rules. The specific rules required for an SDDC-to-SDDC configuration can be found here.
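Because rules are needed in both directions, it is easy to build one side and forget the other. The sketch below models a rule list as plain data and reports which direction is still missing for a given pair of groups. Everything in it is illustrative: the group names and the service label are placeholders, not the actual VMC group names or required ports, and it does not call any NSX or VMC API.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    source: str       # name of the source group
    destination: str  # name of the destination group
    service: str      # placeholder label for the allowed service


def missing_directions(rules, local_group, remote_group, service):
    """Return the inbound/outbound rules that are not yet present."""
    wanted = [
        Rule(remote_group, local_group, service),  # inbound to the local site
        Rule(local_group, remote_group, service),  # outbound to the remote site
    ]
    existing = set(rules)
    return [rule for rule in wanted if rule not in existing]


# Hypothetical example: only the inbound rule has been created so far.
configured = [Rule("remote-vsr", "local-vsr", "vsr-management")]
gaps = missing_directions(configured, "local-vsr", "remote-vsr", "vsr-management")
for rule in gaps:
    print(f"Missing rule: {rule.source} -> {rule.destination} ({rule.service})")
```

The same check can be repeated for the vSphere Replication groups, and run against each SDDC's rule list, since both sites need the matching pair.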
With the firewall rules built, we can now log in to vCenter, navigate to the VSR page and click “Open Site Recovery”. This will open a new tab where we can set up the VSR site pair. When we select “Create Site Pair”, we can enter the vCenter FQDN of the remote site, log in with the Cloud Admin credentials, select VSR and vSphere Replication as available services, and complete the site pair creation. With the site pair created, we can now configure the network, folder and datastore mappings and set up replications, protection groups and recovery plans.
There are some additional items to consider when using cloud-to-cloud DR with VSR. To prevent the recovered VMs from needing to change their IP addresses, we can extend networks between the two SDDCs with HCX. Not only does this allow VMs to retain their IP addresses, which speeds up the recovery, but the setup for the Layer 2 extension within HCX is easy. In the event of a full site failure, we would simply need to find the extended segments and change them from “Disconnected” to “Routed”.
VSR is extremely powerful and flexible, allowing customers to easily fail over and fail back workloads between datacenters. With the addition of multi-region VMware Transit Connect, customers can now leverage VSR between AWS regions, further strengthening their data protection posture in the cloud. This gives customers a strong option for protecting their workloads in the event of a regional outage.
A big thanks to Arno Riehs, VMware Cloud Solutions Architect, and Angie Dewalt, AWS VMware Specialist Solutions Architect, for their help in the development of this content.