This article builds on the previous 101 overview of VMware’s Site Recovery Manager and covers the deployment process at a high level. For detailed instructions on how to roll out SRM please refer to the documentation linked to in the previous article.
- DR projects are notorious for soaking up engineering resources, requiring multiple tests and lots of beard-stroking to achieve a workable solution, so it’s vital to have a business sponsor to drive the project to completion. Any half-hearted attempt at building a DR platform will fail.
- Define the scope of what services – and therefore servers — need protecting, and be wary of scope creep. Be certain the server list is correct. (Don’t be surprised when testing proves that a long-forgotten legacy server is vital to the operation of your intranet!)
- Get quantifiable goals that the DR platform must meet, for example Recovery Time Objectives (RTOs).
- Set expectations of what SRM can do, especially the fact it can only protect virtual servers. This could be a good motivator to P2V more servers, or require physical servers to have DR protection added as a pre-requisite, such as database log shipping.
- Ensure that VMware Tools is installed and running on all virtual servers to be protected, as SRM uses it for graceful shutdowns during failover.
- Decide how networking will be set up at the Recovery site:
- Stretched VLANs – the easiest for server admins as current IP addresses can be used.
- NAT – retain existing IP addresses on the servers and translate addresses via a firewall out to the rest of the network.
- IP Customisation – SRM has the ability to alter IPs during failover.
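SRM applies IP customisation through per-VM settings in the Recovery Plan rather than code, but the mapping rule itself is simple: keep each server’s host offset and swap the subnet. A minimal sketch of that rule, with assumed production and DR subnets:

```python
import ipaddress

def remap_ip(prod_ip: str, prod_net: str, dr_net: str) -> str:
    """Map an IP from the production subnet to the same host offset
    in the DR subnet (both subnets must be the same size)."""
    ip = ipaddress.ip_address(prod_ip)
    src = ipaddress.ip_network(prod_net)
    dst = ipaddress.ip_network(dr_net)
    if ip not in src:
        raise ValueError(f"{prod_ip} is not in {prod_net}")
    if src.prefixlen != dst.prefixlen:
        raise ValueError("subnets must be the same size")
    offset = int(ip) - int(src.network_address)
    return str(ipaddress.ip_address(int(dst.network_address) + offset))

# Assumed example: a web server at 10.1.20.15/24 lands at the same
# host offset in the DR subnet 10.99.20.0/24.
print(remap_ip("10.1.20.15", "10.1.20.0/24", "10.99.20.0/24"))
```

Keeping the host offset consistent makes the post-failover address plan predictable, which simplifies firewall rules and DNS updates at the DR site.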
- Consider how Internet access will be cut over – do your web/mail servers need their public DNS addresses changing? How will users access services from home or DR site?
- Unless protecting the entire virtual estate, the LUNs need to be split into two camps — “Protected” and “Unprotected” – with all in-scope virtual servers moved onto Protected LUNs and the rest hosted on Unprotected. This is because SRM fails over on a per-LUN basis. Storage vMotion is great for performing storage realignment without downtime.
- Deploy ESX servers at the Recovery site, with enough total capacity to run all of the protected servers at peak demand (e.g. over month-end). Ask the storage admins to zone these servers to the Recovery SAN, along with at least one LUN for the cluster.
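The sizing exercise above is worth doing on paper before ordering hardware. A rough sketch, with assumed per-VM peak figures and an assumed host specification (all numbers illustrative):

```python
import math

# Assumed peak-demand figures per protected VM (e.g. from month-end stats)
peak_ghz = {"erp-db": 12.0, "erp-app": 8.0, "intranet-web": 4.0, "mail-gw": 6.0}
peak_gb  = {"erp-db": 64, "erp-app": 32, "intranet-web": 16, "mail-gw": 24}

HOST_GHZ, HOST_GB = 33.6, 128   # assumed spec of one DR ESX host
HEADROOM = 0.8                  # plan to run hosts at no more than 80%

def hosts_needed(demands, per_host):
    """Hosts required to cover total peak demand within the headroom limit."""
    return math.ceil(sum(demands.values()) / (per_host * HEADROOM))

# Size for whichever resource (CPU or RAM) is the constraint, plus one
# spare host for HA/maintenance.
n = max(hosts_needed(peak_ghz, HOST_GHZ), hosts_needed(peak_gb, HOST_GB)) + 1
print(f"{n} hosts required at the Recovery site")
```

Sizing to peak demand rather than average is deliberate: a DR invocation over month-end with an undersized cluster turns one disaster into two.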
- Deploy Domain Controller(s) & vCenter server at the Recovery site to manage the local cluster. For medium-large sized deployments use a dedicated SQL server for the vCenter and SRM databases.
- Arrange with the storage admins to have the Protected LUNs replicated over to the Recovery Site and presented to the DR ESX cluster. (Don’t add these LUNs to the cluster yet, this is done by SRM during failover.) Also ensure that Protected LUNs have 20% snapshot space available within their disk groups.
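The 20% snapshot headroom is easy to verify before a test failover fills it up. A minimal check, assuming illustrative disk-group figures in GB:

```python
# Assumed disk-group capacity/usage figures in GB (illustrative)
disk_groups = {
    "dg-prod-01": {"capacity": 2000, "used": 1500},
    "dg-prod-02": {"capacity": 4000, "used": 3500},
}

def snapshot_ok(dg, reserve=0.20):
    """True if at least `reserve` of the group's capacity is free
    for the snapshots SRM takes during test failovers."""
    free = dg["capacity"] - dg["used"]
    return free >= dg["capacity"] * reserve

for name, dg in disk_groups.items():
    status = "OK" if snapshot_ok(dg) else "needs more snapshot space"
    print(name, status)
```

A group that passes today can fail after normal data growth, so this is worth re-checking before every scheduled DR test.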
- While you’re annoying the storage admins, ask them for admin logins for both SANs, plus their management IPs.
- Deploy the SRM database: A database needs to be created manually at each site before the SRM installation is performed – step by step instructions for MS SQL Server, Oracle & IBM DB2 are provided in the Administration Guide.
- Deploy the SRM server: At each site build a new Windows virtual server, configure a static IP address, patch the OS and join it to AD. Create an ODBC connector to the SRM database. Download the latest version of SRM and follow the installation wizard. Although the SRM servers are paired up later during configuration, during the install you point each one at its local vCenter server.
- Download and install the vendor-specific Storage Replication Adapter application on the SRM servers.
- Install the vSphere Plugin for SRM on workstations used to manage the virtual infrastructures, including the vCenter servers themselves. This adds an SRM icon to the Solutions & Applications section of the Home screen.
- Connect the sites: In the SRM plugin, supply details of the other site’s vCenter server. Repeat at the other site. Once the sites are connected you will be prompted for login details of the paired vCenter server whenever you enter the SRM plugin.
- Connect to the Storage: Supply details of the SAN’s management IPs and login credentials, to allow SRM to gather storage information and issue failover commands.
- Inventory Mappings: The next step is about matching up resources (clusters, folders, resource pools & network port groups) between Protected and Recovery sites, so that SRM knows how to assign resources at the recovery site.
- Create Protection Groups: Protection groups are just that: groups of virtual servers to be protected within a Recovery Plan. On the Protected site’s SRM server, create a group for each LUN. This will create a placeholder object for each protected VM at the Recovery Site.
- Create Recovery Plans: Once created, the protection groups – and the status of the virtual machines within them – are visible in the Recovery Site’s SRM configuration. Create a Recovery Plan and select one or more of these groups to include within the plan. You can customise the Plan to suit your requirements, for example setting the shutdown and startup order of the servers – e.g. bring the database servers down last and up first – defining application timeouts, and mapping networking for Test failovers.
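SRM’s priority groups are configured in the plan itself, but the underlying rule is worth writing down: startup order is the reverse of shutdown order, with databases first up and last down. A sketch of that ordering, using assumed server names and tiers:

```python
# Assumed priority tiers for a Recovery Plan: tier 1 starts first,
# ties start together; shutdown runs in the reverse order.
priority = {"erp-db": 1, "erp-app": 2, "intranet-web": 3, "mail-gw": 3}

# sorted() is stable, so servers in the same tier keep their listed order
startup_order = sorted(priority, key=lambda vm: priority[vm])
shutdown_order = list(reversed(startup_order))

print("start:", startup_order)
print("stop: ", shutdown_order)
```

Deriving shutdown order from startup order, rather than maintaining two lists, keeps the plan consistent as servers are added to scope.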
- The first test is simply to verify that failover will actually work. This can be done easily, and without disrupting production services, using the Test Failover option within SRM. This attaches the Recovery Site LUNs to the ESX cluster and powers up the VMs on a vSwitch that has no physical NICs attached, keeping them completely isolated from the Real World. DR server health can be checked via the Console, and when done the Test Failover returns everything to its original state.
- Multi-Tiered Network Test Failover: When configuring the Recovery Plan you can map each server’s vNIC to any port group on your Recovery Site cluster, so it’s possible to create your own isolated vSwitch as a replica of your production network, using port groups and virtual appliance firewalls such as Smoothwall, to simulate your various segments.
- Once you are confident that the failover works you need to schedule a Full failover. This involves a lengthy outage of production services and will therefore be high profile and carry some degree of risk, so test plans should be clearly documented and approved by management via your change control process (trans: cover your butt!).
- An important point relating to Full failover is that there is no failback option. To return to original working configuration you will need to repeat the failover process in the opposite direction. SAN replication must be set up in the opposite direction — which could take a long time to complete – and fresh Protection Groups & Recovery Plans created before failover can be performed back to the Production environment, incurring a second outage. This greatly increases the time and effort required to perform the whole test, so build in plenty of contingency time for testing and troubleshooting – there’s nothing worse than a ticking clock when performing a high-pressure piece of work. Unfortunately, without performing a full test in each direction you cannot be certain that it will work should it be required.
- Report back to the sponsor on how the failover went, with proof that applications all worked correctly, and get written signoff to confirm that the solution meets the scope and goals of the project.
- SRM only provides the technical means to fail over the critical virtual machines – to create a credible Business Continuity Plan it needs to be wrapped up in a process to determine such things as who has the authority to order a failover; circumstances in which failover is required; who can perform the failover, contact details, documentation library, etc.
- An enterprise IT infrastructure will probably involve physical servers that provide business-critical services – Oracle databases, mainframes, firewalls to name but a few – that will impact on BAU planning. SRM setup is likely to be one strand of many.
- Bringing virtual machines online at the DR site is one thing, getting multi-tiered applications to work is quite another! There may be lots of tinkering required to get things working – DNS or hosts file changes, ODBC connections, certificate errors and the like – particularly if required to run with different IP addresses. Ensure these post-failover tasks get added to the failover documentation.
- Give thought to how the Recovery site would be run as Live over a prolonged period, covering backup & restore, monitoring and management. It will greatly help in a disaster to replicate your IT ops data out to a share at the DR site.
- It is important the DR platform not just be ignored after signoff. It is integral to the IT infrastructure and should be integrated into the Change process, so that changes to protected services, new services deemed in-scope for DR protection, or even SAN or SRM upgrades are managed in a manner that retains integrity of the platform. There also needs to be regular testing scheduled, even if only once a year, to validate the platform, quality of documentation and remind engineers of how the environment works.