Today I was trying to add an NFS share from a NAS device to my vSphere lab servers. I’ve used these same NFS shares before, but because it was only a lab setup I did not care too much that the NFS shares were not on an isolated network. This weekend I decided to fix that sloppy setup and move the traffic to isolated NICs in their own VLAN. Simple, I thought, a quick job.
Somehow just moving the shares to a new network segment on another VLAN broke things. I was puzzled and should have asked Tom, I suppose, but as a good IT guy I had to dig until I found out why by myself. Continue reading “Problems while adding an NFS share to vSphere”
This article builds on the previous 101 overview of VMware’s Site Recovery Manager and covers the deployment process at a high level. For detailed instructions on how to roll out SRM please refer to the documentation linked to in the previous article.
- DR projects are notorious for soaking up engineer resource, requiring multiple tests and lots of beard-stroking to achieve a workable solution, so it’s vital to have a business sponsor to drive the project to completion. Any half-hearted attempt at building a DR platform will fail.
- Define the scope of what services – and therefore servers – need protecting, and be wary of scope creep. Be certain the server list is correct. (Don’t be surprised when testing proves that a long-forgotten legacy server is vital to the operation of your intranet!)
- Get quantifiable goals that the DR platform must meet, for example Recovery Time Objectives (RTOs).
- Set expectations of what SRM can do, especially the fact it can only protect virtual servers. This could be a good motivator to P2V more servers, or require physical servers to have DR protection added as a pre-requisite, such as database log shipping.
- Ensure that VMware Tools is installed and running on all virtual servers to be protected, as SRM uses it for graceful shutdowns during failover.
- Ensure networking is set up at the Recovery site. The options are:
  - Stretched VLANs – the easiest for server admins, as current IP addresses can be used.
  - NAT – retain existing IP addresses for servers and translate the addresses via a firewall out to the rest of the network.
  - IP Customisation – SRM has the ability to alter IPs during failover.
- Consider how Internet access will be cut over – do your web/mail servers need their public DNS addresses changing? How will users access services from home or from the DR site?
- Unless you are protecting the entire virtual estate, the LUNs need to be split into two camps – “Protected” and “Unprotected” – with all in-scope virtual servers moved onto Protected LUNs and the rest hosted on Unprotected ones. This is because SRM fails over on a per-LUN basis. Storage VMotion is great for performing storage realignment without downtime.
- Deploy ESX servers at the Recovery site, with enough total capacity to run all of the protected servers at peak demand (e.g. over month-end). Ask the storage admins to zone these servers to the Recovery SAN, along with at least one LUN for the cluster.
- Deploy Domain Controller(s) & vCenter server at the Recovery site to manage the local cluster. For medium-large sized deployments use a dedicated SQL server for the vCenter and SRM databases.
- Arrange with the storage admins to have the Protected LUNs replicated over to the Recovery Site and presented to the DR ESX cluster. (Don’t add these LUNs to the cluster yet, this is done by SRM during failover.) Also ensure that Protected LUNs have 20% snapshot space available within their disk groups.
- While you’re annoying the storage admins, ask them for admin logins for both SANs, plus their management IPs.
- Deploy the SRM database: A database needs to be created manually at each site before the SRM installation is performed – step by step instructions for MS SQL Server, Oracle & IBM DB2 are provided in the Administration Guide.
- Deploy the SRM server: At each site build a new Windows virtual server, configure a static IP, patch it and join it to AD. Create an ODBC connector to the SRM database. Download the latest version of SRM and follow the installation wizard. Although the SRM servers are paired up during configuration, during the install you point each one at its local vCenter server.
- Download and install the vendor-specific Storage Replication Adapter application on the SRM servers.
- Install the vSphere Plugin for SRM on workstations used to manage the virtual infrastructures, including the vCenter servers themselves. This adds an SRM icon to the Solutions & Applications section of the Home screen.
- Connect the sites: In the SRM plugin, supply details of the other site’s vCenter server. Repeat at the other site. Once the sites are connected you will be prompted for login details of the paired vCenter server whenever you enter the SRM plugin.
- Connect to the Storage: Supply details of the SAN’s management IPs and login credentials, to allow SRM to gather storage information and issue failover commands.
- Inventory Mappings: The next step is about matching up resources (clusters, folders, resource pools & network port groups) between Protected and Recovery sites, so that SRM knows how to assign resources at the recovery site.
- Create Protection Groups: Protection groups are just that, groups of virtual servers to be protected within a Recovery Plan. On the Protected site SRM server create a group for each LUN. This will create a placeholder object for each protected VM at the Recovery Site.
- Create Recovery Plans: Once created, the protection groups – and the status of the virtual machines within them – are visible in the Recovery Site’s SRM configuration. Create a Recovery Plan and select one or more of these groups to include within the plan. You can customise the Plan to suit your requirements, for example by setting the shutdown and startup order of the servers (e.g. bring the database servers down last and up first), defining application timeouts and mapping networking for Test failovers.
- The first test is simply to verify that failover will actually work. This can be done easily and without disrupting production services using the Test Failover option within SRM. What this does is attach the Recovery Site LUNs to the ESX cluster and power up the VMs on a vSwitch that has no physical NICs attached, thereby completely isolated from the Real World. DR server health can be checked via Console, and when done the Test Failover will return everything to the original state.
- Multi-Tiered Network Test Failover: When configuring the Recovery Plan you can map each server’s vNIC to any port group on your Recovery Site cluster, so it’s possible to create your own isolated vSwitch as a replica of your production network, using port groups and virtual appliance firewalls such as Smoothwall, to simulate your various segments.
- Once you are confident that the failover works you need to schedule a Full failover. This involves a lengthy outage of production services and will therefore be high profile and carry some degree of risk, so test plans should be clearly documented and approved by management via your change control process (trans: cover your butt!).
- An important point relating to Full failover is that there is no failback option. To return to the original working configuration you will need to repeat the failover process in the opposite direction. SAN replication must be set up in the opposite direction – which could take a long time to complete – and fresh Protection Groups & Recovery Plans created before failover can be performed back to the Production environment, incurring a second outage. This greatly increases the time and effort required to perform the whole test, so build in plenty of contingency time for testing and troubleshooting – there’s nothing worse than a ticking clock when performing a high-pressure piece of work. Unfortunately, without performing a full test in each direction you cannot be certain that it will work should it be required.
- Report back to the sponsor on how the failover went, with proof that applications all worked correctly, and get written signoff to confirm that the solution meets the scope and goals of the project.
- SRM only provides the technical means to fail over the critical virtual machines – to create a credible Business Continuity Plan it needs to be wrapped up in a process to determine such things as who has the authority to order a failover; circumstances in which failover is required; who can perform the failover, contact details, documentation library, etc.
- An enterprise IT infrastructure will probably involve physical servers that provide business-critical services – Oracle databases, mainframes, firewalls to name but a few – that will impact on BAU planning. SRM setup is likely to be one strand of many.
- Bringing virtual machines online at the DR site is one thing, getting multi-tiered applications to work is quite another! There may be lots of tinkering required to get things working – DNS or hosts file changes, ODBC connections, certificate errors and the like – particularly if required to run with different IP addresses. Ensure these post-failover tasks get added to the failover documentation.
- There should be thought given to how to run the Recovery site as Live over a prolonged period, involving backup & restore, monitoring & management. It will greatly help in a disaster to replicate your IT ops data out to a share at the DR site.
- It is important the DR platform not just be ignored after signoff. It is integral to the IT infrastructure and should be integrated into the Change process, so that changes to protected services, new services deemed in-scope for DR protection, or even SAN or SRM upgrades are managed in a manner that retains integrity of the platform. There also needs to be regular testing scheduled, even if only once a year, to validate the platform, quality of documentation and remind engineers of how the environment works.
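As a planning aid for the Protected/Unprotected LUN split described earlier in this list, here is a minimal Python sketch that flags any LUN hosting a mix of in-scope and out-of-scope virtual servers. The VM names, LUN names and the `find_mixed_luns` helper are all hypothetical; in practice the inventory would come from your own vCenter export rather than a hard-coded dictionary.

```python
from collections import defaultdict


def find_mixed_luns(vm_to_lun, protected_vms):
    """Return the LUNs that host both protected and unprotected VMs."""
    by_lun = defaultdict(lambda: {"protected": [], "unprotected": []})
    for vm, lun in sorted(vm_to_lun.items()):
        group = "protected" if vm in protected_vms else "unprotected"
        by_lun[lun][group].append(vm)
    # Failover happens per LUN, so a LUN holding both groups needs realigning.
    return {
        lun: groups
        for lun, groups in by_lun.items()
        if groups["protected"] and groups["unprotected"]
    }


# Hypothetical inventory: which LUN each virtual server currently lives on.
vm_to_lun = {
    "sql01": "LUN_A",
    "web01": "LUN_A",
    "test01": "LUN_A",
    "intranet01": "LUN_B",
    "dev01": "LUN_C",
}
# Hypothetical DR scope agreed with the business sponsor.
protected_vms = {"sql01", "web01", "intranet01"}

for lun, groups in find_mixed_luns(vm_to_lun, protected_vms).items():
    print(f"{lun}: move {groups['unprotected']} elsewhere, or add them to the DR scope")
```

Anything this reports is a candidate for a Storage VMotion before the Protection Groups are created.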
This is the first in a short series of high level articles intended to explain how some of VMware’s products and features work, to help anyone that needs to get up to speed quickly on a particular topic.
What Is It?
Site Recovery Manager is a VMware product that enables protection of a virtual infrastructure at a site level. Once configured, SRM can orchestrate recovery from a failure of the primary site, bringing virtual servers online quickly, and in a predetermined order, at the recovery site.
What Do I Need to Deploy It?
- A compatible SAN – The SRM application uses a vendor-specific Storage Replication Adapter (SRA) to talk to the SAN. Check the SRM Storage Compatibility Matrix to ensure your SAN is on the list (link below).
- Two SANs – Replication of data is done by the SANs, not by SRM, so the Recovery Site requires a SAN of the same type as the Protected Site SAN that has sufficient spare capacity to host a mirror copy of the Protected Site’s data. You can decide on a per-LUN level whether to replicate or not. But remember you will need more storage.
- SAN Replication – Terminology will vary between vendors, but basically a Master/Slave setup is required between the SANs, with all changes on the Protected SAN copied over – as close to real time as possible – to the Recovery SAN. This means you’ll need a fast link between sites, and you’ll need to pay for SAN licensing to enable mirroring between SANs.
- Good relations with your SAN engineers – the SRA requires an admin login to the SAN in order to interrogate it and initiate failovers.
- Recovery Site resources – there need to be ESX or ESXi servers at the Recovery Site with enough resources to run your protected servers. The Recovery site also requires its own vCenter and SRM server, linked to those at the Protected Site, along with any supporting servers such as Active Directory and DNS.
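To make "enough resources to run your protected servers" a little more concrete, the following is a rough Python sketch of a capacity check for the Recovery Site. Every host name, VM name and figure in it is made up for illustration, and it deliberately ignores hypervisor overhead and any HA headroom, so treat the result as a first sanity check rather than a sizing tool.

```python
def capacity_shortfall(hosts, vms):
    """Compare total Recovery Site capacity with the summed peak demand.

    Returns how much CPU (MHz) and memory (GB) is missing; zero means enough.
    """
    total_cpu = sum(host["cpu_mhz"] for host in hosts)
    total_mem = sum(host["mem_gb"] for host in hosts)
    need_cpu = sum(vm["peak_cpu_mhz"] for vm in vms)
    need_mem = sum(vm["peak_mem_gb"] for vm in vms)
    return {
        "cpu_mhz": max(0, need_cpu - total_cpu),
        "mem_gb": max(0, need_mem - total_mem),
    }


# Hypothetical Recovery Site hosts.
hosts = [
    {"name": "dr-esx01", "cpu_mhz": 24000, "mem_gb": 96},
    {"name": "dr-esx02", "cpu_mhz": 24000, "mem_gb": 96},
]
# Hypothetical protected servers, measured at their busiest (e.g. month-end).
vms = [
    {"name": "sql01", "peak_cpu_mhz": 8000, "peak_mem_gb": 64},
    {"name": "web01", "peak_cpu_mhz": 4000, "peak_mem_gb": 16},
    {"name": "intranet01", "peak_cpu_mhz": 2000, "peak_mem_gb": 8},
]

print(capacity_shortfall(hosts, vms))
```

Using peak rather than average demand is the important design choice here: the DR cluster has to cope on the worst day, not the typical one.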
How Much Does It Cost?
The core SRM product is licensed in units covering 25 managed servers, with a list price of US $11,250 per 25 VMs.
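For budgeting, the arithmetic can be sketched in a few lines of Python. This assumes that part-units cannot be bought, so the VM count is rounded up to the next multiple of 25; that rounding rule and the `srm_license_cost` name are my assumptions, while the unit size and list price are the figures quoted above.

```python
import math

PACK_SIZE = 25            # VMs covered by one license unit (figure quoted above)
PACK_LIST_PRICE = 11_250  # US$ list price per unit (figure quoted above)


def srm_license_cost(protected_vm_count):
    """Estimate the SRM list price, assuming whole 25-VM units only."""
    if protected_vm_count < 0:
        raise ValueError("VM count cannot be negative")
    packs = math.ceil(protected_vm_count / PACK_SIZE)
    return packs * PACK_LIST_PRICE


for count in (10, 25, 60):
    print(f"{count} protected VMs: US ${srm_license_cost(count):,}")
```

The step between 25 and 26 protected servers is worth noticing when you define the scope of the project.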
Other things to consider when budgeting for SRM:
- Remember the prerequisites such as SAN replication licensing;
- Extra Storage;
- Adequate WAN links for SAN replication;
- Provisioning ESX resources at the Recovery Site;
- Two vCenter servers and two SRM servers, along with the Windows licenses they require;
- Larger scale deployments should also consider storing the SRM database externally.
Where Do I Go From Here?
SRM Storage Compatibility Matrix: http://www.vmware.com/pdf/srm_storage_partners.pdf
VMware SRM Docs page: http://www.vmware.com/support/pubs/srm_pubs.html
Administering SRM: http://www.lulu.com/product/file-download/administering-vmware-site-recovery-manager-40/6522403 – Mike Laverick’s eBook is essential reading before deploying SRM; it costs less than £7, all proceeds go to Unicef, and you can get a hard copy too.
VMware Communities page for SRM: http://communities.vmware.com/community/vmtn/mgmt/srm – seriously folks, I cannot overstate how useful this resource is in your first steps to getting to grips with the product. It should also be your first stop when troubleshooting.
The next step in this series is a 102 post on how to set up a working SRM environment.
Planning storage is a simple thing: you go to your Storage Admins and say, “I need x LUNs of this size for my ESX servers, please,” and they say, “NO, we only do xGB LUNs,” or they breathe through their teeth like a motor mechanic or plumber and say, “Storage doesn’t grow on trees, you know. We don’t have much left. Are you sure you really need all that space?” and so on.
But I digress. 🙂
Continue reading “It is all a question about which path to take.”