We’ve Changed the Guard at Buckingham Palace; It’s Time for Hybrid – Vembu to the Rescue

In this, the eighth article in our series investigating the benefits of Vembu BDR for Virtualized Environments, we carry on examining Vembu’s migration capabilities. We all know that backing up your data is only one part of the equation. The ability to recover is the other, and arguably more important, side. This is where Vembu BDR really shines.

Again life is being kind to you, and once again you are sitting in your cube, monitoring your environment. OK, you know the score by now – playing Gorf! Well, you’re wrong: you’ve started re-playing Donkey Kong after watching the movie Pixels at Steve’s daughter’s birthday party. This is more fun, as both you and Steve can now have a real points-tally competition. All is calm in the world and you are planning the next phases of the company cloud migration.

Cast your mind back to our previous conversation, where we moved completely into Azure and started decommissioning our last on-premises datacenter. Well, there has been a change at the top, and the powers that be have decided that although Cloud is a great idea and enabler, the release of Windows Server 2016 and Azure Stack has brought some interesting options to the table. They now want to take a more hybrid position on their Cloud utilization and have therefore decided to bring the corporate crown jewels – the databases – back in house. You die a little inside, as this is what you actually advised in the first place, but hey, never mind.

Time to Hybrid


News: Veeam Enters Physical Backup Land

October 8, 2014: Today at the Las Vegas VeeamON conference, Veeam announced its first foray into the world of physical device backup. With the rather catchy name “Veeam Endpoint Backup Free,” the product, when it is released, will be able to back up a physical endpoint (read Windows-based operating system) to a NAS share or a Veeam backup repository.

First, a few things about the product:


Quest for VM Backup Solution

This goes along with my last post from around the time of VMworld 2011, where I was in dire need of a new backup solution for my virtual machines.  I figured I would share some of the good, the bad and the ugly as my journey to find a new backup vendor continues.

First off, I will say I have been using PHD (formerly esXpress) for about four years now. Back in the day I had great relations with the staff there, great support, and overall no issues with the backup software. Peace of mind is key for me – knowing that backups will run fine and allow me to keep my weekends free from unnecessary work.  Over the past few years, esXpress was bought out, members of the staff changed, and from my viewpoint their support and customer service dwindled.  We have been stuck using an outdated version for a good couple of years now to keep the Full/Delta model running with encryption. That worked for a while, but the last version was pretty buggy, they didn’t support it much longer, and support calls were met with various “well, you can disable this feature and it should work fine” responses – when the feature they wanted you to disable was simultaneous backups, a crucial part of keeping within the backup window.  All those factors caused us to rethink our backup solution.

The first criterion was that the solution could encrypt the backups for us, allowing us to offsite tapes without encrypting the tape as a whole.  The only solution I saw that fit was Acronis.  We currently use them for our desktop imaging and, despite a few bugs and hiccups here and there, it works pretty well. On top of that, being able to integrate the bulk backup solution into the same management window seemed like a perfect fit.  At VMworld 2011 they released vmProtect to go along with their Backup and Recovery 11 product.

Acronis vmProtect 6

Good – Easy to set up, allows sandbox restores to test backups, encryption built in (AES-256), easy creation of jobs

Bad – No centralized jobs or management.  Can’t simultaneously back up VMs per appliance; you would have to create multiple jobs with certain VMs and run them at the same time to get “simultaneous” backups.  I will be honest: after hearing about one VM at a time per job, I stopped testing right then and there, as it wouldn’t cut it for me.

Overall – Acronis actually seems to have put together a pretty good backup solution with vmProtect.  It was pretty easy to set up, configure and get going with your backups.  A very GUI-oriented web interface made it visually easy to see what was going on and what configurations had been made.   While it lacks larger-scale features, the absence of which could drive an administrator insane (no simultaneous backups), for SMBs with a few hosts or a small SOHO it could be a great fit.

Acronis Backup and Recovery 11

This is the product I was testing before vmProtect came out, after a couple of long discussions with other attendees and staff at the Acronis booth at VMworld 2011.  At VMworld, I was pretty set on Acronis being my solution, based on what I had seen on their page, initial testing, and experience with their workstation products.  After that point, things just went downhill for me and ABR 11 Virtual Edition.

Good – Encryption built into the backups, nice centralized management, multiple VMs backing up per appliance (up to ten I’ve been told).

Bad – This list may get a little long, as my experiences over the past three weeks haven’t been the best. The setup seems pretty easy, but in the end it isn’t easily upgradable, and they have released a patch or two recently. A red flag went off in my head at VMworld when even one of their technical guys said it was a pain to set up, though once it was up and running it worked fine.  Then there were the software glitches: if you search for Acronis errors you’ll find reviews of past versions of the product, and their own forum boards are littered with negative comments about them adding features while not fixing major flaws in the system.  The licensing server is pretty cumbersome, especially since they only offer a 15-day trial, which forces you to continually add licenses in; it gets cluttered pretty fast.  There is no clear documentation on access rights – what is needed where.  I can’t tell you how many times I would test a backup, move it to a larger backup job, even clone the backup job to keep the settings, only to have it error out with “access denied” messages.  Connecting directly to the appliances and running the backup there would work, but not from the management console running the same job.  I was working with an account rep, who I have no gripes with personally, but the process made getting support hard when I needed it: I would write up the problem for the account rep, who would forward it to the support person; the support person would write back to the account rep, who would forward it to me.  In the few WebEx sessions we did, he would troubleshoot the error messages by going directly to the appliances, and the backups would work there – but later that night the same job would fail when run from the management console.

Overall – In the end this solution just didn’t fit our needs.  We were looking for a newer solution that was reliable and well known; I felt I would be switching one unstable product for another, and that is something no one should do when it comes to disaster recovery.  There are just too many downsides, and too many negative comments on their own forum boards and other blogs, to continue trying to make a square peg fit in a round hole.  While I know there are a good handful of people who have it up and running successfully in their environments, after three weekends of trying to switch over and never having a successful full backup, I’m choosing to look elsewhere for a new solution.

Being prepared for the unexpected, and the expected

I will start off by saying I tend to consider myself prepared for most situations, thinking in flow-chart logic: if the result of A is this, move here; if not, move there.  Recently I found myself scrambling after our old SAN that stores our bulk backups took a dump and we lost the controller.

I can’t say this wasn’t expected, but I found myself without an appropriate plan of action, which made for a week and a half of long days and nights with my director on my case to get things back up and running.  While all I needed was a simple storage solution for the bulk backups to push to, there was also the counterpart of getting those backups to tape for offsite storage.  When this all happened there wasn’t a huge amount of panic – the best thing you can do is stay cool and collected and logically think, “what is the best option to move forward with?”

For me this was using our little Iomega ix4 NAS box, which allowed me to just create an SMB share and point the backups there.  It didn’t quite have the performance of our old IBM DS400 Fibre Channel box, but it did well enough to make me think, hey, this isn’t that bad.  I was able to rerun all the full backups from the weekend and was on my way to having my temporary solution in place. This is when part two of our backup process came into play: moving the backup files to tape.  I rudely discovered that I was not able to get Backup Exec to attach to that SMB share and back up the full files.  While I know I probably could have eventually figured out how to get Veritas to see those files, it didn’t seem like the best use of time given the timeframe for getting a new storage solution in place.

I then went in the direction of creating an iSCSI disk from the ix4 to the backup server running Backup Exec.  This in itself was a small challenge: we’re an FC shop and I had next to no iSCSI experience, but a bit of time on Google helped me muddle through the process of setting it up.  I copied the files off the SMB share onto a small USB storage drive, attached the iSCSI drive to the server, and copied the files back up.  I thought, well, this is good – now I can see the drive through Veritas, and I’ve always heard how good iSCSI is. Sweet, here is my solution!  This time around, performance on the iSCSI drive just tanked. I know iSCSI works and the performance is there – or companies like NetApp wouldn’t be in business anymore – so my configuration obviously wasn’t up to snuff.  So with attempt number two in the books, and no closer to finding a solid solution that would hold off my management from making a rash purchase I would have to live with for the next five to seven years, I started to panic a little.

After a week-long process of trying to get this up and running, staying up to check on performance during the backup window, I was getting really frustrated and knew my time was drawing short. The golden rule of IT kept playing over in my head – lose data, lose your job – and I was going on a week of no bulk backups.  I was about to go to bed when my girlfriend reassured me, “you’ll find something soon,” and that is when the light went on in my head.  Earlier that day, while looking at our production SAN to see how much storage we would need to replace it (moving production storage over to backups), I remembered there was a 700GB chunk of storage not being used.  The funny thing was, the DS400 had 700GB of storage we were using for backups… a perfect match. There should be no performance issue moving onto a more powerful SAN, and it’s on the fiber network.  The biggest obstacle would be convincing my director to allow me to carve this logical drive out for backups, since he had always put his foot down that he wanted backup and production strictly separated.  That night I sent an email stating all the reasons why we should use this as our temporary solution; he agreed to my idea and off I went.

Right now that is the solution we’re using, and it seems to be holding up, at least for the time being.  The moral of the story is to think about what equipment you have, the performance of that equipment, and how you are prepared for the unexpected – or in my case, the expected.  While the Iomega ix4 had worked well for my R&D lab, it couldn’t handle 15-20 concurrent connections all dumping backup files.  If we didn’t have that 700GB on our production storage, I would probably still be in the weeds.  While you can’t always have the equipment onsite to counter any problem, it doesn’t hurt to think about where you are vulnerable, where your single points of failure are, and what you might have to do to counter those issues if they ever arise.

Virtualisation 102 – SRM

This article builds on the previous 101 overview of VMware’s Site Recovery Manager and covers the deployment process at a high level.  For detailed instructions on how to roll out SRM please refer to the documentation linked to in the previous article.

Preparation Phase

  • DR projects are notorious for soaking up engineering resource, requiring multiple tests and lots of beard-stroking to achieve a workable solution, so it’s vital to have a business sponsor to drive the project to completion.  Any half-hearted attempt at building a DR platform will fail.
  • Define the scope of what services – and therefore servers — need protecting, and be wary of scope creep.  Be certain the server list is correct.  (Don’t be surprised when testing proves that a long-forgotten legacy server is vital to the operation of your intranet!) 
  • Get quantifiable goals that the DR platform must meet, for example Recovery Time Objectives (RTOs). 
  • Set expectations of what SRM can do, especially the fact it can only protect virtual servers.  This could be a good motivator to P2V more servers, or require physical servers to have DR protection added as a pre-requisite, such as database log shipping. 
  • Ensure that VMTools is installed and running on all virtual servers to be protected, as SRM uses it for graceful shutdowns during failover.
  • Ensure networking is configured at the Recovery site:
    • Stretched VLANs – the easiest for server admins as current IP addresses can be used.
    • NAT – retain existing IP addresses for servers and change addresses via a firewall out to the rest of the network.
    • IP Customisation – SRM has the ability to alter IPs during failover.
  • Consider how Internet access will be cut over – do your web/mail servers need their public DNS addresses changing?  How will users access services from home or DR site?
  • Unless protecting the entire virtual estate the LUNs need to be split into two camps — “Protected” and “Unprotected”  – with all in-scope virtual servers moved onto Protected LUNs, the rest hosted on Unprotected.  This is because SRM fails over on a per-LUN basis.  Storage VMotion is great for performing storage realignment without downtime.
  • Deploy ESX servers at the Recovery site, with enough total capacity to run all of the protected servers at peak demand (e.g. over month-end).   Ask the storage admins to zone these servers to the Recovery SAN, along with at least one LUN for the cluster.
  • Deploy Domain Controller(s) & vCenter server at the Recovery site to manage the local cluster.  For medium-large sized deployments use a dedicated SQL server for the vCenter and SRM databases.
  • Arrange with the storage admins to have the Protected LUNs replicated over to the Recovery Site and presented to the DR ESX cluster.  (Don’t add these LUNs to the cluster yet, this is done by SRM during failover.)  Also ensure that Protected LUNs have 20% snapshot space available within their disk groups. 
  • While you’re annoying the storage admins, ask them for admin logins for both SANs, plus their management IPs.
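Since SRM fails over on a per-LUN basis, the Protected/Unprotected split above is worth sanity-checking before replication is configured. A minimal Python sketch (the inventory data is hypothetical, and this is not an SRM or PowerCLI API) that flags LUNs mixing protected and unprotected VMs – the ones needing Storage VMotion realignment – might look like:

```python
from collections import defaultdict

def find_mixed_luns(vms):
    """vms: list of (vm_name, lun, is_protected) tuples."""
    flags_by_lun = defaultdict(set)
    for _name, lun, protected in vms:
        flags_by_lun[lun].add(protected)
    # A LUN is "mixed" if it holds both protected and unprotected VMs
    return sorted(lun for lun, flags in flags_by_lun.items() if len(flags) > 1)

inventory = [
    ("web01",  "LUN-01", True),
    ("db01",   "LUN-01", True),
    ("test01", "LUN-02", False),
    ("app01",  "LUN-02", True),   # protected VM sharing an unprotected LUN
]
print(find_mixed_luns(inventory))  # → ['LUN-02']
```

Any LUN it flags needs its VMs shuffled (via Storage VMotion) until each LUN is wholly Protected or wholly Unprotected.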

Build Phase

  • Deploy the SRM database:  A database needs to be created manually at each site before the SRM installation is performed – step by step instructions for MS SQL Server, Oracle & IBM DB2 are provided in the Administration Guide.
  • Deploy the SRM server:  At each site build a new Windows virtual server, configure a static IP, patch it and join it to AD.  Create an ODBC connector to the SRM database.  Download the latest version of SRM and follow the installation wizard.  Although the SRM servers are paired up during configuration, during the install you point it at the local vCenter server.
  • Download and install the vendor-specific Storage Replication Adapter application on the SRM servers.
  • Install the vSphere Plugin for SRM on workstations used to manage the virtual infrastructures, including the vCenter servers themselves.  This adds an SRM icon to the Solutions & Applications section of the Home screen.
  • Connect the sites:   In the SRM plugin, supply details of the other site’s vCenter server.  Repeat at the other site.  Once the sites are connected you will be prompted for login details of the paired vCenter server whenever you enter the SRM plugin.
  • Connect to the Storage:   Supply details of the SAN’s management IPs and login credentials, to allow SRM to gather storage information and issue failover commands.
  • Inventory Mappings:   The next step is about matching up resources (clusters, folders, resource pools & network port groups) between Protected and Recovery sites, so that SRM knows how to assign resources at the recovery site. 
  • Create Protection Groups:   Protection groups are just that – groups of virtual servers to be protected within a Recovery Plan.  On the Protected site SRM server, create a group for each LUN.  This will create a placeholder object for each protected VM at the Recovery Site.
  • Create Recovery Plans:   Once created, the protection groups – and the status of the virtual machines within them – are visible in the Recovery Site’s SRM configuration.  Create a Recovery Plan and select one or more of these groups to include within the plan.  You can customise the Plan to suit your requirements, for example setting the shutdown and startup order of the servers – e.g. bring the database servers down last and up first – defining application timeouts, and mapping networking for Test failovers.
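The ordering rule in that last step – databases down last, up first – can be illustrated with a tiny sketch. The tier names and servers below are made up for the example; SRM itself achieves this through priority ordering in the Recovery Plan, but the logic is the same:

```python
# Lower tier number = starts first; shutdown is simply the reverse.
# Tiers and server names are illustrative only.
TIERS = {"db": 1, "app": 2, "web": 3}

def startup_order(servers):
    """servers: list of (name, tier) tuples."""
    return sorted(servers, key=lambda s: TIERS[s[1]])

def shutdown_order(servers):
    return list(reversed(startup_order(servers)))

servers = [("web01", "web"), ("db01", "db"), ("app01", "app")]
print([s[0] for s in startup_order(servers)])   # → ['db01', 'app01', 'web01']
print([s[0] for s in shutdown_order(servers)])  # → ['web01', 'app01', 'db01']
```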

Testing Phase

  • The first test is simply to verify that failover will actually work.  This can be done easily, and without disrupting production services, using the Test Failover option within SRM.  What this does is attach the Recovery Site LUNs to the ESX cluster and power up the VMs on a vSwitch that has no physical NICs attached, thereby completely isolating them from the real world.   DR server health can be checked via the Console, and when done the Test Failover will return everything to its original state.
  • Multi-Tiered Network Test Failover:  When configuring the Recovery Plan you can map each server’s vNIC to any port group on your Recovery Site cluster, so it’s possible to create your own isolated vSwitch as a replica of your production network, using port groups and virtual appliance firewalls such as Smoothwall, to simulate your various segments.
  • Once you are confident that the failover works you need to schedule a Full failover.  This involves a lengthy outage of production services and will therefore be high profile and carry some degree of risk, so test plans should be clearly documented and approved by management via your change control process (trans: cover your butt!).
  • An important point relating to Full failover is that there is no failback option.  To return to original working configuration you will need to repeat the failover process in the opposite direction.  SAN replication must be set up in the opposite direction — which could take a long time to complete – and fresh Protection Groups & Recovery Plans created before failover can be performed back to the Production environment, incurring a second outage.  This greatly increases the time and effort required to perform the whole test, so build in plenty of contingency time for testing and troubleshooting – there’s nothing worse than a ticking clock when performing a high-pressure piece of work.   Unfortunately, without performing a full test in each direction you cannot be certain that it will work should it be required.
  • Report back to the sponsor on how the failover went, with proof that applications all worked correctly, and get written signoff to confirm that the solution meets the scope and goals of the project.
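When reporting back to the sponsor, it helps to show measured recovery times against the RTO goals agreed in the preparation phase. A quick illustrative check – all service names and figures here are hypothetical:

```python
# RTO goals agreed with the sponsor vs. times measured during the
# full failover test (minutes). All figures are assumptions.
rto_goals = {"intranet": 240, "email": 120, "database": 60}
measured  = {"intranet": 185, "email": 95,  "database": 75}

def rto_breaches(goals, actual):
    """Return services that missed their RTO, with minutes over."""
    return {svc: actual[svc] - limit
            for svc, limit in goals.items()
            if actual[svc] > limit}

print(rto_breaches(rto_goals, measured))  # → {'database': 15}
```

An empty result is your evidence for signoff; anything else goes on the remediation list before the next test.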

Other Considerations

  • SRM only provides the technical means to fail over the critical virtual machines – to create a credible Business Continuity Plan it needs to be wrapped up in a process to determine such things as who has the authority to order a failover;  circumstances in which failover is required;  who can perform the failover, contact details, documentation library, etc.
  • An enterprise IT infrastructure will probably involve physical servers that provide business-critical services – Oracle databases, mainframes, firewalls to name but a few – that will impact on BAU planning.  SRM setup is likely to be one strand of many.
  • Bringing virtual machines online at the DR site is one thing, getting multi-tiered applications to work is quite another!  There may be lots of tinkering required to get things working – DNS or hosts file changes, ODBC connections, certificate errors and the like – particularly if required to run with different IP addresses.  Ensure these post-failover tasks get added to the failover documentation.
  • There should be thought given to how to run the Recovery site as Live over a prolonged period, involving backup & restore, monitoring & management.   It will greatly help in a disaster to replicate your IT ops data out to a share at the DR site.
  • It is important that the DR platform not just be ignored after signoff.  It is integral to the IT infrastructure and should be integrated into the Change process, so that changes to protected services, new services deemed in-scope for DR protection, or even SAN or SRM upgrades are managed in a manner that retains the integrity of the platform.  There also needs to be regular testing scheduled, even if only once a year, to validate the platform, check the quality of documentation and remind engineers of how the environment works.

esXpress runs into issues in 2010

Looks like the change to 2010 caused issues, very similar to what most feared when we changed into the year 2000. Moving into 2010 caused issues with all versions of esXpress, 3.1.* and 3.6, preventing them from running delta backups.  Right now the scheduled backups are running as straight full backups, which obviously adds more time and more storage space needed.  I don’t believe the dedup process is affected by this glitch.

It looks like they’re already testing a fix, and I have been told that they will be patching both the 3.1.* and 3.6 versions (for all of you still running 3.1.*, you won’t need to upgrade quite yet). For more updates please keep your eye on this thread – http://www.phdvirtual.com/forums?func=view&id=1898&catid=13

I’m hoping they get a patch out sometime today (1/5/10), but in the meantime make sure you have plenty of storage space to house all the full backups until they fix the issue.
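If you’re sizing that temporary storage, a rough back-of-envelope calculation helps. The figures below are purely illustrative assumptions, not esXpress defaults:

```python
# Extra space consumed while only full backups run, compared with the
# usual deltas. All sizes here are assumed for illustration only.
def extra_storage_gb(full_gb_per_run, delta_gb_per_run, runs_per_day, days):
    return (full_gb_per_run - delta_gb_per_run) * runs_per_day * days

# e.g. 20 VMs whose fulls total 800 GB vs. deltas totalling 40 GB,
# one run a day, assuming a patch lands within 3 days
print(extra_storage_gb(800, 40, 1, 3))  # → 2280
```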

esXpress Dramatically Reduces Seattle Financial Group’s VMware Recovery Time by 92%

Well, esXpress has done it again. Today PHD Virtual Technologies announced the winning of a new customer, Seattle Financial Group, who have standardised on esXpress for Business Optimization and Disaster Recovery in their Virtual Environment.