Major Issue with Change Block Tracking

I was having a conversation with a developer of virtualization backup software regarding Change Block Tracking (CBT), and he got me very worried. It appears that there is a major issue with the way VMware handles the indexing of the ChangeID. The conversation went something like this.

Tom, You There?

Yes.

I think I’ve found a big problem with CBT. BRB. Yep, just confirmed the problem and it’s nasty. They’re reusing the UUIDs for Change Tracking 😮

OK Explain how this is an issue

OK, basically this is the problem. Say you want to patch your server, so you add a snapshot before you do it. You then patch your server and go home for the night. During the night VDR or whatever backs up your server, and in the morning you come in and everything is broken, so you revert the snapshot. As expected, all is now well again. You decide to leave the server alone and go off to investigate the problem with the patch, and it then gets backed up again that night. Now your backups are all corrupt and you’ve missed a whole day’s changes. Worse, the backup will never get those changes, and as all backups are linked incrementals, all future backups are now also going to be corrupt, and you won’t know until you restore. It’s because on revert they roll the ChangeID back.

“OK,” I said, “so give me an example.”

Now it must be said that at the time of this conversation I had never used vDR, not having access to enough 64-bit hardware in a lab, so I could not personally verify this information. However, I have the utmost respect for the source, and he is not one to glamorise things for his own or his company’s gain. So, carrying on, here is the rest of the conversation, paraphrased and cleaned up for readability.

So let me give you some actual ChangeIDs to show what I mean

as you can see from the log file /var/log/esxpress/log/esxpress.log:

Asking host for a list of dirty blocks for 'PHDBes.vmdk' from ChangeID: '52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/731'

OK explain what we are seeing here

Well, the first part is a UUID that only changes when you enable/disable change tracking on a VMDK, and the number after the / is an incrementing counter. What this shows is that the base VMDK ChangeID is 731. In other words, the last time a snapshot was committed, the ChangeID was set to 731.
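To make the two parts of that ChangeID concrete, here is a minimal Python sketch that splits a string in the format shown in the log line into its UUID and counter. The `ChangeId` type and `parse_change_id` function are hypothetical names used purely for illustration; they are not part of esXpress or of any VMware API.

```python
from typing import NamedTuple


class ChangeId(NamedTuple):
    uuid: str     # per the conversation: changes only when tracking is enabled/disabled
    counter: int  # per the conversation: an incrementing counter


def parse_change_id(raw):
    """Split 'UUID/counter' into its two parts (hypothetical helper)."""
    uuid, sep, counter = raw.strip().rpartition("/")
    if not sep or not uuid or not counter.isdigit():
        raise ValueError("not a UUID/counter ChangeID: %r" % raw)
    return ChangeId(uuid, int(counter))


cid = parse_change_id("52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/731")
print(cid.counter)
```

Two ChangeIDs with the same UUID can then be compared by counter alone, which is what the rest of the conversation relies on.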

Now if another snapshot is added, the ChangeID would most likely be '52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/732' or '52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/733'.

when VMware is asked for the changed blocks from say 733, it’ll give me the changes back to 732 not 731

but if I revert from 733 to the base, my ChangeID goes back to 731

then I add a snapshot, it gives me ChangeID 732

I say "hey what’s changed" and it says "oh 732, nothing"

even though I’ve had a whole days changes, because they have reused 732

so you then don’t back up any of the blocks that changed that day and never will
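The sequence above can be played through in a toy Python model. This is only a sketch of the behaviour as described in the conversation, not VMware's implementation: the `ToyDisk` class, its methods, and the block numbers are all made up for illustration, and the exact IDs in play differ slightly from the ones quoted.

```python
class ToyDisk:
    """Toy model: a counter that rolls back on revert (an assumption from the post)."""

    def __init__(self, base_id, blocks):
        self.base_id = base_id
        self.current_id = base_id
        self.blocks = dict(blocks)
        self.base_blocks = dict(blocks)  # contents of the base VMDK
        self.deltas = {}                 # ChangeID -> blocks written since the previous ID
        self.pending = set()

    def write(self, block, data):
        self.blocks[block] = data
        self.pending.add(block)

    def add_snapshot(self):
        self.current_id += 1
        self.deltas[self.current_id] = self.pending  # a reused ID overwrites its old record
        self.pending = set()
        return self.current_id

    def revert_to_base(self):
        self.blocks = dict(self.base_blocks)
        self.pending = set()
        self.current_id = self.base_id   # the rollback the post complains about

    def changed_since(self, change_id):
        changed = set()
        for i in range(change_id + 1, self.current_id + 1):
            changed |= self.deltas.get(i, set())
        return changed


disk = ToyDisk(731, {10: "old", 11: "old", 20: "old"})
backup, last_id = dict(disk.blocks), 731       # full backup at the base ChangeID

disk.add_snapshot()                            # admin snapshot before patching (732)
disk.write(10, "patched")
disk.write(11, "patched")
current = disk.add_snapshot()                  # first nightly backup (733)
for block in disk.changed_since(last_id):
    backup[block] = disk.blocks[block]
last_id = current

disk.revert_to_base()                          # counter rolls back to 731
disk.write(20, "new data")                     # a day's worth of changes
current = disk.add_snapshot()                  # second nightly backup reuses 732
reported = disk.changed_since(last_id)
truly_different = {b for b in disk.blocks if backup[b] != disk.blocks[b]}
print(reported, truly_different)
```

In this model the second nightly backup is told that nothing changed, while three blocks actually differ from what the backup holds: the two reverted patch blocks and the one written during the day.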

That is stupid. It gets even more FUBAR if you use snapshot trees: they reuse ChangeIDs all over the place, and it never gives changes back to the base VMDK, it gives them back to the last ChangeID in the DB. That means you can miss bits, or get the wrong parts, when you use the ‘goto’ in the snapshot manager.

Now if a revert caused the UUID to change, that would fix everything. Alternatively, they should always give the changes from the base ChangeID to the current ChangeID, not from the last ChangeID to the current ChangeID.

1->2->3

i ask changes for 3

i get 2->3

they should give me 1->3

i could then ask changes for 2

1->3 minus 1->2 is the difference

so I get the same functionality

but the way they do it, I lose 1->2 if I don’t have ChangeID 2
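Here is the 1->2->3 argument as a small Python sketch using sets of block names. The block names and the `changes_from` helper are hypothetical; the point is only to show what is lost when the host answers with the last step instead of everything since the base.

```python
# Hypothetical dirty-block sets for each step of the chain 1 -> 2 -> 3.
steps = {(1, 2): {"a", "b"}, (2, 3): {"b", "c"}}


def changes_from(start, current):
    """Union of every step between start and current (a cumulative report)."""
    changed = set()
    for i in range(start, current):
        changed |= steps[(i, i + 1)]
    return changed


last_step_only = changes_from(2, 3)       # what the post says you get when asking for 3
from_base = changes_from(1, 3)            # what the developer would like to get
missed = from_base - last_step_only       # lost if the backup never saw ChangeID 2
subtracted = from_base - changes_from(1, 2)
print(missed, subtracted)
```

With this toy data, a backup that only ever saw ChangeID 1 misses block "a" when it is handed 2->3. One caveat under these assumptions: block "b" is written in both steps, so a literal set subtraction of 1->2 from 1->3 drops it. Copying everything in the from-base answer is the safe reading, at the cost of re-copying some blocks.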

and remember the ChangeID increments every time a user adds a snapshot

try this

try it with VDR see if you get the same thing

Now this is the worrying part: it affects any product that uses CBT, so not only esXpress but the Veeam products and vDR too.

My next question was “So what can be done about this then?”

We put a workaround in our code: if a backup is taken of a machine that has a snapshot, we force the ChangeID to *, which means it will copy ‘all dirty blocks’. Now this is obviously slower, but it is 100% reliable.
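As a rough illustration of that workaround, the decision could look something like the Python sketch below. This is not the esXpress code: the function name and parameters are hypothetical, and falling back to * when there is no previous ChangeID at all is an extra assumption of mine rather than something stated in the conversation.

```python
ALL_BLOCKS = "*"  # the special ChangeID from the post: copy 'all dirty blocks'


def change_id_to_request(last_change_id, has_snapshot):
    """Pick the ChangeID to ask the host about (hypothetical helper).

    If the VM has a snapshot, the stored ChangeID is not trusted and the
    slower but reliable '*' is used instead. The same fallback is applied
    when no previous ChangeID is known (an assumption, not from the post).
    """
    if has_snapshot or not last_change_id:
        return ALL_BLOCKS
    return last_change_id


print(change_id_to_request("52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/731", True))
```

The trade-off is the one described above: a VM with a snapshot gets a slower backup, but the result does not depend on the counter being trustworthy.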

Now, if verified, this is a pretty serious flaw, I think you will agree.

18 comments

7 pings

  1. Yep. Surprised VMware did not pick up on this earlier or at least TEST it!

  2. Patrick, I have one question: has your company raised this issue with VMware as a support ticket?

    • Tom on May 11, 2010 at 1:45 pm

    Please update this blog entry if you receive any new information.

    Thank you, Tom

    1. I will do. I know that PHD are in the process of writing a white paper on the subject.

  3. Alex Mittell is working on a whitepaper with screenshots to prove this, although we’re sure it’s 100% replicable.

    I’m confirming the support ticket status for you.

  4. Hi Tom, I have confirmed that we have informed VMware personnel about this issue so they are aware.

  5. I also talked with the team and they are aware of the issue. As a workaround, you can make sure that the next backup taken after a revert is a full backup. If the backup application does not allow control of this, this can be achieved by deleting the “*-ctk.vmdk” files for that particular VM.

    Thanks for filing the bug. A heads up ahead of time would have been appreciated. You guys both know how to get ahold of us! 🙂

    John

    • Brendan on May 11, 2010 at 8:18 pm

    Patrick, can you please clarify when your company raised this issue with VMware (just today?), and what the SR ID is?

  6. We tried three times through our TAP manager to raise this issue but received no response. This information was discovered months ago, but it appears it was not taken seriously enough by VMware support to warrant a response.

    I’m sorry if my response sounds harsh, but as a TAP of VMware I expected to receive at least a “thank you for reporting the issue, we see the same thing” email in return. In light of this and the seriousness of the problem, I felt it necessary to share the information with Tom, him being an independent third party, to confirm that the issue was indeed real.

    We have a suggested fix for the issue that I feel would benefit your product for all users of ESX Change Block Tracking. If you would kindly contact me, I would be more than happy to work with your engineers to get the issue resolved in the correct way.

    Regards,

    Alex (alex@phdvirtual.com)

  7. Oh and I mentioned this openly at the London VMUG on the 6th of May to other vendors and VMware employees, so I have made no secret of this. I was intending to contact the VMware guy I talked to who seemed that he would be able to help me get it resolved, but I only landed back in the USA late last night and Tom’s post had already gone up. Apologies for any offense caused – I just want this problem fixed asap, nothing more.

    Alex

    • Brendan on May 12, 2010 at 2:41 am

    Just noticed John’s post above. This is really starting to look nasty… I found a thread posted by the same vendor on VMware Communities a while ago. Apparently they knew about the issue months ago, and even had a workaround in place since their previous release. But it looks like they decided to keep it low profile, and did not tell anyone – not even VMware (!) – until now.

    I see 2 possible reasons for this:

    1. They wanted to make the information public along with the announcement of their new Citrix backup product a few days ago, in order to paint VMware in a bad light and give everyone more reasons to consider Citrix.

    2. They kept it quiet in the hope that customers of other backup vendors would start experiencing corrupt backups and complain about it. And then they would say: look, we do not have this issue. But this did not happen because the scenario is pretty unlikely (reverting snapshot on production system = dataloss). So finally, they decided it is time to play this card anyway.

    Whatever it is, this definitely does not present this vendor in a good light. These are quite dirty and cheap tactics, if you ask me.

    Anyhow… time to chase my backup vendor for a hotfix… 🙂

    1. Brendan, please read the comments by Alex Mittell. I can see your position, but it seems a little harsh in its tone.

  8. We had no intent to hide this. I did not want to post about it on an open forum without evidence it was really a problem and get accused of “vendor bashing”. It took some time to confirm that other vendor products (including vDR) all experienced the same problems I had been seeing with our implementation, and when we attempted to report the issues through the TAP program, we just got no response. (And we still haven’t. I requested another contact today after this blog was posted. If anyone at VMware is reading this, please feel free to contact me directly!)

    As I said, we have a workaround, and we’ve made it public for all of the competition to use. This of course is of no benefit to us market-wise, but we don’t want to see people losing their data. However, what I would really like is for VMware to contact us so I can offer my advice (for what it is worth, having made a product that uses this specific feature) on the best way to implement a change to the way ChangeIDs work that covers all bases elegantly.

    In my VMTN post in the backup section (esxpress 4.0) I even offer a solution viable to all vendor users, clear out old backups and start again, this time avoiding the issue (avoiding taking backups with snapshots still on a VM, and avoiding reverts/gotos). Of course if you have never reverted a snapshot or used trees, then you need not worry.

    I’m sorry you feel that I’ve been underhanded about this but as I hope many in the community will attest that is not my way, I am purely a technologist, not a marketing person.

    Alex

    • MB-NS on May 14, 2010 at 2:15 pm

    Hello,

    thank you for reporting this issue.
    I raised it on Veeam Forums and thought it would be worth pointing to their feedback here.
    Subject on Veeam forums :
    http://www.veeam.com/forums/viewtopic.php?f=2&t=3699&p=15139#p15014

    Regards

  9. Hey Alex, I was out all week at EMC World and am just now catching up on this conversation. Thanks for going through channels, and sorry the information didn’t seem to get to the right people.

    The team has been engaged and there is now a public kb that should track any developments on this issue: http://kb.vmware.com/kb/1021607 I think someone should reach out to you and make sure we’re on the same page about a fix.

  10. Excellent. Thanks Jon! I’ll keep an eye on the KB, and my email address is in my first comment above if anyone wants to contact me I’d be glad to offer my findings on the issue / suggestions on a fix. 🙂

    Cheers,

    Alex

  11. In case anyone viewing this thread is interested in how Vizioncore handles this, there is no problem in the vRanger Pro 4.5 implementation for CBT.

    Vizioncore has posted the details on its Backup 2.0 blog site:
    vcommunity.vizioncore.com/…/no-problem-in-vranger-pro-4-5-with-the-vmware-cbt-defect.aspx

    Thanks all! Kellyp

    • Warren Brown on July 12, 2010 at 3:10 pm

    Thanks for the detailed info. I filed a ticket long ago and VM support Webex’d into my servers to fix it. They never resolved it. We are a lab environment and so we use a ton of snapshots. VDR was refusing to back up one VM. They looked through the vmx and directory and fell back to the “you shouldn’t use snapshots to run from”. The machines are running fine, but this one would not back up. VDR logged some crazy message that it could not find a specific -flat.vmdk file, which was correct since that file did not exist. Well, when I read this post, I went and looked and there was a -ctk.vmdk file with the same base name and number as the phantom flat file. This -ctk.vmdk file was in the middle of the snapshot sequence. I deleted it and the one other ctk file, then put the system back into VDR and it backed up perfectly. CBT was causing VDR to look for non-existent files. I am strongly leaning to adding this:
    ctkDisallowed = "TRUE"
    into the vmx files of all my VMs where users routinely use snapshots. This will turn off CBT in that VM. Yes, VDR will slowly plod along without change block tracking, but at least the backups will be valid.

    I will reopen my old ticket with VM support and update it.

  1. […] This post was mentioned on Twitter by tom_howarth and cody_bunch, PlanetVM Net. PlanetVM Net said: New post on PlanetVM.NET http://tinyurl.com/2bvud3m […]

  2. […] This post was mentioned on Twitter by . said: […]

  3. […] You can find out more about this CBT issue on Tom Howarth’s blog post! […]

  4. […] is more information on Tom Howarth's blog site which basically covers the jist of the problem. Major Issue with Change Block Tracking | PlanetVM As far as I know esXpress 4.0 is the only backup product to date that has confirmed a current […]

  5. […] corrupted incremental backups when using vSphere’s Change Block Tracking (CBT). Howarth’s post Major issue with Change Block Tracking recounts his conversation and exploration of the problem with the developer. In summary, Howarth […]

  6. […] 0 Concerned about the CBT problem in VMware, and the implications for vRanger Pro 4.5 which includes CBT support? No need to be. vRanger Pro is […]

  7. […] Change Block Tracking Backup Bug (Plantvm.net) […]

Comments have been disabled.
