I was having a conversation with a developer of a virtualization backup product regarding CBT, and he got me very worried. It appears that there is a major issue with the way VMware handles the indexing of the ChangeID. The conversation went something like this:
Tom, You There?
I think I’ve found a big problem with CBT. BRB. Yep, just confirmed the problem and it’s nasty. They’re reusing the UUIDs for Change Tracking 😮
OK, explain how this is an issue.
OK, basically here is the problem. Say you want to patch your server, so you add a snapshot before you do it. You then patch your server and go home for the night. During the night VDR or whatever backs up your server, and in the morning you come in and everything is [sic: broken], so you revert the snapshot. As expected, all is now well again. You decide to leave the server alone while you go off to investigate the problem with the patch, and it then gets backed up again that night. Now your backups are all corrupt: you’ve missed a whole day’s changes and, worse, it’ll never get those changes. And since all backups are linked incrementals, all future backups are now also going to be corrupt, and you won’t know until you restore. It’s because on revert they roll the ChangeID back.
OK, I said, so give me an example.
Now it must be said that at the time of this conversation I had never used vDR, not having access to enough 64-bit hardware in a lab, so I could not personally verify this information. But I have the utmost respect for the source, and he is not one to glamorise things for his own or his company’s gain. So, carrying on, here continues the paraphrasing of the conversation, cleaned up for readability.
So let me give you some actual ChangeIDs to show what I mean.
As you can see from the log file /var/log/esxpress/log/esxpress.log:
Asking host for a list of dirty blocks for 'PHDBes.vmdk' from ChangeID: '52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/731'
OK explain what we are seeing here
Well, the first part is a UUID that only changes when you enable or disable Change Tracking on a VMDK, and the number after the / is an incrementing counter. What this shows is that the base VMDK ChangeID counter is 731; in other words, the last time a snapshot was committed, the ChangeID was set to 731.
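To make the format concrete, here is a minimal sketch (my own illustration, not esxPress code) that splits a ChangeID of the form described above (a UUID, then a slash, then a counter) into its two parts:

```python
# Illustrative only: split a CBT ChangeID of the form "<uuid>/<counter>"
# into its two parts, per the description above.

def parse_change_id(change_id: str):
    """Return (uuid, counter) from a "<uuid>/<counter>" ChangeID string."""
    uuid, _, counter = change_id.rpartition("/")
    return uuid, int(counter)

uuid, counter = parse_change_id(
    "52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/731"
)
print(counter)  # 731
```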
Now if another snapshot is added, the ChangeID would most likely be '52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/732' or '52 da d3 58 9d 98 ce b6-44 9b 7d 51 6d 8c bd cd/733'.
When VMware is asked for the changed blocks from, say, 733, it’ll give me the changes back to 732, not 731.
but if I revert from 733 to the base, my ChangeID goes back to 731
then I add a snapshot, and it gives me ChangeID 732
I say “hey, what’s changed?” and it says “oh, 732? nothing”
even though I’ve had a whole day’s changes, because they have reused 732
so you then don’t back up any of the blocks that changed that day, and never will
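The failure sequence above can be simulated with a toy model. This is a hypothetical sketch of the behaviour being described, not VMware code: writes are tagged with the current counter, a revert rolls the counter back, and a "changed since" query only reports writes tagged at or after the counter you ask about.

```python
# Toy model of the revert bug described above (hypothetical, not VMware code).

class ToyCBT:
    def __init__(self, base_counter):
        self.counter = base_counter  # current ChangeID counter
        self.writes = {}             # counter -> blocks written at that counter

    def write(self, block):
        self.writes.setdefault(self.counter, set()).add(block)

    def snapshot(self):
        self.counter += 1            # adding a snapshot bumps the counter
        return self.counter

    def revert(self, counter):
        self.counter = counter       # the bug: the counter rolls back
        self.writes = {c: b for c, b in self.writes.items() if c < counter}

    def changed_since(self, counter):
        # Only reports blocks written at or after the given counter.
        out = set()
        for c, blocks in self.writes.items():
            if c >= counter:
                out |= blocks
        return out

disk = ToyCBT(731)
disk.snapshot()                   # -> 732: snapshot taken before patching
disk.write("patch")               # patch the server
disk.snapshot()                   # -> 733: nightly backup adds its snapshot
disk.revert(731)                  # morning: revert to base, counter -> 731
disk.write("days-work")           # a whole day's changes at counter 731
disk.snapshot()                   # -> 732 again: the counter is REUSED
missed = disk.changed_since(732)  # backup asks "what changed since .../732?"
print(sorted(missed))             # [] -- the day's work is silently skipped
```

If the revert also changed the UUID half of the ChangeID, a backup product could notice the mismatch and fall back to a full read instead of silently missing the day’s writes.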
That is stupid. It gets even more FUBAR if you use snapshot trees: they reuse ChangeIDs all over the place, and it never gives changes back to the base VMDK, it gives them back to the last ChangeID in the DB. Which means you can miss bits, or get wrong parts, when you use the ‘Go to’ in the snapshot manager.
Now, if a revert caused the UUID to change, that would fix everything. Or they should always give the changes from the base ChangeID to the current ChangeID, not from the last ChangeID to the current ChangeID.
I ask changes for 3
I get 2->3
they should give me 1->3
I could then ask changes for 2
1->3 minus 1->2 is the difference
so I get the same functionality
but the way they do it, I lose 1->2 if I don’t have ChangeID 2
and remember, the ChangeID increments every time a user adds a snapshot
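The arithmetic being proposed here can be sketched in a few lines. Assuming an API that returned every delta relative to the base ChangeID, a client holding any older ID could derive its increment by set subtraction (the dictionary and block names below are made up for illustration):

```python
# Sketch of the proposed base-relative scheme (illustrative names only).

def delta(changes_from_base, have_id, want_id):
    """Blocks to fetch when you hold have_id and want to reach want_id."""
    return changes_from_base[want_id] - changes_from_base[have_id]

changes_from_base = {
    1: {"a"},            # base -> 1: block "a" changed
    2: {"a", "b"},       # base -> 2: blocks "a" and "b" changed
    3: {"a", "b", "c"},  # base -> 3: blocks "a", "b" and "c" changed
}

print(delta(changes_from_base, 2, 3))  # {'c'}: same answer as a 2->3 query
print(delta(changes_from_base, 1, 3))  # 1->3 works even without ChangeID 2
```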
Try it with VDR and see if you get the same thing.
Now this is the worrying part: it affects any product that uses CBT, so not only esxPress, but the Veeam products and vDR too.
My next question was “So what can be done about this then?”
We put a workaround in our code: if a backup is taken of a machine that has a snapshot, we force the ChangeID to *. What this means is it will copy ‘all dirty blocks’. Now this is obviously slower, but it is 100% reliable.
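That workaround might look something like the sketch below (a hypothetical helper, not the actual esxPress code): when the VM already carries a snapshot, the stored ChangeID is discarded and * is passed instead, so the query returns every dirty block rather than an untrustworthy increment.

```python
# Hypothetical sketch of the workaround described above, not esxPress code.

def change_id_for_backup(stored_change_id, vm_has_snapshot):
    """Pick the ChangeID to query dirty blocks against."""
    if vm_has_snapshot:
        # "*" asks for all dirty blocks: slower, but immune to counter reuse.
        return "*"
    return stored_change_id

print(change_id_for_backup("52 da .../731", vm_has_snapshot=True))   # *
print(change_id_for_backup("52 da .../731", vm_has_snapshot=False))
```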
Now, if verified, this is a pretty serious flaw, as I think you will agree.