With all the hype about moving towards the cloud, it does make you wonder sometimes: “Are the public cloud and its current infrastructure ready?”
Especially if you look at recent outages.
Yesterday and today there was the Skype outage, which looks like it is going to take two days to resolve. The reason? A software update that went wrong (oops).
About a week ago Facebook was offline, apparently also due to updates.
Twitter has even invented a new term for its capacity problems: the “fail whale”.
With VMware, you might have noticed that our beloved VMTN forums went offline for a whole weekend during the update, and they’re still coping with upgrade pains.
In November, the registration administration of SIDN, the Dutch .nl TLD registry, was down for days. Here the trouble started with diagnosing some problems, and the attempt to resolve them snowballed into more problems, such as a broken switch followed by database corruption.
Also a week ago, a complete data center here in the Netherlands went dark for an hour due to a power failure. Getting everything back up and running reliably of course took many hours.
Obviously a whole data center going offline is more serious than Twitter not being online. But the question to ask yourself is: how much do you depend on someone else’s infrastructure? What does it cost if that goes offline? Do you have a procedure to follow if your colocated servers go down hard?
Do you think you can do better when hosting locally? The infrastructures in my examples are not exactly peanuts and are designed by professionals who do this full time, yet we still see problems. If you are running a private cloud, you most likely have direct access to the hardware, so you can work on a fix yourself and have a direct influence on any problems that might occur. But you might also create more havoc, as you’re less likely to have plans in place for these types of problems. If you do run your own fully redundant cloud, do you have regular test windows to verify that your expected redundancy actually works the way it should?
In the case of using a public cloud, you might not even be able to contact the party you depend on. The data center in the example here didn’t really communicate about the failure until 6 hours after the problem. Their phone lines use VoIP (of course), and once those lines worked again they were busy anyway, as everyone wanted to contact them about the failure at the same time. Their Twitter feed was silent until five and a half hours after the incident. This was a huge missed opportunity. How will you communicate with your customers if your supplier isn’t communicating with you?
Did your company calculate how much it is going to cost if the new public cloud you will use is not available for a whole afternoon? Does the SLA you have with your supplier reimburse you enough for the damage you suffer in such a case? I don’t think so.
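To make that calculation concrete, here is a minimal back-of-the-envelope sketch in Python. Every figure in it is a made-up placeholder, and the model is my own simplification: the outage cost is just lost revenue plus idle staff, and the SLA reimbursement is a credit expressed as a fraction of the monthly fee. Your real contract and cost structure will differ, so plug in your own numbers.

```python
def downtime_cost(hours_down, revenue_per_hour, staff_idle_cost_per_hour):
    """Rough direct cost of an outage: lost revenue plus idle staff."""
    return hours_down * (revenue_per_hour + staff_idle_cost_per_hour)


def sla_credit(monthly_fee, credit_fraction):
    """Supplier reimbursement, modeled as a fraction of the monthly fee."""
    return monthly_fee * credit_fraction


# All figures below are hypothetical placeholders, not real data.
cost = downtime_cost(hours_down=4, revenue_per_hour=2000,
                     staff_idle_cost_per_hour=500)
credit = sla_credit(monthly_fee=3000, credit_fraction=0.10)
shortfall = cost - credit  # the part of the damage you carry yourself

print(f"Outage cost: {cost:.2f}")
print(f"SLA credit:  {credit:.2f}")
print(f"Shortfall:   {shortfall:.2f}")
```

Even this crude model leaves out reputation damage and contractual penalties towards your own customers, so the real shortfall is likely to be larger than whatever it prints.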
Going to the cloud gives you a whole new set of problems to think about, ranging from legal to technical, as in the examples here. Not every one of those issues can be solved by running on a specific virtual platform. There’s a LOT more to think about.