12 October 2009

Danger in the clouds!

OK, everyone, let's take a deep breath here.

Any kind of architecture might carry the risk of losing all your customer data in one catastrophic event -- if it's poorly designed or poorly operated. Microsoft/Danger's loss of customer data was due to a design failure, an operations failure, or both. It's also possible that the loss was due to a calculated risk: that known design or operations flaws were nevertheless judged unlikely to lead to a loss. Or all three. Either way, it was not due to any particular feature of cloud computing.

Secure computing gets a lot of attention today -- but mostly in the attacker/malware aspects. To be sure, malware and other attacks are significant and serious risks. But another important aspect of security is reliability. Even if your data were absolutely impregnable to attackers, in the event of irrecoverable data loss your customers are just as out of luck as if they'd been hacked. There are decades of best practices on how to maintain data reliably, yet naturally, losses still occur.

It's impossible, of course, to guarantee against data loss with 100% certainty. There are always events that have some small but nonzero chance of occurring and that are capable of causing catastrophic loss in any system. And in the real world of engineering, there is always a trade-off between cost and function. Generally speaking, the more you engineer a system to be reliable, the more the system costs. Money isn't unlimited, and so there is only so much reliability one can realistically achieve with any given budget.

The best kind of risk is the one you're aware of -- the one for which you can estimate the chance of it occurring, the cost if it does occur, and how you would recover from the event. Not every risk is even imaginable, much less predictable. In the best case, Microsoft/Danger was aware of the kind of risk that existed in their system, engineered appropriately around that risk, operated with awareness of that risk, and simply got hit with an unlucky event. In the worst case, they were unaware of a poor design or slipshod operations.
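To make the "chance times cost" idea concrete, here is a minimal back-of-the-envelope sketch. The function names and all the numbers are hypothetical, chosen only for illustration; they say nothing about what Microsoft/Danger's actual figures were, and a real risk assessment would involve far more than one multiplication.

```python
def expected_annual_loss(probability_per_year: float, cost_if_it_occurs: float) -> float:
    """Expected yearly cost of a known risk: chance of the event times its cost."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability_per_year must be between 0 and 1")
    return probability_per_year * cost_if_it_occurs


def mitigation_pays_off(
    probability_before: float,
    probability_after: float,
    cost_if_it_occurs: float,
    mitigation_cost_per_year: float,
) -> bool:
    """True if the reduction in expected loss exceeds what the mitigation costs."""
    saved = (
        expected_annual_loss(probability_before, cost_if_it_occurs)
        - expected_annual_loss(probability_after, cost_if_it_occurs)
    )
    return saved > mitigation_cost_per_year


# Hypothetical numbers: a 50% yearly chance of a 1000-unit loss,
# halved to 25% by a mitigation that costs 100 units per year.
print(expected_annual_loss(0.5, 1000.0))
print(mitigation_pays_off(0.5, 0.25, 1000.0, 100.0))
```

The point is only that a risk you can put numbers on is one you can weigh against a budget; the risks you never imagined don't show up in the arithmetic at all.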

None of the above has anything to do with 'cloud computing'.

Now, 'cloud computing' does have some general features which do change the risk profile of this sort of event.

  • In contrast to a system where customer data is always stored locally to the customer, a system where data is stored centrally is more likely to experience a loss of data across multiple customers. But central data storage is neither a necessary nor a sufficient feature of cloud computing. 
  • In contrast to a system where customer data never transits the Internet, a system where it does cross the Internet is more likely to experience a loss related to such transit. Internet transit is a necessary feature of cloud computing, but not a sufficient one. Customer data travels across the Internet in many other ways that are unrelated to cloud computing.

Proper architectures and proper operations will be made with awareness of these risk profiles and account for them in line with the costs of the system. But hey, that ain't exactly rocket sci...actually, come to think of it, that is rocket science; or at least rocket engineering.

One can argue that the above points represent the very essence of why cloud computing may be more prone to this kind of problem: it encourages centralized data storage and transmission of data across untrusted networks. Well, yes. Different architectures have different risks, different benefits, and different economies. As cloud-based architectures become more and more pervasive, engineers and architects will need to adapt to patterns and models that are appropriate to the cloud and its unique characteristics. (In my opinion, these disciplines should become part of a modern computer-systems education. I digress...)

But this doesn't seem to be the problem in the MS/Danger/T-Mobile case. Sounds like they just screwed up in any number of ways. Lay this at the feet of ordinary human failings, but not of cloud computing.