Whilst the current global economic meltdown and the recent <a href="http://news.bbc.co.uk/1/hi/technology/7907583.stm">Google e-mail</a> service outage may seem entirely different types of event, there is some degree of commonality. Both represent catastrophic cascading failure within large, complex distributed systems.
The analogy unfortunately finishes there.
Google were up and running again in a couple of hours, whilst the world's economies may take a decade to recover. However, the central theme – how to avoid systemic catastrophic failure within complex systems – remains of deep concern to system architects and economists alike.
Where does that leave “Cloud Computing”? Quite simply: don’t believe the hype. Public Cloud infrastructures will continue to fail, hopefully infrequently, but almost certainly in a spectacular manner. The next generation of Public Cloud will need to be built upon a more modular resource landscape (swarms of geographically dispersed, meshed data centre nodes) – with a suitably advanced distributed and partitionable Cloud Operating System.
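The “swarm of dispersed nodes” idea can be sketched in miniature. The toy below (all data centre names and functions are hypothetical, and this is nothing like a real cloud operating system) treats a set of geographically dispersed nodes as interchangeable, so losing any one node – or a whole partition – degrades the service rather than destroying it:

```python
import random

def call_with_failover(nodes, request):
    """Try each node in a randomised order; the first healthy node wins."""
    last_error = None
    for node in random.sample(nodes, len(nodes)):
        try:
            return node(request)
        except ConnectionError as exc:  # this node (or its partition) is down
            last_error = exc
    raise RuntimeError("all nodes unreachable") from last_error

# Simulated swarm: two data centres lost, one still healthy.
def dc_london(req):
    raise ConnectionError("partition lost")

def dc_oregon(req):
    raise ConnectionError("partition lost")

def dc_singapore(req):
    return f"handled {req}"

result = call_with_failover([dc_london, dc_oregon, dc_singapore], "req-1")
print(result)  # -> handled req-1
```

The point of the randomised order is that no single node is structurally privileged – exactly the property a cascading failure exploits when one “master” resource sits on every code path.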
Unfortunately the same is true of the current generation of Grid Provisioning and Virtualization Management software solutions increasingly used by large corporations. Use of this technology will end in tears for a number of large IT departments: too much visible complexity, too little automation. Like the economic meltdown, these solutions fail to account for the outlier risks that cause systemic failure within complex systems.
The answer? Well, it's not a programming language (sorry, Erlang!), nor a specific piece of middleware, nor a specific replication technology, nor classic clustering.
To start the journey one must first realize that…
Agility and Robustness are simply two faces of the same coin.