Complex Systems and Failure

Tim Harford’s ADAPT was one of those spontaneous airport bookshop purchases.

In summary: a good read with a relevant message. Short-termism stifles true innovation; it is only by attempting novel, high-risk activities that we can hope to make substantive changes and ultimately succeed.

ADAPT provides some advice for putting this philosophy into practice:

  • All interesting systems (ecological, economic, social, political) are Complex.
  • ‘Complexity’ is not the issue: tight coupling is the issue.
  • Tight coupling propagates failure; tight coupling must be avoided.
  • Information has context. Lose the context and much of the value of the information is lost.
  • Avoid overly centralised command and control. Rather, delegate the decision-making process.
  • Where possible, act locally.

Those interested in ‘Complex Adaptive Systems’ will be aware of the substantive body of background research that underpins ADAPT’s arguments.

Why Complex Systems Fail

And yet these principles are rarely put into practice by the software industry.

Response to Failure: A tightly-coupled system

This is perplexing, as the ‘fail fast’ mantra is not new: it just seems to have been largely ignored. Berkeley’s Recovery Oriented Computing program demonstrated these ideas almost a decade ago, yet we see little evidence of them being incorporated into the latest ‘Cloud’ and ‘Virtualisation’ platform offerings from the dominant software vendors. Indeed, peel back the marketing covers, and the usual suspects continue to pursue ‘High Availability’ or ‘Fault-Tolerant’ approaches.

The folly of this is nicely explained by ‘How Complex Systems Fail’ (University of Chicago’s Cognitive Technologies Laboratory). This paper covers some of the same ground as ADAPT, but explains the problem from an IT operations perspective. Fault tolerance masks component failure, and so, paradoxically, such systems are more vulnerable to severe cascading or systemic failures.
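
To make the contrast concrete, here is a minimal, purely illustrative sketch (all names are invented; it is not drawn from the Cook paper): the first client masks a failing dependency by silently retrying, the second fails fast and reports the failure so the surrounding system can isolate and replace the sick component.

    // Illustrative sketch: masking a failure versus failing fast.
    // RemoteQuote, MaskingClient and FailFastClient are hypothetical names.
    interface RemoteQuote {
        double latestPrice(String symbol) throws Exception;
    }

    class MaskingClient {
        private final RemoteQuote quote;
        MaskingClient(RemoteQuote quote) { this.quote = quote; }

        // Retries indefinitely and swallows the exception: neither the caller
        // nor Operations ever learn that the dependency is unhealthy.
        double latestPrice(String symbol) {
            while (true) {
                try {
                    return quote.latestPrice(symbol);
                } catch (Exception e) {
                    // failure hidden here, until it eventually cascades
                }
            }
        }
    }

    class FailFastClient {
        private final RemoteQuote quote;
        private final Runnable reportFailure; // e.g. flags the component for replacement
        FailFastClient(RemoteQuote quote, Runnable reportFailure) {
            this.quote = quote;
            this.reportFailure = reportFailure;
        }

        // Fails immediately and visibly, so the runtime can isolate the fault
        // and replace the failed unit rather than letting the problem spread.
        double latestPrice(String symbol) {
            try {
                return quote.latestPrice(symbol);
            } catch (Exception e) {
                reportFailure.run();
                throw new IllegalStateException("quote service unavailable", e);
            }
        }
    }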

When these cascading events finally do occur, Operations are the only defence: Operations pick up the pieces!

The author (Richard I. Cook) argues a number of points, but the following two are, to my mind, the most important.

  • Safety is a characteristic of systems and not of their components.
  • Failure free operations require experience with failure.

Surely such fundamental principles should be at the core of modern ‘cloud’ platform runtimes? Surely failure recovery must be an integral part of any overall solution, with these behaviours exercised as part of normal ongoing runtime activities rather than as responses to rare Black Swan events? Finally, surely cloud environments should aim to be truly loosely coupled environments?!

Note: in my book, reliance on centralised message brokers or naive use of rigid ZooKeeper-style lock services is part of the problem, not the solution.

Stopping Complex Systems from Failing?

Markov Chain Analysis of a loosely-coupled ‘Target State’ driven platform

Thanks to early exposure to adaptive SOA frameworks like Jini, Paremus developed a strong intuition about the requirements for mission-critical ‘cloud’ environments. To provide concrete theoretical foundations, in 2005 Paremus used Markov Chain analysis to simulate the availability of traditional HA clusters, and contrasted these with alternative architectures we internally referred to as ‘No Frame of Reference’ (NFoR). A ‘NFoR’ architecture had no static control points and could continuously re-allocate software components as required.

To achieve this:

  • Component failure was visible within the runtime environment.
  • Loose coupling at all structural layers ensured that failure was effectively isolated.
  • As the architecture was extremely modular, only the smallest units needed to be replaced, and recovery was rapid.
  • Sophisticated ‘Target State Driven’ dependency management automatically replaced the failed units (a minimal sketch follows this list).
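
A minimal sketch of what ‘Target State Driven’ recovery can look like (the types and method names here are invented, not the Paremus API): a reconciliation loop periodically compares the declared target state with what is actually running, redeploys anything that has disappeared, and retires anything that is no longer wanted.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative 'target state driven' reconciliation loop; hypothetical API.
    class TargetStateReconciler {

        interface ComponentRuntime {
            Set<String> running();          // components currently observed as healthy
            void deploy(String component);  // (re)provision a component on available resources
            void retire(String component);  // remove a component that is no longer wanted
        }

        private final Set<String> targetState;  // the declared, desired set of components
        private final ComponentRuntime runtime;

        TargetStateReconciler(Set<String> targetState, ComponentRuntime runtime) {
            this.targetState = targetState;
            this.runtime = runtime;
        }

        // Run periodically: a failed unit simply vanishes from running() and is
        // redeployed on the next pass; failure is handled as routine housekeeping.
        void reconcile() {
            Set<String> running = runtime.running();

            Set<String> missing = new HashSet<>(targetState);
            missing.removeAll(running);
            missing.forEach(runtime::deploy);   // replace failed or absent units

            Set<String> surplus = new HashSet<>(running);
            surplus.removeAll(targetState);
            surplus.forEach(runtime::retire);   // remove anything outside the target state
        }
    }

In such a loop, recovering from a failure and rolling out a change to the target state are the same operation, which is what allows recovery to be exercised as part of normal runtime activity rather than as a special case.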

The results of the simulations were clear. A ‘No Frame of Reference’ runtime platform, embodying fail-fast and automated repair and recovery behaviours, significantly outperformed traditional static high-availability alternatives.

Even in the most volatile of environments, with multiple failures being injected in rapid succession, such platforms always settled back into a functional state.
Such solutions, because of their extreme agility, could also be rapidly reconfigured, shut down and re-started by Operations.
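
The underlying arithmetic is easy to illustrate with the simplest possible repairable-component model, a two-state Markov chain (the figures below are illustrative and not taken from the original Paremus study). With failure rate λ and repair rate μ, the steady-state availability is

    A = \frac{\mu}{\lambda + \mu} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}

For a component with an MTTF of 1,000 hours, cutting the mean time to repair from 4 hours (manual intervention) to 30 seconds (automated fail-fast detection and redeployment) lifts availability from roughly 99.6% to better than 99.999%. Attacking MTTR through automated recovery is generally far cheaper than trying to drive the failure rate of individual components towards zero.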

Hence, our own experiences were consistent with the advice offered by ADAPT and Richard I. Cook’s paper.

The following are fundamental requirements:

  • A high degree of structural modularity, as Modular Systems are Maintainable Systems.
  • Loose coupling between interacting software components (whether local or network-distributed).
  • Loose coupling between components and the underlying resources (physical or virtual).

With the following implications:

  • Structural modularity requires powerful dependency management.
  • Resource abstraction requires sophisticated ‘Target State’ provisioning / re-provisioning capabilities.

And for Paremus, the OSGi software modularity framework provided a compelling set of industry standards via which these capabilities might be achieved.
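
As a small illustration of what that dependency management looks like in practice (the bundle and package names here are invented), an OSGi bundle declares the packages it provides and consumes in its manifest, and the framework resolves them against semantic version ranges:

    Bundle-SymbolicName: com.example.pricing.engine
    Bundle-Version: 1.2.0
    Export-Package: com.example.pricing.api;version="1.2.0"
    Import-Package: com.example.quotes.api;version="[1.0,2.0)"

A bundle whose imports cannot be satisfied simply fails to resolve, so a broken dependency is made visible at deployment time rather than surfacing later as a runtime failure.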

It is worth emphasising that such capabilities are not a function of the programming language used. Choice of language does not in itself provide an answer: just the notation you might use to realise an answer. For this reason I see the increasing adoption of the OSGi modularity system as far more significant than recent Java developments, or even the emergence of languages like Scala.

Nor is resource ‘virtualisation’ relevant to achieving this goal. Virtualisation is an orthogonal and secondary concern! If you need to partition physical resources, by all means use virtual machines. If you need to partition a data centre, pursue an SDC (Software Defined Data Centre) strategy. But tread carefully! These solutions do not address the fundamental issues, and they risk introducing yet another complex, tightly coupled management layer.

That’s all for today!

If you are interested in further detail on Service Fabric concepts, see Paremus Service Fabric Concepts and Terminology.
