OSGi: The Value Proposition?

In a recent blog, Hal Hildebrand makes the case for OSGi’s value proposition in terms of its ability to reduce long-term ‘complexity‘. Hal argues that whilst it may be harder to start with OSGi, as it initially appears more complex, for large applications and large teams it is ultimately simpler because the architecture is ‘modular’. A diagram along the lines of the following is used to emphasise the point.

Complexity Over Time?

As an ex-Physicist, I’m naturally interested in concepts such as ‘Complexity’, ‘Information’ and ‘Entropy’; and while I agree with Hal’s sentiments, I feel uneasy when the ‘complexity’ word is used in such broad-brush arguments. Indeed, I find myself asking: in what way is a modular system ‘simpler’? Surely a modular system exposes previously hidden internal structure, and while this is ‘necessary complexity’ (i.e. information describing the dependencies within the composite system), the system is nevertheless visibly more complex!

For those interested, the following discussion between physicists at a Perimeter Institute seminar concerning ‘information’ is amusing, illuminating and demonstrates just how difficult such concepts can be.

Before attempting to phrase my response, I visited the blog of Kirk Knoernschild – IMO one of the industry’s leading experts on modularisation – to see what he had to say on the subject.

Sure enough Kirk states the following:

As we refactor a coarse-grained and heavyweight module to something finer-grained and lighter weight, we’re faced with a set of tradeoffs. In addition to increased reusability, our understanding of the system architecture increases! We have the ability to visualize subsystems and identify the impact of change at a higher level of abstraction beyond just classes. In the example, grouping all classes into a single module may isolate change to only a single module, but understanding the impact of change is more difficult. With modules, we not only can assess the impact of change among classes, but modules, as well.

Hence, Kirk would seem to agree: as one modularises an application, complexity increases in the form of exposed structural dependencies. Note that one must be careful not to confuse this necessary complexity with accidental complexity, a subject of previous blog entries of mine – see Complexity Part I & Part II.

OSGi – Preventing ‘System Rot’?

Those who have worked in a large enterprise environment will know that systems tend to ‘rot’ over time. Contributing factors are many and varied but usually include:

  • Structural knowledge is lost as key developers and architects leave the organisation.
  • Documentation is missing and / or inadequate.
  • The inability to effectively re-factor the system in response to changing business requirements.

The third issue is really a ‘derivative’ of the others: As application structure is poorly understood, accidental complexity is introduced over time as non-optimal changes are made.

Hence, rather than trying to frame OSGi’s value proposition in terms of ‘complexity’, OSGi’s value is perhaps more apparent when framed in terms of the ‘necessary information’ required to manage and change systems over time?

Structural information loss over time for modular and non-modular System

Unlike a traditional system, the structure of a modular System is always defined: the structural information exposed by a correctly modularised system is precisely the necessary information (the necessary complexity) required for the long-term maintenance of that System.

In principle, at each point in time:

  • The components used within the System are known
  • The dependencies between these components are known
  • The impact of changing a component is understood

However, the value of this additional information is a function of the tooling available to the developer and the sophistication of the target runtime environment.
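By way of illustration – and this is a minimal sketch against the standard OSGi wiring API, not any particular product – the information in the list above can be read straight out of a running framework:

```java
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;
import org.osgi.framework.namespace.PackageNamespace;
import org.osgi.framework.wiring.BundleWire;
import org.osgi.framework.wiring.BundleWiring;

public final class StructureReport {

    /** Print, for every installed bundle, which bundles it imports packages from. */
    public static void report(BundleContext context) {
        for (Bundle bundle : context.getBundles()) {
            BundleWiring wiring = bundle.adapt(BundleWiring.class);
            if (wiring == null) {
                continue; // not resolved - its dependencies are not yet wired
            }
            System.out.println(bundle.getSymbolicName() + " " + bundle.getVersion());
            for (BundleWire wire : wiring.getRequiredWires(PackageNamespace.PACKAGE_NAMESPACE)) {
                Object pkg = wire.getCapability().getAttributes()
                        .get(PackageNamespace.PACKAGE_NAMESPACE);
                System.out.println("  imports " + pkg + " from "
                        + wire.getProvider().getSymbolicName());
            }
        }
    }
}
```

Tooling and runtimes differ in how much they do with this information; the point is simply that, in a modular system, it exists and is queryable at all.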

The Challenge: Simplifying while preserving Flexibility

Effectively dealing with both module and context dependencies is key to realizing OSGi’s true value in the enterprise.

To quote Kirk yet again:

Unfortunately, if modules become too lightweight and fine-grained we’re faced with the dilemma of an explosion in module and context dependencies. Modules depend on other modules and require extensive configuration to deal with context dependencies! Overall, as the number of dependencies increase, modules become more complex and difficult to use, leading us to the corollary we presented in Reuse: Is the Dream Dead:

The issue of module dependency management is well understood. Development tooling initiatives are underway to ease module dependency management during the development process; one example being the SIGIL project, recently donated by Paremus to Apache Felix.

However, Kirk’s comment with respect to ‘context dependencies‘ remains mostly unheard. From a runtime perspective, vendors and early adopters currently pursue one of the following two strategies:

  • Explicit management of all components: Dependency resolution is ‘frozen in’ at development time. All required bundles (or a list of required bundles) are deployed to each node in the target runtime environment; i.e. operations are fully exposed to the structural dependencies / complexities of the application (a naive sketch of this approach follows this list).
  • Use of an opaque deployment artifact: Dependency resolution is again ‘frozen in’ at development time. Here the application is ‘assembled’ at development time and released as a static opaque blob into the production environment. Operations interact with this release artifact, much like today’s legacy applications. While the dependencies are masked, as the unit of deployment is the whole application, this decreases flexibility, and if one considers the ‘Re-use Release Equivalence Principle’ partly negates OSGi’s value proposition with respect to code re-use.
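In its crudest form, the first strategy amounts to something like the following sketch (the bundle locations are purely hypothetical); every change to the dependency graph means editing and re-releasing this list by hand:

```java
import java.util.List;
import org.osgi.framework.Bundle;
import org.osgi.framework.BundleContext;
import org.osgi.framework.BundleException;

public final class StaticDeployer {

    // Dependency resolution 'frozen in' at development time: a hand-maintained,
    // ordered list of bundle locations (hypothetical names, for illustration only).
    private static final List<String> BUNDLES = List.of(
            "file:deploy/com.example.api-1.0.0.jar",
            "file:deploy/com.example.impl-1.0.0.jar",
            "file:deploy/com.example.app-1.0.0.jar");

    public static void deploy(BundleContext context) throws BundleException {
        for (String location : BUNDLES) {
            Bundle bundle = context.installBundle(location);
            bundle.start(); // fails at runtime if the list is incomplete or mis-ordered
        }
    }
}
```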

Both of these approaches fail with respect to Kirk’s ‘context dependencies’. As dependencies are ‘frozen in’ at development time, there is no ability to manage ‘context’ dependencies at runtime. Should conditions in the runtime environment, for whatever reason, require a structural change, a complete manual re-release process must be triggered. With these approaches, day-to-day operational management will at best remain painful.

In contrast, leveraging our Nimble resolver technology, Paremus pursues a different approach:

  • The runtime environment – a ‘Service Fabric’ – is model driven. Operations release and interact with a running Service via its model representation, this being an SCA description of the System. Amongst other advantages, this shields operations staff from unnecessary structural information.
  • The Service Fabric dynamically assembles each System, resolving all module AND context dependencies.
  • Resolution policies may be used to control various aspects of the dynamic resolution process for each System, providing a higher-level, policy-based hook into runtime dependency management.
  • The release artifacts are OSGi bundles and SCA System descriptions – conforming with the ‘re-use / release equivalence principle’.
  • The inter-relationship between all OSGi bundles and all Systems within the Service Fabric may be easily deduced.

The result is a runtime which is extremely flexible and promotes code re-use, whilst being significantly easier to manage than traditional environments. OSGi is an important element, but a high-level structural description, used in conjunction with a model-driven runtime, is an equally essential part of the story.

OSGi: The Value Proposition?

The short answer really is – “it depends on how you use it”!

Without a doubt, many will naively engage with OSGi, and will unwittingly increase operational management complexity beyond any benefits achieved by application modularization; see ‘OSGi here, there and everywhere’. However, for those that implement solutions that maximize flexibility and code re-use while minimizing management overhead, OSGi’s value proposition is substantial; and the runtime used is a critical factor in realising these benefits.

How Substantial?

To date my only benchmark is provided by an informal analysis made by a group of architects at a tier 1 Investment Bank in 2008. They estimated the potential OPEX cost saving per production application, assuming it were replaced with a Service Fabric equivalent; for the purpose of this blog, one may equate the Service Fabric with an adaptive, distributed OSGi runtime.

Cost savings in terms of

  • application release efficiency,
  • ongoing change management,
  • fault diagnostics and resolution, and
  • efficiency savings through code re-use

were estimated. The final figure suggested a year-on-year OPEX saving of 60% per application. Somewhat surprised at the size of the estimate, I’ve challenged the group on several occasions; each time the response was that the estimates were conservative.

To turn this into some real numbers, consider the following. A tier 1 investment bank may have as many as ~1,000 applications, each typically costing $1m per annum. Let’s assume that only 30% of the applications are suitable for migrating to the new world: at a 60% saving on each, we’re still looking at a year-on-year saving of roughly $180m (1,000 × 30% × $1m × 60%). Migration costs are not included in this, but these are short-term expenses. Likewise, neither are the cost savings realized by replacing legacy JEE Application Servers and middleware with the Service Fabric solution.

As always, ‘mileage may vary’ – but nevertheless, quite a value proposition for OSGi!

How Nimble is your OSGi runtime?

Hands up all of you managing OSGi dependencies via an editable list of bundles. Easy, isn’t it? It just works, right?!

Well actually – it ‘just works‘ for a single application running in a small number of containers. From an enterprise perspective you are unintentionally contributing to an impending complexity meltdown; an explosion of dependency and configuration management issues. And if you are unfortunate enough to end up supporting your own composite creations, you may well end up envying the fate of Prometheus and rueing the day you learnt to code.

Possibly harsh? But I’m not alone in voicing this concern!

In his recent article Reuse: Is the Dream Dead?, Kirk Knoernschild continues his efforts to educate the industry on the tension between code ‘re-use‘ and ‘simplicity of use‘. Kirk argues that as you increase potential re-use via lightweight, fine-grained components, the complexity of dependencies and necessary environmental configurations correspondingly increases, making these same components harder to use.

A simple concept, yet, if unaddressed, an issue that will make your life as an enterprise developer increasingly uncomfortable and help edge OSGi closer to that seemingly inevitable ‘trough of disillusionment‘.

Yet, from a development perspective the issue of dependency management is well understood.

Whilst development tooling was initially found wanting, a number of projects now exist to address this, including the SIGIL Eclipse plug-in which Paremus recently contributed to the Apache Felix project (SIGIL leverages Peter Kriens’ BND tool).

In contrast, the issue of dependency management in production is less immediately obvious, its impact more profound, and it is generally ignored. Consider:

  • Will aspects of the runtime environment affect the runtime dependencies within the application?
  • Will applications be isolated from each other, or might they run within the same JVM?
  • How are the released artifacts subsequently managed in the production environment with respect to ongoing bundle dependency and version management?

Echoing Kirk’s concerns, Robert Dunne started his presentation at OSGi DevCon Europe with the observation that ‘whilst modularity was good, its benefits are often undermined by dependency and configuration complexity‘. The subject of Robert’s presentation? The Paremus Nimble Resolver, which is our response to the concerns posed by Kirk.

Nimble is a high-performance runtime dependency resolver. To deploy a composite application to a Nimble-enabled runtime (i.e. the Paremus Service Fabric), one specifies:

  • The root component of the artifact.
  • And a set of associated policies and constraints.

Nimble then does the rest.

Presented with the ‘root‘, Nimble dynamically constructs the target composite, ensuring that the structural dependencies are resolved in a manner consistent with both organizational policies and the runtime environment within which it finds itself.

Nimble’s OSGi capabilities include:

  • Fragment attachment policies.
  • Optional import policies.
  • Import version range narrowing.
  • The ability to resolve dependencies on extender bundles (DS, ‘classic’ Spring, Spring DM, iPOJO) – see the sketch after these lists.

With Nimble policies allowing:

  • The configuration of selected extensions.
  • Flexible constraint requirement -> capability matching.
  • The ability to configure optional dependency resolution behaviors.
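The last of the OSGi capabilities listed above deserves a comment. One way a resolver can infer that a bundle needs an extender present at runtime is by inspecting well-known manifest headers. The sketch below illustrates the idea only – it says nothing about how Nimble actually implements this; of the header names used, Service-Component is the standard Declarative Services header, while Spring-Context is the header used by Spring DM:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import org.osgi.framework.Bundle;

public final class ExtenderSniffer {

    /** Infer required extenders from a bundle's manifest headers (simplified illustration). */
    public static Set<String> requiredExtenders(Bundle bundle) {
        Set<String> extenders = new LinkedHashSet<>();
        if (bundle.getHeaders().get("Service-Component") != null) {
            extenders.add("Declarative Services runtime (SCR)");
        }
        if (bundle.getHeaders().get("Spring-Context") != null) {
            extenders.add("Spring DM extender");
        }
        return extenders;
    }
}
```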

Not just OSGi: Nimble is a generic artifact resolver with a pluggable architecture. Any artifact type may be supported, with support currently available for:

  • OSGi Bundles
  • POJOs, ‘classic’ Spring & Spring DM
  • WAR
  • Configurations.

A Nimble-enabled runtime quite literally assembles, dynamically, all required runtime application and infrastructure service dependencies around the deployed business components! Specify a WAR artifact and Nimble will instantiate the appropriate Servlet engine dictated by the runtime policy attached to the WAR; Tomcat or Jetty, Sir? Specify a ‘Configuration‘, and Nimble responds by installing the target of the configuration and, of course, its dependencies.

Nimble not only directly addresses Kirk’s concerns, but goes on to radically transform our understanding of the responsibilities and capabilities of next-generation, composite-aware Service Platforms. Most importantly, Nimble was created to enable effective re-use whilst making life simpler for you and the organizations you work for.

Teleport or Telegraph?

If this blog entry were chiseled in stone, no currently existing technology would be capable of near instantaneous transportation of that stone. Perhaps quantum entanglement might one day provide the basis for Teleportation – yet much serious physics and engineering would be required to make this more than Science Fiction.

Yet the same information – in a binary format (Morse code) – could have been transmitted across a continent at near the speed of light over a hundred years ago.

Both approaches achieve the same result – transmission of information.

Sometimes identifying the correct approach, the correct perspective, is far more important than the amount of engineering effort you throw at a problem.

Which brings me to the following article.

So VMware need 2,000 people to build a resource orchestration layer? Certainly, trying to manage a resource landscape so that it appears unchanging to a population of legacy applications is extremely difficult!

The alternative?

Take a different perspective.

Build dynamic / agile applications that adapt to the changing characteristics of their operational environments.

Impaled on the Horns of an OPEX Dilemma

The finance industry are clearly having a tough time at present. As losses mount, CEOs & CIOs are increasingly scrutinizing the costs of doing business. One interesting metric: the cost of running a single production application – $1,000,000 per annum! Take the many thousands of applications typically used in a large finance house, and operational costs rapidly exceed the billion-dollar-per-annum mark.
Why is this?

Surely, over the last few years the Finance industry has increasingly driven down the price of enterprise software, to the point that an application server may now be charged at a few hundred dollars per node. Likewise, basic networking, servers and storage are cheaper than at any time in the past.

The problem isn’t the cost of the raw materials, but rather the fact that these organizations have built increasingly complex environments which must be maintained by an army of IT staff.

I’m probably not far off the mark in suggesting that 80% of the annual cost for each application relates to the support and development staff required to maintain it and keep it running.

And the choices available to the CxO?
  • Use Cheaper Resource: Ship operations out to China, India or Mexico! While attractive on paper as a quick fix, there is a catch. Wages tend to normalize as time progresses, with the cost of an initially cost-effective workforce rising to the level the market will bear. Indeed, it has a name: “Free Market Dynamics”. Hence, within a reasonable timeframe (~5 yrs) the cost advantage will have evaporated; meanwhile the company is still left with a complex, manually intensive operational environment. Traditional third-party outsourcing – of which several failed examples exist from the late 1999 / early 2000 period – falls into this category. This approach does nothing to address the root cause of the spiraling operational costs – complexity! In short, a strategy guaranteed to fail in the medium / long term.
  • Reduce the Number of Applications: If the cost relates to the number of applications, simply forcing down the number of applications in use will initially reduce OPEX costs. Whilst a reasonable strategy for some, the Financial Services industry is highly adaptive and constantly needs to evolve its applications and services. Hence, a “no new applications” policy merely results in additional functionality being bolted on to existing systems – increasing the complexity and associated costs of the remaining applications.
  • Use Technology to Reduce Headcount: The IT industry has collectively failed to provide real solutions here! Despite a flood of Automated Run-Book, Monitoring, Configuration Management, Package / OS Deployment and Virtualization Management products, humans are still very much “in the loop”, directly concerned with all aspects of every software service in the runtime environment. Environments are more complex than ever!

So what is stopping the IT industry developing the right products? Simply, the industry continues to fail to realize that automating the existing is not sufficient. A fundamental, radical change in perspective with respect to how distributed systems are built and maintained is needed to address the Complexity Crisis organizations now face. Funnily enough, this is what Infiniflow has been developed to address.

And what is holding back the users of the technology?

  • The fear of change!
  • The linear relationship between status and managed headcount.
  • And most importantly, a severe shortage of sufficiently talented engineers and architects who have the vision and determination to drive such changes through their organizations (Paremus refers to these rather special individuals as Samurai).

So if you are a frustrated Samurai, contact us at Paremus; we can introduce you to many like-minded individuals 🙂

Meanwhile, if you are a CEO / CIO with the desire to tackle the root causes of your organization’s IT complexity, why not drop me an e-mail and we’ll explain how we might be able to help; specifically, you may find the dramatic impact that Infiniflow has on operational cost of great interest.

Complexity Part II

Complexity – Part II: It all depends on the Question you ask!

I previously argued that the apparent complexity of a system varies dramatically with respect to the type of question you ask. The answer to one question may make a given system seem inordinately complex, yet ask another similar question, from a slightly different perspective, and the same system appears very simple.

Hence, it is the question that dictates where the line is drawn separating hidden and exposed system complexity.

Assume I want to deploy a set of services to an Enterprise. These services have specific runtime requirements and interdependencies. The usual question asked is…

  • “What compute resources do I have, what are their individual configurations and capabilities?”

In response, an extensive list of resources and associated configurations/capabilities is presented, which now needs analyzing. Like the positions of nodes in a lattice, the initial question, and its subsequent answer, expose too much unnecessary information!

In contrast, if I ask,

  • “Out of the cloud of potential resource which may or may not exist, what sub-set resources currently satisfies the following conditions?”

The response requires no further thought. Whilst I may never know the configuration of everything, I’ll always know whether there are resources capable of servicing my stated requirements. As the response to the question is simple, and requires no effort on my part, I have no issue in re-asking the question as many times as required; this is essential, as the one thing I do know is that the environment WILL change!!
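As it happens, OSGi’s requirement/capability machinery expresses exactly this second style of question. A minimal sketch using the framework’s LDAP-style filters – the attribute names here are my own, purely illustrative:

```java
import java.util.Dictionary;
import java.util.Hashtable;
import org.osgi.framework.Filter;
import org.osgi.framework.FrameworkUtil;
import org.osgi.framework.InvalidSyntaxException;

public final class ResourceQuery {

    public static void main(String[] args) throws InvalidSyntaxException {
        // "What currently satisfies the following conditions?" as a declarative filter.
        Filter requirement = FrameworkUtil.createFilter(
                "(&(cpu.cores>=4)(memory.gb>=8)(region=EU))");

        // A resource advertised only in terms of its capabilities; nothing else is exposed.
        Dictionary<String, Object> resource = new Hashtable<>();
        resource.put("cpu.cores", 8);
        resource.put("memory.gb", 16);
        resource.put("region", "EU");

        System.out.println("satisfies requirement: " + requirement.match(resource));
    }
}
```

I never need to know, or care, what else the resource could tell me about itself; I simply re-ask the question whenever the environment changes.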

Re-visiting the lattice analogy.

Because it is simple to measure emergent macroscopic properties such as pressure, temperature and volume, it is easy to re-measure these and so deduce the relationship between them over time – e.g. Boyle’s Law. This would have been a significant challenge if the microscopic quantities of position, mass and velocity for each particle had been used instead!
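To make the contrast concrete: Boyle’s Law relates just two macroscopic quantities,

PV = constant (at fixed temperature, for a fixed amount of gas)

whereas a microscopic description of the same gas would need position and velocity for every one of its ~10^23 particles – of the order of 10^24 numbers – to convey the same relationship.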

Abstraction versus Virtualization?

Resource abstraction is different from resource virtualization. Whilst the latter attempts to represent a physical resource with a “virtual” equivalent – this equivalent emulating the attributes of the underlying entity – resource abstraction masks the complexity of the entity (physical or virtual), representing the resource via a simplified description. Resource abstraction and resource virtualization are thus orthogonal, complementary, and may well be used together.
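A toy sketch of the distinction, with every type and method hypothetical:

```java
/** Virtualization: a virtual machine reproduces the full attribute set of a physical one. */
interface VirtualMachine {
    int cpuSockets();
    int coresPerSocket();
    long memoryBytes();
    String firmwareVersion();
    String macAddress();
    // ...and so on: every attribute of the underlying entity, faithfully emulated.
}

/** Abstraction: the same resource, physical or virtual, described only by what we need to ask of it. */
interface ComputeResource {
    boolean satisfies(String requirementFilter); // e.g. "(&(cpu.cores>=4)(region=EU))"
}
```

The first interface faithfully reproduces complexity; the second hides it behind the questions we actually want answered.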

To Conclude

  • As systems become increasingly distributed and composed of an ever-increasing number of moving parts, we need to step back from attempting a microscopic description of the environment, and instead describe it in terms of its emergent macroscopic characteristics.
  • We need to intelligently define the boundaries – the point at which microscopic behavior gives way to a more appropriate macroscopic view. Also don’t be surprised if several boundaries exist.
  • Dynamic service discovery / dynamic service provisioning / re-provisioning are fundamental – they are MUST HAVE core behaviors.
  • So avoid all architectures and solutions that assume a static world comprising fixed, immutable resources at known addresses; NB this includes wiring systems together via static, immutable middleware services! Unfortunately, this describes the vast majority of current software solutions, and the mindsets of the engineers that built them.

Build dynamic systems, manage them with respect to their macroscopic properties, and the management / complexity issue vanishes. Conversely, if runtime complexity is a serious issue, it’s about time you redesigned / rebuilt your systems, as no amount of traditional management software will save you.

Virtualisation: The Emperor’s New Clothes

Paremus colleagues continuing to flag articles to me, and a partially restored vigor (New Year and all that), have conspired to overcome my Q4-07 Blogger’s Block.

I’ll start by briefly paraphrasing the recent crop of virtualization articles.

Virtualization is great because…

  • You can increase data centre resource utilization.
  • You can simply restart a service on a new physical platform, should the current physical platform fail.

However there may be some dark clouds on the horizon because…

  • Resource / Dependency Management and Security are problematic
  • Operational Risk may be adversely affected

Well, I’ll go to the foot of our stairs!

Isn’t it obvious that “visible” runtime complexity is increased by current forms of virtualization? Given this, it is surely no surprise that virtualization can negatively impact manageability, OPEX and, ironically, service availability.

Increasing server utilization at the expense of increased runtime complexity seems like a poor trade; especially if you remember that complexity is proportional to the number of skilled personnel required – and so to OPEX. Judging from a recent survey by Sun Microsystems (Sun Survey), it would appear that many CIOs would agree.

Meanwhile, we’re told that:

“the IT industry will develop a new generation of management tools to address manageability and security issues created by virtualization. A great opportunity for start-ups and large IT companies alike”.

Doubt this? Then check the current datacentre virtualization hype, and the number of VC-funded companies in this market sector. Clearly the adage “Identify the Pain – and sell them the Aspirin” is still in vogue with our VC friends. Unfortunately such strategies are at best simplistic; at worst, they demonstrate both the level of stupidity only achievable via a fully qualified MBA, and the lemming behavior of the IT industry.

This time the patient (Enterprise IT) really does need more than yet another in a long sequence of expensive Aspirins.

But perhaps the established IT vendors will address the problem?

Let’s see: how many established vendors, after 15 years of client-server computing, have enterprise management frameworks that are:

  • Simple to Use
  • Cost Effective
  • Simple to deploy
  • Capable of addressing simple requirements like configuration management for software, servers, storage and networks.

It’s been a while since I’ve been involved in this area (i.e. HP OpenView, Tivoli and the like), but I suspect the answer is still the same.

So what real hope is there for extending such solutions to address the new complications posed by service virtualization?

I’ll let the reader come to their own conclusions.

Complexity Part I

Complexity – Part I: What would IT Marketing do without it?

For all its press coverage, little effort has been made in defining “Complexity” in a manner that is relevant to the modern enterprise.

So here goes…

We’ll start by imagining two abstract distributed “systems”; each system an infinite 3-dimensional lattice in a 3-dimensional space – we’ll avoid distractions caused by non-Euclidean geometry :).

However, whereas the first lattice comprises regularly spaced identical nodes, the second lattice has randomly spaced identical nodes.

Here is the crunch.

Whereas the first, regular, lattice may be simply and completely described in terms of:

* A description of a node
* A description of the offset of a selected node from your chosen co-ordinate system, and
* The 3 parameters that describe the spacing between the nodes.

In contrast, the second random lattice needs an infinite number of spacing parameters to describe the system to the same level of accuracy.

By choosing to model each system in this manner, the first system is seen as trivial, whereas the second is infinitely complex!

Now, let us assume that relative node position is not important, and that instead we use an emergent property; in this case the density (the number of nodes within a given volume of space).

Now the amount of information required to describe each system is identical, and reduces to:

* Composition of a node
* Density of nodes in a given volume of space.

Whilst “density” is only an abstract concept, it nevertheless captures important characteristics of each system with minimal information; so hiding, in the case of the random lattice, an infinite amount of structural complexity.
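Counting parameters loosely: the regular lattice needs the node description, an origin offset (3 numbers) and 3 spacings – a handful of numbers, however large the lattice. The random lattice needs the node description plus 3 coordinates per node, i.e. 3N numbers for N nodes, which diverges as N grows without bound. Described instead via the emergent property, either lattice needs only the node description plus a single number – the density ρ = N/V.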

I’ll now define System Complexity as a measure of the amount of information required to describe a System; but crucially, with respect to the System properties that are of interest to us. Furthermore, by defining/modeling a system w.r.t. relevant emergent properties, we can dramatically reduce the amount of information required to usefully describe the system. The model, representing the System and its emergent properties, isolates us from potentially vast amounts of internal structure / complexity.

Also, for a given System, the abstraction / model that describes the relevant emergent properties with the least information provides the least complex representation of that system.

Back to IT

Whilst IT professionals are no longer required to understand:
* The arrangement of silicon atoms required to produce semi-conductors
* The detailed architecture of the processor or memory chip in use
* The firmware used
* The specific considerations in an OS kernel design

The resultant distributed systems are still “complex”; complicated by the fact that they consist of many inter-dependent components and services, each of which must continue to function within a volatile runtime environment.

The response to this “complexity” can be seen in every FT/Fortune 1000 company.

* Attempts are made to lock down the runtime infrastructure, to completely describe it, and to prevent changes to it. More recently, attempts are made to virtualize / abstract the runtime infrastructure in a manner that presents an unchanging persona to the static business systems.
* Meanwhile, middleware software is treated as a strategic investment – physical silos of grid computing, ESBs and data caching are introduced into these environments, and the mandate is then made that these infrastructure services must be used.

What is wrong with this consensus approach? Quite simply, as with the random lattice example, organizations are viewing their systems – and so the associated system complexity – in the wrong frame of reference! They then attempt to address perceived complexity issues with a series of measures that actually drive up operational costs whilst impacting service agility and availability.

Enough for today – next blogging session I’ll provide, what I believe to be, the answer 😉

Venture Capitalists embrace Command Economy in preference to Free Market!

A recent article, Interesting Times for Distributed DataCentres, by Paul Strong (eBay Distinguished Research Scientist) makes a number of interesting points:

  • For Web2.0 Services to scale, you MUST back-end these onto massively horizontally scaled processing environments.
  • Most Enterprise datacentre environments are moving towards, or could be considered as, primordial Grid-type architectures.
  • What is really missing is the Data Centre MetaOperating System – to provide the resource scheduling and management functions required.

Whilst these arguments are correct, and highlight a real need, the Industry & VC response seems entirely inappropriate.

Whilst VCs and major Systems Vendors are happily throwing money into expounding the virtues of loosely coupled business models enabled by Web2.0 and all things WS-SOA, somewhat perplexingly they also continue to invest in management / virtualization / infrastructure solutions which drive tight couplings through the infrastructure stack. Examples include data centre “virtualization” or, as per my previous blog entry on the Complexity Crisis, configuration / deployment management tools.

Hence, industry investment seems to continue to favor the technology equivalent of the “command economy”, in which the next generation of distributed Grid data centre is really just one more iteration on today’s: a central IT organisation controls, manages and allocates IT resource in a rigid, hierarchical command-and-control structure. The whole environment is viewed as a rigid system which one centrally controls at each layer of the ISO stack; an approach that continues the futile attempt to make distributed environments behave like Mainframes!

What is actually needed is a good dose of Free Market Economics!

  • Business Services dynamically compete for available resources at each point in time,
  • Resources may come and go – as they feel fit!
  • Infrastructure and Systems look after their own interests, and optimise their behaviors to ensure overall efficiency within the Business Ecosystem.

Successful next generation MetaOperating Systems will heavily leverage such principles at the core of their architectures!

You simply cannot beat an efficient Market!

Complexity in the Traditional Enterprise

A new survey posted on GRID today highlights the risks associated with Infrastructure Complexity. Interesting highlights include:

  • Each hour of downtime costs Fortune 1000 companies in excess of $300,000 according to 1/3 of the survey responses.

Of course, depending on the specific industry, these figures could be much larger! Everyone tends to focus on availability/scaling issues for the new Internet-based companies (Google, Yahoo, Amazon, eBay). However, if you want to see real risk, consider the impact on some of the core systems that support global Banking / Financial systems.

  • Troubleshooting a problem can take more than a day, according to 1/3 of survey responses.

So if these are the same guys that have the $300,000 an hour loss – the figures are starting to mount up.

  • Change Management for Fortune/FT 1000 companies occupies 11 full-time people!
  • Installation and configuration of core applications is a major resource sink, taking 4 days to configure a complete application infrastructure stack.

The report then goes on to justify change management / configuration management products. The implication being that to address the complexity issues, these Fortune/FT 1000 companies need to purchase and configure yet more enterprise software?

So Layering Complexity upon Complexity!!

I wonder: just what is the production impact if, after all this automation, you lose the systems that are doing the automation and configuration? I suspect recovery would be significantly longer than 1 working day!

The truth of the matter is that Enterprise Systems – including those based upon the latest ESB, Grid and WS-SOA marketectures – are the root cause of the explosive increase in Complexity.

Each of these approaches implicitly assumes that:

  • The compute resource landscape is static,
  • Software functionality is static
  • Provisioning is thought of as a one time event, and
  • Failure is treated as an exception.

Whereas in reality:

  • Compute resource landscape is dynamic
  • Software functionality needs to evolve and adapt
  • Provisioning is an on-going process – driven by continual re-optimisation against the shifting compute landscape and by recovery from failure.

So how do these Fortune/FT 1000 companies dig themselves out of their current Complexity Crisis?

By building the correct IT foundations for their businesses! Fortune 1000 companies need to implement Enterprise-wide solutions where configuration, adaptation and recovery are core design features. Systems should configure, deploy and maintain themselves as part of what they do (by way of an example, see Infiniflow)! Such solutions will also heavily leverage industry trends towards modularization via OSGi & SCA.

Whether you are the CIO of a Global Bank, a Gaming company or a Telecoms company, once the correct technology foundations have been put in place – no easy task – significant OPEX savings WILL follow. However, take the easy route – fail with the foundations, avoid necessary change – and no amount of management, configuration or deployment software band-aid will save you!

The “Hidden Costs of Virtualization”

An interesting article arguing that, whereas OS virtualisation is sold on the cost savings achieved by higher CPU utilisation of existing resource, several cost factors seem to be overlooked. The most important of these is that operational costs scale with the number of OS instances; it is immaterial whether these instances are real or virtual. The article also points out that commercial OS virtualization software is not cheap – around $20,000 for VMware ESX on a 4-way Intel box – though open source solutions should in due course pull this pricing down.

However, I’d suggest that there are a number of additional considerations.

OS virtualisation, in itself, does nothing to address the inherent complexity issues within modern enterprise environments. Instead of a sprawl of physical machines with poorly managed applications and configurations, one can now extend these complexity and management issues into a virtualized resource space! Obviously, OS virtualization management is needed – and is indeed provided, as commercial products, by the virtualization vendors.

In many respects, are we not back to where we started? Sure, we can now drive up CPU utilisation, but the runtime infrastructure is more complex than ever. Meanwhile business applications are still as brittle, as tightly coupled, and as change-resistant as they ever were!

Also, is increased CPU utilisation, at the cost of increased complexity, a good trade?

Driving up CPU utilization has got to be good, right? Indeed, many CIOs want to make dramatic OPEX cost savings by driving CPU utilisation to ~100%.

Yet, whilst running a large datacentre’s CPU resource at single-figure utilisation levels is an obvious cost issue, what seems to be overlooked are the issues associated with running resources at near maximum utilisation.

Load volatility is an obvious concern. If you achieve, on average, 80% utilization across your resource population, just how do you cope with peaks that require, say, 50% more resource? The standard response may be to outsource the extra resource requirement to a third-party utility compute supplier. Yet, whilst frequently discussed by the industry, I’m not aware of many such over-capacity deals; quite the contrary, early entrants into the Utility Compute Market have recently dismantled facilities due to lack of commercial interest.

Yet there is a more important issue; namely operational risk. Evidence suggests that compute resource under excessive load is statistically more likely to experience software failure (reference). Moreover, any complex, tightly coupled system may suffer cascading failures; i.e. an initial component failure cascading into a major system outage.

Hopefully, such cascading failures are the exception; however, component failure within a heavily loaded environment will always be more intrusive than for a lightly loaded equivalent as, depending upon the priority of the lost service/component, other running services may need to be terminated to free sufficient resource.

Hence, Operational Risk and Data Centre Resource Utilization are issues that are fundamentally linked; linked by the sizes of the potential failure domains within the system.

Failure domains may be defined by / mapped to:

  • Physical locality / Physically shared resource.
  • Hardware type
  • Software type / version
  • Management / Security domains

Considering “physical” failure domains, the following usually exist:

  • A data centre facility (complete power failure, halon release, collapse of network infrastructure).
  • A shared PDU (possibly affecting 25% of data centre resource – assuming critical systems like SAN storage and IP networking are wired into at least two PDUs).
  • A network switch failure – perhaps impacting 50 servers if each is single-homed.
  • A cabinet failure – perhaps affecting the 100 processors in that cabinet.
  • A single physical machine.

Without spare usable capacity, re-provisioning the OS instances lost to a single cabinet, or even machine, failure may prove challenging.

Perhaps we need to think about things in a slightly different way?

Conventional Wisdom:

Large datacentres are running out of space and / or are limited by environmental considerations (power, air-conditioning requirements). Usually there is little possibility of building a secondary large datacentre facility within the Metro area, because costs can be substantial and suitable real estate is not available. Hence: virtualize compute instances and maximally use what resource is already there.

The Alternative:

Instead of a single large datacentre facility, adopt modular datacentres, distributed over a larger geographic region. The largest failure domain we care about is then a datacentre module instance, so for ‘N’ active modular datacentres we need ‘N+1’ to allow for the complete failure of any one instance. By modularizing and virtualizing the datacentre, we actually increase the ability to use spare CPU resource per datacentre instance, without increasing operational risk.
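The arithmetic behind ‘N+1’ is worth spelling out: with N active modules and one spare, the usable fraction of the deployed estate is N/(N+1). A 4+1 arrangement therefore allows planned utilisation of around 80% while still surviving the loss of an entire module, whereas a classic 1+1 primary-plus-DR pair caps planned utilisation at 50%.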

In this respect, Sun’s Black Box strategy was an interesting marketing move. I say “marketing” as, in reality, the hardware required to realise a modular datacentre is the easy bit! The difficulty lies in the unstated, but very real, requirement for a distributed and extremely robust composite application resource broker that seamlessly knits these modular resources together into one robust, distributed Enterprise/Utility Service Fabric.