High uptime equates to low mean time to recovery

I have often heard people discuss, and then try to measure, the mean time between failures (MTBF) of components in their architectures.  While this may be an interesting exercise, it is typically misguided.

The focus needs to be on measuring the mean time to recovery (MTTR) after a failure.  Assume the failure.  Going through this exercise very quickly puts the critical issues at the forefront.

For example, one may have the following analysis:

  • Java OOM (out of memory): MTBF = 1 week, MTTR = zero (the load balancer redirects traffic to another node and the OOM node is restarted automatically).
  • One critical data pipe with a 99.9% SLA.  Let's say that MTBF = 5 years.  This means that if and when the downtime occurs, say once every 5 years, you could theoretically be down for more than 40 hours (0.1% of 5 years is about 43.8 hours) and still be within the 99.9% SLA.

Would you be OK with an MTTR of 40 hours?  Probably not.  Imagine that this 0.1% failure case happened tomorrow, and that all is good for the following 5 years.
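The arithmetic behind that downtime budget can be checked with a minimal sketch.  The function name `allowed_downtime_hours` and the 365-day year are assumptions made here for illustration; a real SLA may define its measurement window differently.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: leap days are ignored for simplicity


def allowed_downtime_hours(sla: float, years: float) -> float:
    """Hours of downtime an availability SLA permits over the given period."""
    return (1.0 - sla) * years * HOURS_PER_YEAR


budget = allowed_downtime_hours(sla=0.999, years=5)
print(f"A 99.9% SLA over 5 years allows about {budget:.1f} hours of downtime")
```

Nothing in the SLA says how that budget is spread out.  It can all be spent in a single outage.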

Typically, we deal with what we see and feel.  In this case the weekly out-of-memory issue would get all the attention.  However, beware of the silent but business-crushing high MTTR.

Go through an MTTR exercise for ALL aspects of your architecture and business.  The path to high uptime will become clear.
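One way to start such an exercise is to list each failure mode with its MTBF and MTTR, then compare what tops the list by frequency with what tops it by recovery time.  The sketch below uses only the two failure modes from the analysis above.  The `FailureMode` class and its field names are hypothetical, and the 40-hour figure is the worst case the SLA would permit, not a measured value.

```python
from dataclasses import dataclass

HOURS_PER_WEEK = 7 * 24
HOURS_PER_YEAR = 365 * 24


@dataclass
class FailureMode:
    name: str
    mtbf_hours: float  # mean time between failures
    mttr_hours: float  # mean time to recovery (worst plausible case)


failure_modes = [
    FailureMode("Java OOM", mtbf_hours=HOURS_PER_WEEK, mttr_hours=0.0),
    FailureMode("Critical data pipe", mtbf_hours=5 * HOURS_PER_YEAR, mttr_hours=40.0),
]

# What people tend to notice: the failure that happens most often.
by_frequency = sorted(failure_modes, key=lambda f: f.mtbf_hours)
# What actually hurts uptime: the failure that takes longest to recover from.
by_recovery = sorted(failure_modes, key=lambda f: f.mttr_hours, reverse=True)

print("Most frequent failure: ", by_frequency[0].name)
print("Slowest recovery:      ", by_recovery[0].name)
```

The two orderings disagree, which is the point: the failure that gets the attention is not the one that threatens the business.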

Are SaaS offerings less reliable than on premise solutions?

I contend that Software as a Service (SaaS) offerings are not less reliable than on premise solutions as measured by uptime.  As an example, let's compare the following scenarios.

Scenario 1: SaaS

Let's assume that a SaaS provider has 1,000 customers and a track record of 1 severe outage every year, lasting 30 minutes on average.  All customers are impacted since the outage is severe.  Therefore every customer experiences 30 minutes of downtime every year.

Scenario 2: On Premise

Let's assume that the same 1,000 customers each run an on premise system (the same or a similar solution, in complexity and offering, as in the SaaS scenario).  The following questions need to be answered:

  1. What is the likelihood that a particular customer's on premise system will experience an outage as severe as the SaaS provider's?  Let's say that one believes it to be 0.50 per year, so a system is half as likely to go down on premise as in the cloud.  On average, 500 of the 1,000 customers will then experience the outage in a given year.
  2. Now here is the key: even if you believe the above to be true, what is the likelihood that the on premise IT support staff are trained in high availability and have failover mechanisms (other than a restart), a NOC, etc. for the on premise solution(s), so that they can bring the system back up within 30 minutes?  I would contend that this will become a multi-hour outage, as the large majority of internal IT staff are just not trained to respond to emergencies.  In addition, they do not have the tools at their disposal to deal with severe outages, such as data center failovers and traffic redirection.  Nor do they know the product as intimately as the SaaS provider does.

So what does this all mean?

  • Even if you believe that SaaS offerings may be less reliable, you had better take notification times and restore times into account when comparing the two options.
  • Measure the operational capabilities of your organization.  It is very likely that the emergency response procedures for the on premise solution have not been defined or practiced.
  • Even if you think that the on premise system will be 2 times more reliable than the SaaS offering, that advantage means nothing if your combined notification and restore time is more than twice that of the SaaS provider.
  • Uptime should be considered within the context of an organization's own operational abilities.

The above quantified

The following key parameters need to be thought through and measured over a defined period of time.

(Probability of Severe Outage) * (Notification Time  + Restore Time) = Expected Downtime

Plugging in numbers, one may get:

(1) * (1 min + 29 min) = 30 minutes, for SaaS

(0.5) * (20 min + 180 min) = 100 minutes, for on premise

In the above example the on premise system is more reliable than the SaaS offering (half as likely to suffer a severe outage), yet it has lower uptime: 100 minutes of expected downtime versus 30.  Even if one gives the on premise system the edge in reliability (debatable), that does not automatically result in higher uptime.  Restore times (failovers) are the key drivers of high uptime.
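The formula is simple enough to put into a few lines of code and rerun with your own estimates.  This is a minimal sketch: the function names are hypothetical, and the inputs are the illustrative figures used above, not measurements.

```python
def expected_downtime_minutes(p_outage: float, notify_min: float, restore_min: float) -> float:
    """(Probability of Severe Outage) * (Notification Time + Restore Time)."""
    return p_outage * (notify_min + restore_min)


def break_even_response_minutes(p_outage: float, target_downtime_min: float) -> float:
    """Largest notification + restore time that still meets the target expected downtime."""
    return target_downtime_min / p_outage


saas = expected_downtime_minutes(p_outage=1.0, notify_min=1, restore_min=29)
on_premise = expected_downtime_minutes(p_outage=0.5, notify_min=20, restore_min=180)

print(f"SaaS:       {saas:.0f} minutes expected downtime")
print(f"On premise: {on_premise:.0f} minutes expected downtime")

# How fast must the on premise team notify and restore just to match SaaS?
limit = break_even_response_minutes(p_outage=0.5, target_downtime_min=saas)
print(f"On premise must notify and restore within {limit:.0f} minutes to break even")
```

With an outage probability of 0.5, the on premise team has 60 minutes in total to detect the outage and restore service before the reliability advantage is gone.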
