In many instances I have heard people discussing and then trying to measure the mean time between failures (MTBF) of components in their architectures. While this may be an interesting exercise, it is typically misguided.
The focus needs to be toward measuring the mean time to recovery (MTTR) after failure. Assume the failure. Going through this exercise very quickly puts the critical issues at the forefront.
For example, one may have the following analysis:
- Java OOM (out of memory): MTBF = 1 week. MTTR = zero (the load balancer redirects traffic to another node and the OOM node is restarted automatically).
- 1 critical data pipe with a 99.9% SLA provided. Let's say that MTBF = 5 years. This means that if and when the downtime occurs, let's say once every 5 years, you could theoretically be down for roughly 40 hours and still be at a 99.9% SLA.
Would you be OK with an MTTR of 40 hours? Probably not. Imagine that this 0.1% failure case happened tomorrow and then for the next 5 years all is good.
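To put numbers on the two cases above, here is a minimal sketch using the common steady-state approximation availability = MTBF / (MTBF + MTTR). The formula and the function name `availability` are my assumptions for illustration (the analysis above does not spell out a formula), and MTBF is treated here as the mean uptime between failures.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: ignore leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, treating MTBF as mean uptime between failures."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Java OOM: fails weekly, but the load balancer makes recovery effectively instant.
oom = availability(mtbf_hours=7 * 24, mttr_hours=0.0)

# Data pipe: fails once in 5 years, but takes 40 hours to recover.
pipe = availability(mtbf_hours=5 * HOURS_PER_YEAR, mttr_hours=40.0)

print(f"OOM node:  {oom:.4%}")
print(f"Data pipe: {pipe:.4%}")
```

The weekly failure comes out at full availability, while the once-in-5-years failure still clears 99.9% on paper even though it means 40 straight hours of outage.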
Typically what happens is that we deal with what we see and feel. In this case the weekly out-of-memory issue would be getting all the attention. However, be careful of the silent but business-crushing high MTTR.
Go through an MTTR exercise for ALL aspects of your architecture and business. The path to high uptime will become clear.
Would it not be better to accept both metrics but apply weights accordingly? MTBF is very useful in making hardware purchase decisions.
I would agree that MTBF can be useful in comparing like hardware components.
If the component is a part of a High Availability Architecture then the solution with the lower MTTR would win.
I think you made my point exactly: MTTR is generally not considered while making hardware purchasing decisions, and it should be, especially for HA.
SLAs are usually measured in terms of availability over a set time (1 month or 1 year). You don’t “roll over” unused SLA outages, so your outage time after 5 years would be about 8 hours, not 40.
If this is a major concern, then you should be placing servers in locations that can be served from multiple carriers over divergent paths.
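The point about measurement windows can be checked with a few lines of arithmetic. This is only a sketch: the helper name `downtime_budget_hours` and the 30-day month are assumptions for illustration, not terms from any actual SLA contract.

```python
def downtime_budget_hours(sla: float, window_hours: float) -> float:
    """Maximum downtime allowed within one measurement window at the given SLA."""
    return (1.0 - sla) * window_hours


SLA = 0.999  # the 99.9% figure from the post

# Assumed window lengths: a 30-day month and 365-day years.
windows = {
    "30-day month": 30 * 24,
    "1 year": 365 * 24,
    "5 years": 5 * 365 * 24,
}

budgets = {name: downtime_budget_hours(SLA, hours) for name, hours in windows.items()}

for name, hours in budgets.items():
    print(f"{name:>12}: {hours:6.2f} hours")
```

Under these assumptions a yearly window allows a little under 9 hours and a monthly window well under 1 hour, while only a single 5-year window gets to the 40-hour range.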
This is just wrong. Low MTTR is part of the plan, but the notion that MTBF doesn’t matter is nonsense. MTBF, MTTR, and a solid SLA are all parts of an equation. It’s my opinion that you improve availability from “miserable” to “pretty respectable” by starting with a focus on reducing the incidence of easily prevented failures (which is accomplished by an emphasis on MTBF). Eventually you may start getting close to the metal and be at a theoretical limit on failure incidence. At that point the only way to squeeze higher uptime out of a system is to reduce the time you spend in the breaks that you aren’t able to prevent. Until you’re nearing that point, it’s definitely not a mistake to look at MTBF, and it may be a mistake to ignore it.
Thanks for the comments Patrick.
MTBF starts to lose its value very quickly as soon as one starts to manage a service that is successful and therefore needs to be highly available. Take the following example:
1) Let's say you have reliable servers, along with the software on them, with an MTBF of 5 years. (Therefore a server will crash after ~1800 days of operation.)
2) And let's say that you are managing 250 such servers.
3) This would mean that you will have one server crash every week! (250*7 is approximately 1800 operational server days.) You just don’t know which server, so you had better be ready to handle this crash and hope that it does not cause downtime.
If one just focused on MTBF and got servers that were twice as reliable, in the above example you would still have a server crash and a possible outage every other week.
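The fleet arithmetic above can be sketched as follows. It assumes failures are independent and occur at a constant average rate, so that the expected number of crashes is simply server-days divided by MTBF; the function name `fleet_failures_per_week` is made up for this illustration.

```python
def fleet_failures_per_week(servers: int, mtbf_days: float) -> float:
    """Expected crashes per week across the fleet (assumes independent, constant-rate failures)."""
    server_days_per_week = servers * 7
    return server_days_per_week / mtbf_days


# 250 servers, each with an MTBF of 5 years.
base = fleet_failures_per_week(servers=250, mtbf_days=5 * 365)

# Same fleet with servers that are twice as reliable (MTBF of 10 years).
doubled = fleet_failures_per_week(servers=250, mtbf_days=10 * 365)

print(f"MTBF 5 years:  {base:.2f} crashes per week")
print(f"MTBF 10 years: {doubled:.2f} crashes per week")
```

Doubling the MTBF halves the expected crash rate, which still leaves roughly one crash every two weeks to recover from.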
This leads us back to the original point that for highly available systems it is MTTR that matters. MTBF just proves that failures will occur. If you have an MTTR of 0, it does not matter much whether the MTBF is 1 year or 10 years. The failure does not interrupt service, and in the end that is all that matters.
The only minor exception I would make to the above is for components where the MTTR is not 0. In that case one could try to focus on ensuring that the MTBF is as high as possible for those nonzero-MTTR components.
Though I would contend that the time spent on getting a higher-MTBF component would be better spent on reducing the MTTR of that component to 0.
Until the MTBF gets to infinity for a component, a 0 MTTR approach will always win out.