In asset-heavy industries, failures are inevitable. Things fall out of alignment, seize, break. But by converting failures into hard, actionable numbers, you can keep them smaller, less frequent, and more manageable.
Before looking at how to measure failure and convert it into metrics, we need to make sure we really understand what failure is. Failure has nuance.
Partial vs complete failures
Generally, failures can be divided into two types, partial and complete. With a partial failure, the asset might still work, but it's not going to be working well. If it's a giant press for pots, for example, the pots might no longer come out exactly the right shape. Or, they might be coming out the right size and shape, but at the wrong speed, throwing everything off down the line. In the case of a complete failure, however, the asset stops working altogether. The press just stops pressing.
Are partial failures better than complete failures? It depends on the situation, but it can easily be argued that pressing a bunch of slightly-wrong pots is worse than pressing none at all. At least with a complete failure, with the line coming to a screeching halt, you know there's a problem and can fix it.
There's one more important difference between partial and complete. Complete failures are like being asleep: you are or you aren't. But partial failures exist along a spectrum. They're the same as being tired. You can be everything from a bit sleepy to dead on your feet.
Partial vs complete, a simple example
Consider a bicycle. We can say a complete failure is when the bike's chain slips the gears and comes off all the way. No matter how hard and fast you pedal, you're not going anywhere. But what if just the chain guard comes off? In that case, the bike still works and you might not even realize there's a problem. Moving along the spectrum, we can see failures that are more obvious but still only partial. Imagine someone's gone and stolen the bicycle's seat. With a bit of determination and balance, you can still ride the bike by standing up on the pedals. It's not a complete failure; it's still only partial.
Now that we understand failure, let's look at two important metrics, MTBF (mean time between failures) and MTTF (mean time to failure). Just before we do that, though, let's remember that there's actually a third metric, MTTR (mean time to repair), which is equally important. We already looked at it in great detail in our blog discussing MTTR. I've included some of the highlights below, but it's worth your time to go and read the earlier post and then come back.
What is MTTR?
Mean Time To Repair (MTTR) is a metric that measures how efficiently the maintenance department gets assets back up and running.
How to calculate Mean Time To Repair (MTTR)
The first thing you need to know is how much time was spent repairing an asset over a set period. Say you have a press with a tricky motor. Over a week, you spend a total of four hours working on it. The first time you work on it for an hour and a half. Then the second time you need another two and a half hours. Something to remember: In this specific case, the lengths of time to repair the asset are fairly similar. This does not have to be the case. You can still use MTTR with very different repair times. So, on another asset, the first time you fixed it, you needed thirty minutes. The second time, three hours. Third time, two days.
It's fine if the lengths of time are very different from one another. But, the people doing the repairs need to be roughly the same in terms of ability and preparation. What you want to know is how long a properly trained professional using a clear set of instructions takes to complete the repairs. If some of the data you're collecting is from a new hire working on an asset without an O&M manual, you're not going to end up with a useful result.
Next, take the total amount of time (which we already said was four hours) and divide it by the number of times you worked on the asset (which we said was two). Your MTTR is two hours.
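The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name mean_time_to_repair and the list of repair durations are made up for this example, and in practice the numbers would come from your own work order records.

```python
def mean_time_to_repair(repair_hours):
    """Return total repair time divided by the number of repairs.

    repair_hours is a list with one entry per repair, in hours.
    (Hypothetical helper, named for this example only.)
    """
    if not repair_hours:
        raise ValueError("need at least one repair to calculate MTTR")
    return sum(repair_hours) / len(repair_hours)


# The press from the example: one repair of 1.5 hours, one of 2.5 hours.
press_repairs = [1.5, 2.5]
press_mttr = mean_time_to_repair(press_repairs)
print(f"MTTR: {press_mttr} hours")
```

The same function works when the repair times are very different from one another, since it only needs the total and the count.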
Generally, you want this number to be as small as possible, so once you have it, you can start to look for concrete steps to shrink it.
For example, you might start to think about staffing. Maybe you need more people overall or just more people with specific skill sets. Additional training for current staff might be an option. MTTR is also often used to evaluate which spare parts to keep onsite and to set par levels. If things are taking too long to repair, it could be because tracking down the required parts is taking too much time and effort.
MTTR can even be helpful when deciding to repair or replace an asset. Over the useful life of an asset, the MTTR will trend up. Older assets take more time to repair because their failures tend to be more serious. By looking at the changes to its MTTR over time, the front office can better decide when an asset needs to be replaced or if it makes more sense to keep asking the maintenance department to repair it.
The front office can also use MTTR to make better decisions about which new assets to buy. One growing trend for assets is modular design. Imagine you have to fix one tiny spring in an old wristwatch. Just think of how carefully you would need to take the watch apart, replace that one broken piece, and then put everything back together. It's a nightmare. But if that same watch had a more modular design, when you opened it up, there would be only three "pieces." Inside each piece would be all the same little screws, springs, and whatnots you'd find in a regular watch, but here they'd be housed in easily removed and replaced compartments. When the front office has its eye on MTTR, it's more likely to buy modular assets that are easier to repair, which directly benefits the maintenance department.
What is MTBF: Mean Time Between Failures?
This metric is used to determine reliability. Basically, it tells you how long, on average, an asset runs before it needs to be repaired. The phrase "be repaired" is key here: you only calculate MTBF for assets that can be fixed. For things that can only ever be replaced, for example light bulbs, you use a different metric.
How to calculate Mean Time Between Failures (MTBF)
You need three things: the total length of the period you're measuring, the number of times the asset failed, and the amount of time it took to repair after each failure. Subtract the repair time from the total to get the number of hours the asset was actually in operation, then divide that by the total number of failures.
One thing you don't need: the amount of time the asset was offline because of preventive maintenance. Calculating MTBF does not include the time you spent trying to avoid problems.
Let's look at a simple example. Say you have a press that ran for 24 hours. During that time it failed twice, and each time it took an hour to get it back up and running. So, it was in operation for a total of 22 hours (24 hours minus the two hours it took for repairs). Twenty-two divided by two, the total number of failures, equals 11. Not a great asset, really, because on average it's going to fail every 11 hours. That's not good.
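Here is the same press example as a short Python sketch. The function name mean_time_between_failures and its arguments are assumptions made for this illustration, not part of any particular CMMS. It treats each entry in the repair list as one failure.

```python
def mean_time_between_failures(period_hours, repair_hours):
    """Return operating hours divided by the number of failures.

    period_hours: total length of the period being measured, in hours.
    repair_hours: list with one repair duration (in hours) per failure.
    (Hypothetical helper, named for this example only.)
    """
    failures = len(repair_hours)
    if failures == 0:
        raise ValueError("no failures recorded, so MTBF can't be calculated")
    # Time spent on repairs doesn't count as time in operation.
    operating_hours = period_hours - sum(repair_hours)
    return operating_hours / failures


# The press from the example: a 24-hour period, two failures,
# one hour of repair after each.
press_mtbf = mean_time_between_failures(24, [1, 1])
print(f"MTBF: {press_mtbf} hours")
```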
Value of MTBF
But don't throw that press out just yet. Generally, when you have a low MTBF, it can be traced back to either operator error or issues with how the asset is being repaired. You can likely improve MTBF with additional training and closer oversight.
Not only does MTBF expose issues with past use and repair, but it also helps set up your preventive maintenance schedule for the future. If you know an asset, on average, fails every 100 hours, you can set PMs at every 90 hours. That way you're getting the most bang for your PM buck.
What is MTTF: Mean Time To Failure?
Here again, we're looking at reliability, but now it's for things that can't be repaired. They can only be replaced. The easiest example is light bulbs.
How to calculate MTTF (Mean Time To Failure)
When we looked at MTBF, all the numbers were from one asset. But for MTTF, we need a group of identical failed items. Going back to our basic example, light bulbs, we might have four burned out bulbs, and they ran for 20, 22, 26, and 18 hours respectively. We add up those numbers and get 86. When we divide that by the number of bulbs, which was four, we get a Mean Time To Failure of 21.5 hours.
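The light bulb calculation can be sketched the same way in Python. The function name mean_time_to_failure is an assumption for this example; the input is simply the list of how long each identical item lasted before it failed.

```python
def mean_time_to_failure(lifetimes_hours):
    """Return the average lifetime of a group of identical failed items.

    lifetimes_hours: list with one running time (in hours) per failed item.
    (Hypothetical helper, named for this example only.)
    """
    if not lifetimes_hours:
        raise ValueError("need at least one failed item to calculate MTTF")
    return sum(lifetimes_hours) / len(lifetimes_hours)


# The four burned-out bulbs from the example.
bulb_lifetimes = [20, 22, 26, 18]
bulb_mttf = mean_time_to_failure(bulb_lifetimes)
print(f"MTTF: {bulb_mttf} hours")
```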
Value of MTTF
Looking at our MTTF for the light bulbs, we can see right away you're going to need to switch brands, which is really all you can ever do when you have a low MTTF. You can only improve your results by buying better quality products. Mean Time To Failure is the "you get what you pay for" metric.
MTTF also helps you better manage inventory. If you decide to stay with these awful light bulbs, at least you'll know to keep a lot of them in onsite inventory. Later, if you decide to switch to a better bulb, you know you can reduce carrying costs by keeping fewer of them around.
But the real power of MTTF is what it can tell you about the reliability of bigger, more complex assets. In fact, the Mean Time To Failure for a small part inside a large asset can have a huge effect on that asset's reliability. Think about your car. What happens when one of the interior lights burns out? Aside from some minor inconvenience, nothing. But what about the fan belt? Like the light, it falls under the MTTF metric because it can't be fixed, only replaced. But because the car can't run without the fan belt, the fan belt's MTTF can be more important than the car's MTBF when determining the car's overall reliability.
You can only really start to use failure metrics once you have a rock-solid data-collection system in place. Luckily, the easiest way to do that is with equipment maintenance software or work order software. If you don't have a CMMS yet, now's the perfect time to look into getting one. Older versions required huge upfront investments in IT infrastructure and licensing contracts. Not only that, the software tended to be hard to learn and temperamental. But a good CMMS today is easy to learn and easy to use, offering a clean, intuitive interface and go-anywhere accessibility. Providers use cloud-based computing to make sure your data stays secure. And it's always your data. Good providers are just babysitting it for you, and you can have it back whenever you ask for it.