MTBF, Reliability, and Availability Prediction Workbooks for Redundant Systems

Larry George, Problem Solving Tools, Oct. 22, 2102

The traditional parts-count MTBF (Mean Time Between Failures) prediction is

$$\text{MTBF} = \frac{1}{\sum_j (\text{failure rate of part } j) \times (\text{fudge factors})},$$

where the sum is over all parts [MIL-HDBK-217F Appendix A, Bellcore, Telcordia, and others]. Such MTBF predictions tacitly assume:

Series systems: failure of any part causes system failure

Independent lives: $P[\text{Series system life} > t] = \prod_j P[\text{Life of part } j > t]$

Constant failure rate: failure rate $= 1/\text{MTBF}$, or $P[\text{Life} > t] = \exp(-t/\text{MTBF})$
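As a minimal sketch of the traditional calculation (Python), assuming part failure rates are given in FITs and that each part's fudge factors have already been multiplied into one number (the hypothetical `fudge` argument, not a handbook formula), the three assumptions above reduce system reliability to a single exponential:

```python
import math

def parts_count_mtbf(fits, fudge=None):
    """Parts-count MTBF in hours: 1 / sum(part failure rate * fudge factor).

    fits  -- failure rate of each part in FITs (failures per 1e9 hours)
    fudge -- hypothetical combined adjustment factor per part; defaults to 1.0
    """
    if fudge is None:
        fudge = [1.0] * len(fits)
    rate_per_hour = sum(f * 1e-9 * a for f, a in zip(fits, fudge))
    return 1.0 / rate_per_hour

def parts_count_reliability(t, mtbf):
    """P[Life > t] under the series, independence, and constant-rate assumptions."""
    return math.exp(-t / mtbf)

# Illustrative parts list (FITs), all parts in series, no redundancy.
fits = [100, 2000, 3200, 2400]
mtbf = parts_count_mtbf(fits)
print(f"MTBF = {mtbf:.0f} hours")
print(f"R(1 year) = {parts_count_reliability(8760, mtbf):.4f}")

# Same reliability as the product of the independent part reliabilities.
print(math.prod(math.exp(-f * 1e-9 * 8760) for f in fits))
```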

Despite criticism of MTBF predictions and their lack of credibility [George, June 2001; Jones and Hayes], people continue to compare MTBF predictions. So let’s make MTBF, reliability, and availability predictions for systems with redundancy consistently and correctly, using established probability methods.

The traditional parts-count MTBF prediction is incorrect if the system has redundancy, because the failure rate of a redundant subsystem is not constant and its reliability is not an exponential function of age. Telcordia (Bellcore) TR-332 provides no help; it says, “If the serial model is inappropriate, a suitable reliability model must be developed.” We can still predict MTBF, reliability, and availability without the series and independence assumptions.
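To see why the rate is not constant, consider the simplest redundant subsystem: two identical parts in parallel, each with constant failure rate lam and no repair. This Python sketch (the function name is illustrative) evaluates the subsystem failure rate h(t) = f(t)/R(t) at several ages; it starts at zero and climbs toward the part rate, so no single 1/MTBF describes it.

```python
import math

def parallel_pair_hazard(t, lam):
    """Failure rate h(t) = f(t)/R(t) of two identical, independent parts in
    parallel (one of two required), each with constant failure rate lam.

    With q = exp(-lam*t): R(t) = 2q - q**2 and f(t) = 2*lam*(q - q**2).
    """
    q = math.exp(-lam * t)                  # single-part reliability
    return 2 * lam * (1 - q) / (2 - q)

lam = 1000e-9                               # 1000 FITs, failures per hour
for t in [0, 1e5, 1e6, 1e7]:
    ratio = parallel_pair_hazard(t, lam) / lam
    print(f"age {t:>10.0f} h: subsystem rate = {ratio:.3f} x part rate")
```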

A redundant system usually provides greater reliability and availability than a series system. Redundancy methods include:

Parallel: usually independent and identical parts

K-out-of-n: at least k out of n parts must work for the system to work

Cold standby: parts are nominally in parallel, but their lives are sequential and, if lucky, parts don’t deteriorate in standby

Networks: more elaborate arrangements than parallel or k-out-of-n, such as a Wheatstone bridge (distributed processing seems equivalent to parallel processing, with overhead in series with the parallel processors)
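For a rough comparison of these redundancy methods, here is a Python sketch assuming independent, identical parts with constant failure rate lam, so one part survives to age t with probability p = exp(-lam*t). The cold-standby function assumes perfect switching and no deterioration in standby (the lucky case), and the bridge is evaluated by brute-force enumeration of its 32 part states. All names are illustrative, not the workbook's.

```python
import itertools
import math

def k_of_n(p, k, n):
    """P[at least k of n independent, identical parts work]; p = part reliability."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def cold_standby(lam, t, n):
    """n identical parts used one after another, with perfect switching and
    no deterioration in standby: P[fewer than n failures by age t]."""
    x = lam * t
    return math.exp(-x) * sum(x**i / math.factorial(i) for i in range(n))

def bridge(p):
    """Wheatstone bridge of 5 identical parts, by enumerating all 2**5 states.
    Parts 0 and 1 leave the source, 3 and 4 enter the sink, 2 is the bridge."""
    paths = [{0, 3}, {1, 4}, {0, 2, 4}, {1, 2, 3}]     # minimal path sets
    total = 0.0
    for state in itertools.product([0, 1], repeat=5):
        up = {j for j, s in enumerate(state) if s}
        if any(path <= up for path in paths):
            total += math.prod(p if s else 1 - p for s in state)
    return total

lam, t = 1000e-9, 2e5            # 1000 FITs per part, 200,000 hours
p = math.exp(-lam * t)           # reliability of one part at age t
print(f"single part     {p:.4f}")
print(f"1-of-2 parallel {k_of_n(p, 1, 2):.4f}")
print(f"2-of-3          {k_of_n(p, 2, 3):.4f}")
print(f"cold standby x2 {cold_standby(lam, t, 2):.4f}")
print(f"bridge          {bridge(p):.4f}")
```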

Can we simply eliminate redundant subsystems from an MTBF prediction, because redundancy improves reliability so much? You could, but although a k-out-of-n subsystem has a lower failure rate than n times the part failure rate (which is what the traditional parts-count prediction would give), it can have a higher failure rate than a single part. Don’t eliminate redundant subsystems from MTBF predictions; even though they usually make systems more reliable, redundant subsystems may fail. Use the workbooks described in this article.
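The comparison can be made concrete for identical parts with constant failure rate lam and no repair, where the k-out-of-n mean life has a closed form. This Python sketch (hypothetical function name) prints 1/MTBF as a multiple of the part failure rate next to the parts-count value of n times the part rate:

```python
def k_of_n_mtbf(lam, k, n):
    """Mean life (no repair) of a k-out-of-n subsystem of identical,
    independent parts with constant failure rate lam.

    While i parts are working, the time to the next failure is exponential
    with mean 1/(i*lam); the subsystem fails when only k-1 remain, so
    MTBF = (1/lam) * (1/n + 1/(n-1) + ... + 1/k).
    """
    return sum(1.0 / (i * lam) for i in range(k, n + 1))

lam = 1000e-9                                   # 1000 FITs per part
for k, n in [(1, 2), (1, 5), (2, 3), (2, 4), (3, 4)]:
    ratio = (1.0 / k_of_n_mtbf(lam, k, n)) / lam
    print(f"{k}-out-of-{n}: 1/MTBF = {ratio:.3f} x part rate "
          f"(parts count: {n} x part rate)")
```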

Could we add the failure rates (1/MTBF) of redundant subsystems in series to obtain system MTBF predictions [NASA]? We could, but unfortunately the resulting MTBF and reliability will be wrong and not necessarily conservative. The failure rate of a redundant subsystem isn’t constant, even if its parts have constant failure rates, so it can’t be summarized as 1/MTBF. How should you make an MTBF prediction for a redundant system in series with another system, such as the system in figure 1? Don’t use the MTBF formula for two systems in series, $1/(1/\text{MTBF}_1 + 1/\text{MTBF}_2)$, unless both systems 1 and 2 have constant failure rates. Use the workbooks…
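A small worked case, as a Python sketch under stated assumptions: subsystem 1 is two identical parts in parallel (one required) with constant rate lam1 each, subsystem 2 is one part with constant rate lam2, all independent with no repair. The exact MTBF is the integral of R1(t)R2(t), which here has the closed form 2/(lam1 + lam2) - 1/(2*lam1 + lam2); the naive value plugs MTBF1 = 1.5/lam1 into the series formula. The function name is illustrative.

```python
import math

def series_parallel_pair(lam1, lam2):
    """Subsystem 1: two identical parts in parallel (rate lam1 each);
    subsystem 2: one part (rate lam2); the two subsystems in series."""
    def r_true(t):
        r1 = 2 * math.exp(-lam1 * t) - math.exp(-2 * lam1 * t)
        return r1 * math.exp(-lam2 * t)
    # Exact MTBF: integral of r_true(t) from 0 to infinity, in closed form.
    mtbf_true = 2 / (lam1 + lam2) - 1 / (2 * lam1 + lam2)
    # Naive: treat subsystem 1 as if it had constant rate 1/MTBF1.
    mtbf1, mtbf2 = 1.5 / lam1, 1 / lam2
    mtbf_naive = 1 / (1 / mtbf1 + 1 / mtbf2)
    def r_naive(t):
        return math.exp(-t / mtbf_naive)
    return mtbf_true, mtbf_naive, r_true, r_naive

lam1 = lam2 = 1000e-9                  # 1000 FITs for every part
mtbf_true, mtbf_naive, r_true, r_naive = series_parallel_pair(lam1, lam2)
print(f"MTBF exact {mtbf_true:.0f} h, naive {mtbf_naive:.0f} h")
for t in [1e5, 1e6, 2e6, 3e6]:
    print(f"R({t:.0e} h): exact {r_true(t):.4f}, naive {r_naive(t):.4f}")
```

With every part at 1000 FITs, the naive MTBF (600,000 hours) is lower than the exact one (about 666,667 hours), yet the naive reliability is higher than the exact reliability at 2e6 and 3e6 hours, so the direction of the error can't be assumed.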

Could we use the availability formula MTBF/(MTBF + MTTR) for redundant systems? We could, but the real availability depends on age as well as on redundancy. Even if you plug in the redundant-system MTBF, the result is incorrect, because repair times may not be exponentially distributed and because the formula doesn’t account for redundant parts that keep working while a failed part is repaired. Use the workbooks…
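As an illustration (a Python sketch, not the workbook's method), assume two identical parts in parallel, one required, each failing at constant rate lam and repaired at constant rate mu by its own repair crew, so parts are up or down independently. The system availability at age t is then 1 - (1 - A(t))**2, where A(t) is the point availability of one part; the naive value plugs the redundant MTBF, 1.5/lam, into MTBF/(MTBF + MTTR). Function names and rates are illustrative.

```python
import math

def part_availability(t, lam, mu):
    """Point availability at age t of one part that starts new, fails at
    constant rate lam, and is repaired at constant rate mu (exponential
    repair times, an assumption)."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def parallel_pair_availability(t, lam, mu):
    """One of two parts required; each part has its own repair crew, so the
    parts fail and are repaired independently (an assumption)."""
    u = 1 - part_availability(t, lam, mu)
    return 1 - u * u

lam, mu = 1e-4, 0.1            # part MTBF 10,000 h, MTTR = 1/mu = 10 h
mtbf_sys, mttr = 1.5 / lam, 1 / mu
naive = mtbf_sys / (mtbf_sys + mttr)
for t in [0, 5, 20, 100]:
    print(f"age {t:>3} h: availability {parallel_pair_availability(t, lam, mu):.7f}")
print(f"naive MTBF/(MTBF + MTTR): {naive:.7f}")
```

With these illustrative numbers, availability starts at 1 and settles near 0.999999, while the naive formula gives about 0.99933, because it treats every repair as system downtime even though the other part keeps the system working.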

WORKBOOK SOFTWARE

The workbooks are Excel spreadsheets and VBA software that compute and, when necessary, simulate MTBF, reliability, and availability, along with the standard deviations of these estimates due to simulation. Spreadsheets allow parameter and reliability block diagram inputs, because spreadsheet cells resemble block diagrams. Figure 1 shows a series system where subsystem 1 consists of 5 parts in parallel, one of which is required; subsystem 2 consists of 4 parts in parallel, 2 of which are required; and so on. The FITs of subsystems 5 and 6 could have been added to make a single subsystem in series with the other, redundant subsystems. Simply enter FITs to indicate the presence of a part, starting from the left. Specify how many parts of each subsystem are required. FITs don’t have to be the same within a subsystem. The spreadsheet computes the remaining entries. (FIT stands for Failures In Time: one FIT is one failure per 1E9 (billion) hours, so a failure rate in FITs is 1000 times the same rate expressed in failures per million hours.)

Part          Subsys 1  Subsys 2  Subsys 3  Subsys 4  Subsys 5  Subsys 6
1                  100      2000      2593      3000      3200      2400  <-FITs
2                  100      2000      2593      3000                      <-FITs
3                  100      2000                                          <-FITs
4                  100      2000                                          <-FITs
5                  100                                                    <-FITs
6                                                                         <-FITs
7                                                                         <-FITs
8                                                                         <-FITs
K out of N           1         2         1         1         1         1  <-Enter
N                    5         4         2         2         1         1  Computed
Nsubsys              6                                                    Computed
Nseries           TRUE                                                    Computed

Figure 1. Sample input to Series-Parallel workbook
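As an illustration only (the workbook's own formulas are not shown in this article), here is a deterministic Python sketch that evaluates the Figure 1 system, assuming independent parts with constant failure rates and no repair. Because FITs may differ within a subsystem, it counts working parts with a small recursion over the parts rather than the binomial formula, multiplies subsystem reliabilities for the series connection, and integrates R(t) numerically for MTBF. Function names, the step size, and the integration horizon are assumptions.

```python
import math

# Figure 1 inputs: (part FITs, number of parts required k) per subsystem.
FIGURE_1 = [
    ([100, 100, 100, 100, 100], 1),
    ([2000, 2000, 2000, 2000], 2),
    ([2593, 2593], 1),
    ([3000, 3000], 1),
    ([3200], 1),
    ([2400], 1),
]

def k_of_n_reliability(t, fits, k):
    """P[at least k parts of a subsystem survive to age t]; parts are
    independent with constant failure rates, and FITs may differ."""
    dist = [1.0]                       # dist[j] = P[exactly j parts working]
    for fit in fits:
        p = math.exp(-fit * 1e-9 * t)
        new = [0.0] * (len(dist) + 1)
        for j, pr in enumerate(dist):
            new[j] += pr * (1 - p)     # this part has failed
            new[j + 1] += pr * p       # this part still works
        dist = new
    return sum(dist[k:])

def system_reliability(t, subsystems=FIGURE_1):
    """Series system of independent k-out-of-n subsystems."""
    return math.prod(k_of_n_reliability(t, fits, k) for fits, k in subsystems)

def system_mtbf(subsystems=FIGURE_1, step=500.0, horizon=5e6):
    """MTBF = integral of R(t) dt, approximated by the trapezoid rule."""
    n = int(horizon / step)
    total = 0.5 * (system_reliability(0, subsystems)
                   + system_reliability(horizon, subsystems))
    total += sum(system_reliability(i * step, subsystems) for i in range(1, n))
    return total * step

print(f"R(1 year) = {system_reliability(8760):.4f}")
print(f"MTBF = {system_mtbf():.0f} hours")
```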

Although the reliability statistics for parallel and k-out-of-n systems have been known for many years [Klion is cited in MIL-HDBK-217], the mathematics for combining such subsystems is not commonly used [George, Dec. 2001]. At least five web sites do it incorrectly; three of them belong to NASA.

The workbook computes MTBF, reliability, and availability from 1000 simulations. For more simulation runs, run the VBA macro. There are two versions of this workbook: one for series-parallel systems such as the one in figure 1, and one for parallel-series systems, which replicate systems like figure 1 in parallel. If you have other needs, we will try to provide spreadsheets and VBA software to satisfy them. Contact pstlarry@yahoo.com for the workbook.
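The workbook's simulation is written in VBA; purely to illustrate the idea, this Python sketch simulates the Figure 1 system 1000 times and reports estimated MTBF and reliability with their simulation standard errors. It assumes independent, exponentially distributed part lives and no repair, so it covers MTBF and reliability but not availability; the function names are hypothetical, not the workbook's macro.

```python
import math
import random
import statistics

# Figure 1 inputs: (part FITs, number of parts required k) per subsystem.
FIGURE_1 = [
    ([100] * 5, 1),
    ([2000] * 4, 2),
    ([2593] * 2, 1),
    ([3000] * 2, 1),
    ([3200], 1),
    ([2400], 1),
]

def simulate_lives(subsystems, runs=1000, seed=1):
    """Simulated system lives in hours, assuming independent exponential
    part lives and no repair."""
    rng = random.Random(seed)
    lives = []
    for _ in range(runs):
        subsystem_lives = []
        for fits, k in subsystems:
            part_lives = sorted(rng.expovariate(fit * 1e-9) for fit in fits)
            # k-out-of-n fails at the (n-k+1)-th part failure (0-based index n-k).
            subsystem_lives.append(part_lives[len(fits) - k])
        lives.append(min(subsystem_lives))   # series: first subsystem failure
    return lives

def simulate_mtbf(subsystems, runs=1000, seed=1):
    """Estimated MTBF and its simulation standard error."""
    lives = simulate_lives(subsystems, runs, seed)
    return statistics.fmean(lives), statistics.stdev(lives) / math.sqrt(runs)

def simulate_reliability(subsystems, t, runs=1000, seed=1):
    """Estimated P[system life > t] and its binomial standard error."""
    lives = simulate_lives(subsystems, runs, seed)
    r = sum(life > t for life in lives) / runs
    return r, math.sqrt(r * (1 - r) / runs)

mtbf, se = simulate_mtbf(FIGURE_1)
print(f"MTBF about {mtbf:.0f} h (standard error {se:.0f} h)")
r, se_r = simulate_reliability(FIGURE_1, 8760)
print(f"R(1 year) about {r:.3f} (standard error {se_r:.3f})")
```

Rerunning with a larger `runs` value shrinks the standard errors roughly in proportion to 1/sqrt(runs), which is the point of running more simulations.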

References

George, L. L., “MTBF Versus Age-Specific Reliability Prediction,” ASQ Reliability Review, Vol. 21, No. 2, pp. 13–15, June 2001

George, L. L., “MTBF Prediction for Redundant Systems,” ASQ Reliability Review, Vol. 21, No. 4, Dec. 2001

Jones, Jeff and Joseph Hayes, “A Comparison of Electronic-Reliability Prediction Models,” IEEE Trans. on Reliability, Vol. 48, No. 2, pp. 127–134, June 1999