Virtual Reliability Statistics or How to Cheat With Reliability Statistics
Sept. 9, 2004: Revised April 31, 2006; July 20 and Aug. 27, 2007; May 10, June 4, Nov. 19, and Dec. 11, 2010; Jan. 22, 2011, April 18, 2011
“Figures often beguile me, particularly when I have the opportunity of arranging them myself; in which case the remark attributed to Disraeli would often apply with justice and force: ‘There are three kinds of lies: lies, damned lies, and statistics.’” (Mark Twain)
“The idea of falsification is that one should not accept a new finding uncritically, but should do one's best to devise experiments to discredit (falsify) it; that which survives the hardest tests is taken as the closest to truth” (Karl Popper)
This collection lists reliability statistical cheats that are not in Darrell Huff’s “How to Lie With Statistics.” If you are not familiar with Huff’s obvious and brazen lies, you should be. The cheats in this list are subtler and may not have been recognized by their perpetrators. I made some, inadvertently, and I thank those who contributed the others, regardless of how they produced them.
A company tested six sets of 80 power LEDs for 6000 hours [LM-80 test method] and published the lumen measurements and extrapolations to the age of 70% of initial lumens (L70). The L70 extrapolations overfit, the MTTF confidence limits appear to be on the extrapolations, and the Weibull reliability model doesn’t fit. After initial variation, lumen measurements are normally distributed, so LED L70 reliability has an inverse-Gaussian distribution. The simplest model of LED deterioration is geometric Brownian motion with drift, the Black-Scholes stock price model that led to hedging, LTCM, SIVs, CDOs, CDSs, and the financial crisis. This model gives the inverse-Gaussian L70 parameter estimates from the lumen measurement drift and variance. Power LED and other semiconductor reliabilities may differ, but the geometric Brownian motion and inverse-Gaussian reliability model are appealingly parsimonious.
Table 1. Alternative methods to estimate MTTF and LED L70 reliability. Explanations follow table.
| Method → | Parametric statistics | Creative statistics | Black-Scholes geometric Brownian motion |
| Data | iid sample of ages at 70% lumens | lumens over time (80 LEDs) | lumens over time (80 LEDs) |
| Distribution | Physical, theoretical, or empirical | Extrapolate all 80 “exponentially” to L70 | Mixture of inverse-Gauss |
| MTTF estimate, LCL, and reliability | From sample data and distribution parameters | Fit Weibull from positive L70s (some were negative) | Compute from geometric Brownian motion parameters |
| Parameter estimates | Typically 2: scale and shape | 160 regression + 2 Weibull | 2: drift and variance |
Parametric Statistical MTTF Inference
1. Collect independent and identically distributed random sample of ages at failures
2. If there is some theoretical or physical justification for a distribution, estimate parameters of the distribution, typically two parameters
3. Compute MTTF as a function of parameter (estimates) and estimate a statistical confidence limit on MTTF.
Creative LED MTTF Inference
1. Collect 80 LED lumen measurements from 0 to 6000 hours [LM-80]
2. Extrapolate each of the 80 LEDs’ lumen measurements “exponentially” to L70 ages, ~100,000s of hours, 160 parameters. Fit Weibull, two more parameters. (Ignore the 6 negative L70 extrapolations.)
3. Estimate lower 90% confidence limit on MTTF of Weibull fit to extrapolations
Black-Scholes LED L70 Reliability Inference
1. Estimate drift and variance parameters from 80 LED lumens measurements
2. Extrapolate LED’s 6000-hour lumens as geometric Brownian motion
3. Estimate L70 reliability function as a mixture of inverse-Gaussian distributions, a function of the drift and variance parameter estimates
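The three Black-Scholes steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the analysis in the linked document: the function names, the `fraction_now` parameter, and the numbers in the usage note are hypothetical, the drift and volatility are estimated from one LED's readings, and the first-passage formula is the standard inverse-Gaussian result for Brownian motion with drift applied to ln(lumens).

```python
import math

def estimate_drift_and_sigma(hours, lumens):
    """Maximum-likelihood drift and volatility of ln(lumens) for one LED (assumed model)."""
    steps = zip(hours, hours[1:], lumens, lumens[1:])
    log_moves = [(t1 - t0, math.log(y1 / y0)) for t0, t1, y0, y1 in steps]
    total_time = sum(dt for dt, _ in log_moves)
    drift = sum(move for _, move in log_moves) / total_time
    variance = sum((move - drift * dt) ** 2 / dt for dt, move in log_moves) / len(log_moves)
    return drift, math.sqrt(variance)

def normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def l70_reliability(t, drift, sigma, fraction_now=1.0, threshold=0.70):
    """P[lumens have not yet fallen to the threshold after t more hours].

    fraction_now is the current lumen level as a fraction of initial lumens
    (hypothetical parameter). Large values of 2*rate*barrier/sigma**2 can
    overflow math.exp; this sketch does not guard against that.
    """
    if t <= 0:
        return 1.0
    barrier = math.log(fraction_now / threshold)  # distance to L70 on the log scale
    rate = -drift                                 # downward drift as a positive number
    spread = sigma * math.sqrt(t)
    first_passage_cdf = (
        normal_cdf((rate * t - barrier) / spread)
        + math.exp(2.0 * rate * barrier / sigma ** 2)
        * normal_cdf(-(rate * t + barrier) / spread)
    )
    return 1.0 - first_passage_cdf
```

A mixture over the 80 LEDs could then be approximated by averaging `l70_reliability` over each LED's own `fraction_now` at 6000 hours. With a hypothetical drift of -1e-5 per hour and sigma of 1e-3 per square-root hour, the mean L70 age is barrier/rate, roughly 36,000 hours.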
For the full story and a constructive model of semiconductor deterioration reliability, please read http://www.fieldreliability.com/PhilLEDs.doc.
PS: The source, http://www.philipslumileds.com/pdfs/DR03.pdf, now contains 10,000-hour test data and no longer publishes the extrapolations [dated “10/05/28”, May 28, 2010].
The Weibull reliability model is P[Life > t] = exp[-(t/h)^b], with scale parameter h and shape parameter b. I don’t object to using the Weibull reliability model when:
· There’s some physical justification such as failure of weakest link in a chain
· It fits as well as nonparametric statistics, according to the Akaike Information Criterion, http://en.wikipedia.org/Akaike_information_criterion
· The entropy increase due to the Weibull assumption is justified or negligible
· Extrapolation beyond the age of the oldest failure is not taken seriously.
A product had “component D” failures in 12 months. (I have been assured that this is NOT a GE appliance component.) How many more would fail within their 36-month warranty? ASQ’s Quality Progress Statistics Roundtable published the data, Weibull analysis, a forecast, and a prediction interval. The data included 12 monthly left-censored failure counts collected at one calendar time. The Weibull analysis included actuarial failure forecasts. The note at http://www.fieldreliability.com/QPMeeker.doc describes nonparametric alternatives to Weibull analysis and quantifies extrapolation uncertainty. The nonparametric forecasts are larger than the Weibull forecasts. Alternative extrapolations of nonparametric failure rates from data subsets quantify uncertainty.
Thanks to the magic of Weibull mathematics, MTBF = h*Γ(1+1/b), where Γ is the gamma function, decreasing b increases MTBF. Assume reliability is a mixture of two Weibull reliability functions, one with infant mortality b < 1 and one with wearout b > 1. By changing the mixture proportion, you can achieve practically any MTBF you want. Process defects cause infant mortality, so mix process defective products with good ones in the right proportion, and you can get a great MTBF.
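A quick numerical check of that claim, as a sketch only: the scale and shape values below are hypothetical, chosen so that both subpopulations share the same scale parameter and differ only in shape.

```python
import math

def weibull_mtbf(scale, shape):
    """Mean of a Weibull distribution: scale * Gamma(1 + 1/shape)."""
    return scale * math.gamma(1.0 + 1.0 / shape)

def mixture_mtbf(p_infant, infant, good):
    """Mean life of a mixture: p_infant of the units follow `infant`, the rest `good`."""
    return p_infant * weibull_mtbf(*infant) + (1.0 - p_infant) * weibull_mtbf(*good)

# Hypothetical (scale, shape) pairs: infant mortality b = 0.5, wearout b = 3.
INFANT = (1000.0, 0.5)
GOOD = (1000.0, 3.0)

for p in (0.0, 0.1, 0.5, 0.9):
    print(f"{p:.0%} defective -> MTBF {mixture_mtbf(p, INFANT, GOOD):.0f} hours")
```

With these made-up numbers the infant-mortality subpopulation has the larger mean, because Γ(1+1/0.5) = Γ(3) = 2, so the mixture MTBF rises as the proportion of defectives rises.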
“When updating life data after a previous analysis, it is important to understand the impact of the choice of the model and the statistical analysis method upon the results. A good practice is to look at probability plots in order to understand the total behavior of the model. When the analysis and further updates are performed without examining the overall behavior of the statistical models, wrong conclusions can be drawn.” (http://www.weibull.com/hotwire/issue117/relbasics117.htm)
“An aerospace manufacturer is looking at the reliability of a new component with an intended mission duration of 2,100 hours. The reliability engineer has internal test data for the beta version and has also collected reliability information from the company’s beta customers after 2,500 hours of usage, in which no failures were seen.”
There were 10 “internal” test failures, and 30 beta test units survived 2500 hours. “The reliability engineer, Lisa, decides to use a 2-parameter Weibull distribution to analyze the data” [for no apparent reason]. Her Weibull software estimated
P[Life > 2100 hours] = 97.65%.
The 30 beta test units went on to survive 7000 hours. Lisa was then surprised to find the Weibull software estimated
P[Life > 2100 hours] = 95.61%.
There were only two test failures within 2100 hours, at 1591 and 1866 hours. An empiricist would say P[Life > 2100 hours] is 1-2/40 = 95%, regardless of the survivors’ ages beyond 2100 hours.
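The empiricist's answer can be reproduced with a small Kaplan-Meier sketch. Only the two failure ages below 2100 hours come from the example; the other eight internal failure ages are hypothetical placeholders, and the function name is an assumption made for illustration.

```python
def kaplan_meier_at(t, failure_ages, survivor_ages):
    """Kaplan-Meier estimate of P[Life > t] from failure ages and censored (survivor) ages."""
    events = [(age, 1) for age in failure_ages] + [(age, 0) for age in survivor_ages]
    events.sort(key=lambda event: (event[0], -event[1]))  # failures before survivors at ties
    at_risk = len(events)
    reliability = 1.0
    for age, failed in events:
        if age > t:
            break
        if failed:
            reliability *= (at_risk - 1) / at_risk
        at_risk -= 1
    return reliability

# First two ages are from the example; the remaining eight are hypothetical.
internal_failures = [1591, 1866, 2300, 2700, 3100, 3600, 4200, 4800, 5500, 6300]

r_after_2500 = kaplan_meier_at(2100, internal_failures, [2500] * 30)
r_after_7000 = kaplan_meier_at(2100, internal_failures, [7000] * 30)
print(r_after_2500, r_after_7000)
```

Both calls give the same estimate at 2100 hours, because nothing that happens after 2100 hours enters a nonparametric estimate at 2100 hours.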
If you plug in enough DoAs (Dead on Arrival, e.g., failures at or near age zero or on first cycle), you’ll get a Weibull shape parameter estimate of b < 1 and a correspondingly larger MTBF. Figures 1 and 2 show Weibull probability paper plots with DoAs.
To make the Weibull probability paper plot and estimate parameters, Weibull software computes ln(Age). If you input age at failure = 0, ln(0) = -∞, the software will either crash or alter your input. You or the software might have to convert the age at failures of the DoAs from 0 to 1.0, but, if the age scale of most failures is great compared with age 1.0, that approximation is acceptable. Regardless of how tiny ages at failures are treated, Weibull software blithely estimates Weibull shape parameters b < 1. This drastically underestimates b and drastically overestimates MTBF.
Weibull software is intended to estimate parameters of a continuous distribution, because Weibull reliability functions model continuous random variables. If failures are grouped near specific ages, like DoAs, use a mixture model with mass at those ages and Weibull reliability conditional on survival beyond those ages; e.g.,
P[Life > t] = (1-P[DoA])*exp[-(t/h)^b] for t ≥ 0, so that P[Life ≤ t] = P[DoA]+(1-P[DoA])*(1-exp[-(t/h)^b]).
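A minimal sketch of the DoA mixture, assuming a point mass at age zero and a Weibull reliability function for the units that survive arrival. The function names and the numbers in the test are hypothetical; fitting the Weibull parameters to the non-DoA ages is left to whatever method you prefer.

```python
import math

def doa_fraction(failure_ages, total_units):
    """Estimate P[DoA] as the share of all units that failed at age zero."""
    return sum(1 for age in failure_ages if age == 0) / total_units

def doa_mixture_reliability(t, p_doa, scale, shape):
    """P[Life > t] with mass p_doa at age 0 and Weibull(scale, shape) otherwise."""
    if t < 0:
        return 1.0
    return (1.0 - p_doa) * math.exp(-((t / scale) ** shape))

def doa_mixture_mtbf(p_doa, scale, shape):
    """Mean life: DoAs contribute zero, the rest contribute the Weibull mean."""
    return (1.0 - p_doa) * scale * math.gamma(1.0 + 1.0 / shape)
```

The design choice is that DoAs lower reliability and MTBF directly through `p_doa`, instead of dragging the shape parameter estimate below 1.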
P.S.: Don’t use the three-parameter Weibull reliability function, exp[-((t-d)/h)^b], unless you are reasonably sure the delay d is not random. The fixed delay d should not be used to model random sell-through time or shelf time.

Figure 1. Weibull probability paper plot with lots of DoAs. The data come from three successive lots.
Figure 2. Another Weibull probability paper plot with failures on the first stress cycle
Weibull software input is supposed to be a random sample of ages at failures and survivors’ ages, grouped or not. Such a sample could come from units that started a test at the same time (cohort) and were observed until some future time or until a specified number failed. Production and field data come from different cohorts such as monthly production, which have different ages at the end of observation or time now.
Suppose you carefully compute each cohort’s operating hours at the end of observation and plug those and the observed hours at failures into Weibull software. What’s wrong with that? The resulting Weibull reliability and MTBF estimates are probably biased high. Why? The failed units had smaller operating ages at failures than their cohort, but we don’t know which cohort they came from. They may have been treated as if they underwent instant repair and resumed operation, good-as-old (aka relevation). Typical input format encourages this cheat, e.g., Table 1, http://www.reliasoft.com/Weibull/examples/w7ex1/index.htm.
Table 1. Typical Weibull software input
| Number in State | State | Time to | Subset |
| 30 | S | 1 | |
| 28 | S | 2 | |
| 25 | S | 3 | |
| 17 | S | 4 | |
| 11 | S | 5 | |
| 9 | S | 6 | |
| 1 | F | 0.3 | |
| 1 | F | 1.1 | |
| 1 | F | 4.5 | |
I recommend input described in www.fieldreliability.com/KMUsrMan.htm, so that cohorts and failed units are removed from the units at risk at their ages at failures or at survivors’ ages [Klein and Moeschberger, Survival Analysis, Springer, 1996, pp. 138-9]. Alternatively, if there is no way to find out which cohort failed units came from, contact pstlarry@yahoo.com for the nonparametric maximum likelihood (Kaplan-Meier) reliability estimator and the least squares Weibull fit to it.
Weibull software plots an empirical reliability function estimate and confidence limits on it. Although the empirical reliability function wiggles, it stays within the confidence limits. [I’ll include the plot if I get permission. The plot at http://www.reliasoft.com/Weibull/examples/w7ex3/index.htm is a 1-sided plot.] The confidence limits are probably computed under the assumption that the underlying reliability function is Weibull. Does that mean the Weibull reliability model is OK? That depends on what you mean by OK. Ordinarily a confidence limit on a reliability function applies at one age at failure t(1), P[R(t(1)) > r(1)] ≥ 0.95, as in www.fieldreliability.com/KMUsrMan.htm. The plot implies that if the entire empirical reliability function lies within the confidence limits, then with 95% confidence the reliability function is Weibull. Sorry, the joint statement is not even true for a finite set of ages at failures t(1), t(2),…,t(k): it does not follow that P[R(t(1)) > r(1), R(t(2)) > r(2),…, R(t(k)) > r(k)] ≥ 0.95. Nor is it true that P[R(t(1)) > r(1)]*P[R(t(2)) > r(2)]*…*P[R(t(k)) > r(k)] ≥ 0.95^k gives the joint confidence, because the estimators r(1), r(2),…,r(k) of the confidence limits are dependent. Does Weibull software account for that dependence, or is the plot just a collection of single confidence limits, each valid for one age?
If you want a confidence limit on an entire reliability function, at least that portion between the youngest and oldest failures, contact me. There are several references on confidence limits on reliability functions. I’ll dig them out and program them if somebody sends me data.
Truncating a Sequential Probability Ratio Test (SPRT) at a maximum number of failures or a maximum total test time is widespread practice. (An example is entitled “A Sequential Reliability Test Plan,” http://en.wikipedia.org/wiki/Reliability_engineering.) Truncation changes the actual risks compared with those of an untruncated SPRT. Design Maturity Testing (DMT) is a case in point: “For practical purposes Design Maturity Testing (DMT) is truncated at the number of failures or the associated time equal to a non-sequential test having the same risks and discrimination ratio.” How much error?
SPRT errs because it approximates limits. DMT compounds error by forcing wrong decisions, sometimes.
I simulated SPRT and DMT test errors for an exponential MTBF test. Figure 3 shows “ttt”, total time on test, and p represents failures. SPRT accepts above the upper limit and rejects below the lower limit. The SPRT shown didn’t stop until the 20th failure, and the DMT conclusion at the 14th failure was ambivalent.
Managers dislike SPRT despite smaller expected sample sizes because the sample size is random. Regardless, don’t truncate SPRT because you don’t like current results.
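A simulation along these lines can be sketched as follows. This is not the simulation behind Figure 3: it assumes exponential lives, Wald's approximate limits, continuous monitoring of total time on test, and a deliberately crude truncation rule (reject at `max_failures`). The plan values in `PLAN` are hypothetical.

```python
import math
import random

def sprt_exponential(true_mtbf, theta0, theta1, alpha, beta, rng, max_failures=None):
    """One SPRT of H0: MTBF = theta0 against H1: MTBF = theta1 < theta0."""
    ln_ratio = math.log(theta0 / theta1)
    slope = 1.0 / theta1 - 1.0 / theta0
    ln_reject = math.log((1.0 - beta) / alpha)  # Wald's approximate upper limit
    ln_accept = math.log(beta / (1.0 - alpha))  # Wald's approximate lower limit
    ttt, failures = 0.0, 0
    while True:
        # Total time on test needed to accept with the current failure count.
        accept_time = (failures * ln_ratio - ln_accept) / slope
        gap = rng.expovariate(1.0 / true_mtbf)
        if ttt + gap >= accept_time:
            return "accept", failures
        ttt += gap
        failures += 1
        if ttt <= (failures * ln_ratio - ln_reject) / slope:
            return "reject", failures
        if max_failures is not None and failures >= max_failures:
            return "reject", failures  # assumed truncation rule, for illustration

def reject_rate(true_mtbf, runs=2000, seed=1, **plan):
    """Fraction of simulated tests that reject; run i uses its own seeded generator."""
    rejects = 0
    for i in range(runs):
        decision, _ = sprt_exponential(true_mtbf, rng=random.Random(seed + i), **plan)
        rejects += decision == "reject"
    return rejects / runs

# Hypothetical plan: discrimination ratio 2, 10% nominal risks.
PLAN = dict(theta0=1000.0, theta1=500.0, alpha=0.1, beta=0.1)

print("producer's risk, untruncated:", reject_rate(1000.0, **PLAN))
print("producer's risk, truncated at 5 failures:", reject_rate(1000.0, max_failures=5, **PLAN))
```

Because each run reuses the same seeded failure times with and without truncation, the truncated plan can only reject at least as often, so the difference between the two printed rates is the extra producer's risk that this truncation rule adds.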

Figure 3. Simulation of untruncated SPRT of an MTBF (ttt stands for total time on test)
Since the original exposé of this error (http://www.fieldreliability.com/NwsRev2.doc), there have been publications of statistically correct versions of the same test plan: http://cresst96.cse.ucla.edu/reports/R606.pdf.
Hypothesis tests about MTBF or reliability are typically stated so that the null hypothesis is that the MTBF or reliability is acceptable. In that context, hypothesis tests have two types of errors: rejecting a good product (“producer’s risk”) and accepting a bad one (“consumer’s risk”).
Here are not one, not two, but three ways to pass MTBF demonstration tests:
· Rerun tests until pass
· Add samples until pass
· Change specifications to what the test demonstrated (Doganskoy, Hahn, and Meeker, Quality Progress, June 2007, p. 74, pointed out by Wes Fulton, QP Mailbag July 2007)
If there is any probability of passing a test, retest until a product passes. It’s a little subtler to add samples to a multiple item test, until the test passes. It’s not subtle to change the specifications.
Suppose an acceptance test has a 10% probability of accepting the specified MTBF when the true MTBF is less. If you rerun the test, and the tests are independent, the probability of accepting in at least one of two tests is 1-0.9^2 = 19%; if you run it three times, the probability of accepting at least once is 1-0.9^3 = 27%. But the production line will still ship products with lower MTBF than specified.
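The arithmetic behind those percentages, as a short sketch; the function name is made up for illustration and the tests are assumed independent, as in the text.

```python
def prob_pass_at_least_once(p_pass, attempts):
    """Probability that at least one of several independent tests passes."""
    return 1.0 - (1.0 - p_pass) ** attempts

for attempts in (1, 2, 3, 10):
    print(attempts, round(prob_pass_at_least_once(0.10, attempts), 3))
```

With a 10% chance per attempt, ten reruns pass a bad product about 65% of the time.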
Hypothesis tests set up for one sample size, significance level, and power may have a 10% probability of accepting the null hypothesis when false. Suppose the sample is taken and the alternative hypothesis is accepted. Add some more samples and repeat the calculation of the test statistic, using all the samples. Now the probability of accepting the null hypothesis may be as high as 19%.
Suppose test results are looking bad, failures are accumulating. Why not add some more samples to the test? Two failures out of 20 are better than 2 failures out of 10.
Alternatively, “Say I have a success-run test on 100 units. I plan to run them for 6 months. I want to demonstrate 97%Reliability at 95%LCL. After 4 months I have a failure. I say that I have demonstrated 97% Reliability at 95% LCL at 4 months since up until that time, I had no failures (though the original goal is out the window). Is this correct?” [http://www.reliasoftforums.com/showthread.php?p=1540#post1540]
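For reference, here is a sketch of what a success-run test can and cannot demonstrate, treating the test as binomial at one fixed, preplanned duration. The function names are hypothetical, and the lower confidence limit is the usual exact binomial (Clopper-Pearson) bound found by bisection. It says nothing about redefining the duration after seeing a failure, which is the cheat in the question.

```python
import math

def units_for_success_run(reliability, confidence):
    """Units that must all survive to demonstrate `reliability` at `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

def reliability_lcl(units, failures, confidence):
    """Exact binomial lower confidence limit on reliability, by bisection."""
    def prob_at_most(r):
        q = 1.0 - r
        return sum(math.comb(units, k) * q ** k * r ** (units - k)
                   for k in range(failures + 1))
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if prob_at_most(mid) < 1.0 - confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(units_for_success_run(0.97, 0.95))   # units needed with zero failures
print(reliability_lcl(100, 0, 0.95))       # 100 units, no failures
print(reliability_lcl(100, 1, 0.95))       # 100 units, one failure
```

With 100 units, zero failures supports a lower limit just above 97%; a single failure at the planned duration drops the limit to about 95%.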
It is tempting to model reliability functions of infant mortality and good products with one reliability function up to the age at which infant mortality ends (assuming it ends) and another thereafter. If done carelessly, one can achieve great MTBF (What MTBF Would You Like?) as well as high reliability.
Figure 4 shows two probability density functions representing bad and good parts and their mixture pdf with a mixture of 10% bad.

Figure 4. Probability density functions for a mixture population, 10% f1(t) and 90% f2(t)
The cheat is that reliability functions are supposed to be nonincreasing, but if you only plot pdfs, you may not notice the glitch in the reliability function that changes at the end of infant mortality. Of course, it’s hard to estimate the reliability function of the good products, because you have to wait a long, long time.

Figure 5. Reliability function for same population, as a mixture (correct) and changing with age (incorrect)
Figure 4 shows overlap representing circumstances when random stress, the pdf with smaller numerical values, exceeds random strength, the pdf with larger numerical values. It is tempting to compute P[Stress > Strength] as the area under the overlapping intersection. Oops, one of my students actually did that. It’s not simple. Unfortunately, that area is not P[Stress > Strength].
Look at the mathematical formulas for the alternatives. The overlap area in figure 4 may be computed as ∫ min[f1(t), f2(t)] dt from 0 to infinity, where f1(t) and f2(t) are the stress and strength pdfs. P[Stress > Strength] is, by conditioning on strength equal to x, the integral of P[Stress > x]*f2(x) over strength x from 0 to infinity, where P[Stress > x] is the integral of f1(t) over stress t from x to infinity. That is ∫ [∫ f1(t) dt] f2(x) dx, where the inner integral runs from x to infinity and the outer integral runs from 0 to infinity.
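A numerical comparison makes the difference concrete. This sketch assumes normally distributed stress and strength with hypothetical means and standard deviations (the negative tail of each normal is negligible here), and evaluates both formulas above with a simple trapezoid rule.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_sf(x, mu, sigma):
    """P[X > x] for a normal random variable."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def integrate(func, lo, hi, steps=20000):
    """Trapezoid rule on [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (func(lo) + func(hi))
    for i in range(1, steps):
        total += func(lo + i * width)
    return total * width

STRESS = (50.0, 10.0)    # hypothetical mean and standard deviation
STRENGTH = (70.0, 10.0)  # hypothetical mean and standard deviation

overlap_area = integrate(
    lambda t: min(normal_pdf(t, *STRESS), normal_pdf(t, *STRENGTH)), 0.0, 150.0)
p_stress_exceeds_strength = integrate(
    lambda x: normal_sf(x, *STRESS) * normal_pdf(x, *STRENGTH), 0.0, 150.0)

print(overlap_area, p_stress_exceeds_strength)
```

For these made-up parameters the overlap area is roughly 0.32 while P[Stress > Strength] is roughly 0.08, so the overlap area overstates the failure probability about fourfold.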
Restrict the alternative hypotheses to unlikely possibilities and you can prove practically any null hypothesis. A typical application of this cheat that I learned in school is to test constant failure rate against the alternative of an increasing failure rate. That alternative hypothesis leaves out failure rates that may increase for some ages and decrease for others, a more likely alternative hypothesis than a monotonically increasing failure rate.
This is an example of Type III error, population misspecification. Ronald Fisher recognized the job of a statistician as:
1. specification of the kind of population that the data came from
2. estimation
3. distribution of the statistics derived from samples.
By specification, Fisher meant that the statistical distribution(s) involved should encompass both the null and the alternative hypotheses.
Speed limits are commonly set from the 85th percentile of measured speeds. The 85th percentile is supposed to be that of unobstructed speeds. Of course, the actual speed limit is rounded, probably down to the nearest multiple of 5 or 10 mph. In urban areas, traffic makes unobstructed speeds difficult to achieve. So what do traffic engineers do about that? Nothing. So how much bias is there? As much as 5 mph.
Dependent samples. Suppose you’re doing reliability demonstration testing (RDT) at two ages, such as warranty and useful life (bogey). In other words, you’re trying to demonstrate that:
1. R(warranty) = P[Life > warranty] is at least some numerical value with some specified confidence and
2. R(useful life) = P[Life > useful life] is at least some other, smaller numerical value with some other specified confidence
The obvious procedure is to continue to run survivors of the warranty longer. The obvious result is that actual confidence is not what you think it is, because of dependence from using the same sample.
Annual failure rate plots line the walls of break rooms and cafeterias in many companies. Intense, interminable meetings involve speculation regarding causes of the random wiggles in the monthly AFR charts. Elation sweeps everyone when the AFR inevitably turns down on yet another step in a random walk of a moving average.
The standard deviation of an average is inversely proportional to the square root of the sample size, and AFRs may have a large sample size. Nevertheless, it takes a long time for changes in reliability to percolate through to moving averages, especially wearout in the old age of product lives. Variations in reliability that are short relative to the moving average may never show in AFR. Statistical process control charts on returns are the most efficient way to obtain early warning of process defects and premature infant mortality. Please refer to www.fieldreliability.com/PrRelAct.htm for information about forecasting returns with confidence limits.
AMSAA Technical Report No. 197, “Confidence Interval Procedures for Reliability Growth Analysis,” by Larry Crow is intended to quantify MTBF, not reliability. It is intended for application to repeated failures, of one unit on test, perhaps after repairs or changes intended to improve MTBF (TAAF = test analyze and fix). What if multiple products are started on test simultaneously or after some delays, repairs, or improvements? How can you apply the Duane-Crow-AMSAA reliability growth model? You can’t or shouldn’t just chuck all the data into reliability growth software and expect to get meaningful results, even confidence intervals on MTBFs; I don’t know the meaning of the results. What should be done with the data? How? MIL-STD-1635 (EC) “Reliability Growth Testing,” seems oblivious to this dilemma too.
A few years ago, I was asked for reliability growth analysis of two machines. The question was how long would it be until the specification MTBF was achieved. One machine started operation earlier than the other and both had some failures. Changes had been incorporated into both machines to improve reliability. I fit a nonhomogeneous Poisson process to the failure data and produced an MTBF estimate. Let me know if you would like to do this, or send data.
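For readers who want to see the kind of calculation involved, here is a sketch of a power-law nonhomogeneous Poisson process fit (the Crow-AMSAA form) for one system observed to a fixed time. It is an assumption that this form matches the fit described above, which involved two machines with different start times; the failure times and function names below are hypothetical.

```python
import math

def fit_power_law_nhpp(failure_times, end_time):
    """Maximum-likelihood fit of intensity lam * beta * t**(beta - 1), time-truncated test."""
    n = len(failure_times)
    beta = n / sum(math.log(end_time / t) for t in failure_times)
    lam = n / end_time ** beta
    return lam, beta

def instantaneous_mtbf(t, lam, beta):
    """Reciprocal of the failure intensity at cumulative time t."""
    return 1.0 / (lam * beta * t ** (beta - 1.0))

def time_to_reach_mtbf(target_mtbf, lam, beta):
    """Cumulative time at which the fitted MTBF reaches the target (needs beta < 1)."""
    if beta >= 1.0:
        raise ValueError("no reliability growth: fitted beta is not below 1")
    return (1.0 / (lam * beta * target_mtbf)) ** (1.0 / (beta - 1.0))

# Hypothetical cumulative operating hours at failures, observed through 4000 hours.
times = [100.0, 300.0, 700.0, 1500.0, 3000.0]
lam, beta = fit_power_law_nhpp(times, 4000.0)
print(beta, instantaneous_mtbf(4000.0, lam, beta), time_to_reach_mtbf(2000.0, lam, beta))
```

Extrapolating to the time at which a specification MTBF is reached assumes growth continues at the fitted rate, which is exactly the kind of extrapolation that deserves a confidence interval.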