Alfred A. Brooks - 4/14/99
Introduction: This brief description of statistics is written to help lay persons understand the simple statistical statements that occur in many of the reports about the Oak Ridge Reservation.
Simple statistics is not difficult to understand. Most of us use some simple statistics every day without even thinking about it. We step on the bathroom scales two or three times because we believe that one measurement may be in error. To get a better number we might average three measurements of the scale. If one measurement were very far from the other two, we would believe that it might be wrong and throw it away. Statistics is just the extension of these almost instinctive actions, using mathematics to justify the methods. The following description of simple statistics will skip almost all of the mathematics and describe how the results can be interpreted. We will use some very simple examples to illustrate what we mean. The bold text contains the most important ideas. The regular type is commentary and explanation; it should be read but not necessarily memorized.
Statistics is an exact science that deals with inexact information. For this reason, it cannot always lead to numerically exact conclusions but rather states probabilities about these conclusions. Statistics by itself cannot prove cause and effect but can help reach these conclusions. When the sample size is truly large and the problem well defined, statistical results can be very accurate and useful. Few problems are this easy; most have small samples and data errors.
Populations and Distributions
Populations - A population is just a group of similar individuals, say, a group of fifth grade students. Most measured values of a population vary over some range; some students are taller or shorter than others, some weigh more or less than others. In other populations, such as repeated analyses of the same sample, the variation is usually small and due to random errors.
Distributions - A distribution curve describes the variation in a set of measurements made on some individuals in a population. Most often there is a hump in the middle with equal tails going off in both directions; this is the well-known bell curve. One specific group of bell curves is called the normal distribution. It can be very narrow or very wide. (See the Figure.) Some distributions do not have this shape and the tails may not be symmetrical. In many cases it is assumed that the distribution curve is normal or close to it. In any case, it is important that any sample of the population used to draw statistical conclusions be chosen randomly. A biased sample severely limits the conclusions that can be drawn from the data. For instance, if the sample of students were all boys, then nothing could be said about the average weight of girls.
The Description of Distributions
A distribution curve is described by five quantities (sample size, average, standard deviation, percentiles, and confidence limits) that are useful for many commonly occurring distributions.
Sample size - The sample size is simply the number of individuals, out of the total population, on which measurements are made. In most instances, the larger the sample size, the better the sample represents the total population and the better the estimates. One must be very careful with small sample sizes or non-random samples, because the calculated results may not be very reliable. In our example, we will use six students. This is a pretty small sample; twenty would have been better.
Average - The average (sometimes called the mean) of a group of measurements is the best estimate of the true average for the entire population. The average is sometimes referred to as a central value, meaning it lies near the center of the distribution. It is calculated by adding up the measurements and dividing by the number of measurements. (Note: There are two more central values sometimes used instead of the average. They are the median, which is the middle measurement, and the mode, which is the top of the distribution curve. They usually are not as useful as the average.)
Example: We have a class of 100 students and wish to estimate the average weight of the class, so we weigh six randomly selected students. The measured weights of the six students are 66, 75, 80, 85, 90, 96 pounds. The sum of their weights is 492 and the average is 492/6 = 82 pounds. This is the best estimate of the average weight of the class we can make from these measurements.
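The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above; the variable names are chosen here for illustration.

```python
# Weights (pounds) of the six randomly selected students from the example.
weights = [66, 75, 80, 85, 90, 96]

# Add up the measurements and divide by the number of measurements.
total = sum(weights)
average = total / len(weights)

print(f"sum = {total} pounds, average = {average} pounds")
```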
Standard Deviation - The standard deviation of a distribution is an estimate of the variation or scatter in the measured data. It may be due to natural variation or random error or both. About 2/3 of the data points are expected to lie within one standard deviation of the average. It is calculated by multiplying each deviation (the measurement minus the average) by itself and then adding up all the products, which gives the sum of squared deviations. Dividing this sum by one less than the number of deviations (measurements) gives the variance; the square root of the variance is the standard deviation. Note: the square root of a number multiplied by itself gives the original number.
For the example, the sum of the squared deviations is 578, the variance is 578/5 = 115.6, and the standard deviation is the square root of 115.6, or 10.75 pounds.
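The steps described above (square each deviation, add the squares, divide by one less than the number of measurements, take the square root) can be followed in a short Python sketch. The variable names are illustrative; the last line cross-checks the hand calculation against the standard library's sample standard deviation.

```python
import math
import statistics

weights = [66, 75, 80, 85, 90, 96]
average = sum(weights) / len(weights)

# Each deviation (measurement minus average) multiplied by itself.
squared_deviations = [(w - average) ** 2 for w in weights]

# Sum of squared deviations, then divide by one less than the sample size.
sum_sq = sum(squared_deviations)
variance = sum_sq / (len(weights) - 1)

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(f"sum of squares = {sum_sq}, variance = {variance:.1f}, "
      f"standard deviation = {std_dev:.2f} pounds")

# Cross-check: statistics.stdev uses the same "n - 1" formula.
assert math.isclose(std_dev, statistics.stdev(weights))
```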
The variance includes the variability of the population and any errors in the measurement method. Making duplicate measurements on the same sample allows one to eliminate this error but not the natural variability. The variance can be used to calculate the reliability of the average and other numbers.
Distribution Percentiles - The n-th percentile value of a distribution is a value such that n percent of a large set of measurements are less than the percentile value. The percentile can be determined from the actual measurements or calculated by assuming the shape of the distribution, usually normal. For our example, the 90-th percentile is about 93 pounds if interpolated directly from the six measurements, and about 95.8 pounds if calculated from a normal curve with the sample's average and standard deviation. Upper and lower percentile limits can define a range that includes a specified percent of the measurements or of the assumed distribution.
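The two routes to a percentile mentioned above, from the actual measurements or from an assumed normal shape, can be sketched in Python. The interpolation rule used for the first route (the "inclusive" method of the standard library) is an assumption made here; other interpolation rules give slightly different answers for a sample this small. Under that assumption the first route gives 93 pounds, and the normal-curve route gives about 95.8 pounds, the value listed with the figure at the end of this document.

```python
from statistics import NormalDist, mean, quantiles, stdev

weights = [66, 75, 80, 85, 90, 96]

# Route 1: directly from the measurements.
# quantiles(..., n=10) returns the nine cut points at 10%, 20%, ..., 90%;
# method="inclusive" interpolates between the sorted data points.
empirical_p90 = quantiles(weights, n=10, method="inclusive")[-1]

# Route 2: assume the weights follow a normal curve with the sample's
# average and standard deviation, and read the 90% point off that curve.
fitted_curve = NormalDist(mu=mean(weights), sigma=stdev(weights))
normal_p90 = fitted_curve.inv_cdf(0.90)

print(f"90-th percentile from the data:         {empirical_p90:.1f} pounds")
print(f"90-th percentile from the normal curve: {normal_p90:.1f} pounds")
```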
Confidence Interval - The n-percent confidence interval is a range around the average such that there is an n-percent expectation (or confidence probability) that, on repeated sampling, the interval will contain the true average. It has an upper and a lower limit. The larger the percentage, the wider the interval. The confidence intervals usually given are the 95% and 90% intervals. For our example, the calculated 90% confidence interval is 74.8 to 89.2 pounds; there is a 90% chance that the true average weight of the students is between 74.8 and 89.2 pounds. Another way to look at it is that there is a 5% chance that the true average lies below 74.8 and an equal chance that it lies above 89.2. Any statistical claim that is not supported by a confidence analysis and statement should be considered incomplete. If the sample size is small, the results should be considered tentative.
Comparisons of Averages - There are sophisticated tests for the comparison of averages, but for the lay person, if either average lies within the 90% or 95% confidence interval of the other, the results can be considered pretty much the same for most purposes.
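The following Python sketch shows one way a 90% confidence interval for the average can be calculated. It assumes a normal-curve multiplier (about 1.645 for 90%), which appears to reproduce the 74.8 to 89.2 pound interval quoted for the example; that choice is an assumption, not a statement of the method actually used. A Student's t multiplier, often preferred for samples this small, would give a somewhat wider interval.

```python
import math
from statistics import NormalDist, mean, stdev

weights = [66, 75, 80, 85, 90, 96]
confidence = 0.90

average = mean(weights)
# Standard error of the average: standard deviation / sqrt(sample size).
std_error = stdev(weights) / math.sqrt(len(weights))

# Two-sided multiplier from the standard normal curve (assumed here);
# for 90% confidence this leaves 5% in each tail.
z = NormalDist().inv_cdf(0.5 + confidence / 2)

lower = average - z * std_error
upper = average + z * std_error

print(f"{confidence:.0%} confidence interval: {lower:.1f} to {upper:.1f} pounds")
```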
Simple Epidemiology
Epidemiology is simply the comparison of one population against a presumably different population to determine if there are meaningful differences. Everything discussed above under simple statistics is applicable to each population separately. In addition, the Prevalence Rates of some attribute, such as a disease, are compared between the two populations.
As an illustration of how this is done, we examine the following numbers of sick and well individuals for two populations, one exposed, one not:
[Table of sick and well counts for the exposed and unexposed populations; the values were not preserved in this copy.]
The Prevalence Rate is the number of sick individuals divided by the total size of the test population. The Prevalence Rate Ratio (PRR) is defined as the Prevalence Rate (PR) of the exposed population divided by the PR of the unexposed population. This should be significantly greater than 1 if the exposure caused the disease. For our case, the PRR = 0.375/0.351 = 1.07. The exposed population has a slightly higher prevalence rate, but the real question is whether it is significantly larger or the difference is just due to chance.
A confidence interval for the PRR can be calculated. For our case the 95% confidence interval is 0.5 to 2.6. This says that there is a 95% chance that the true PRR lies in this interval. The interval includes 1, so the result is not significant at the 95% level, which is the usual level used for judgment.
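As a rough sketch of how a PRR and its confidence interval might be calculated, the Python function below uses the usual large-sample log-transform approximation for a ratio of two proportions. Both the function name and the counts are hypothetical: the counts are invented for illustration and are not the values from the table in this document, so the interval they produce will not match the 0.5 to 2.6 quoted above.

```python
import math
from statistics import NormalDist

def prevalence_rate_ratio(sick_exp, total_exp, sick_unexp, total_unexp,
                          confidence=0.95):
    """Return (PRR, lower limit, upper limit) for two populations."""
    pr_exposed = sick_exp / total_exp          # prevalence rate, exposed
    pr_unexposed = sick_unexp / total_unexp    # prevalence rate, unexposed
    prr = pr_exposed / pr_unexposed

    # Approximate standard error of ln(PRR) (large-sample approximation).
    se_log = math.sqrt(1 / sick_exp - 1 / total_exp
                       + 1 / sick_unexp - 1 / total_unexp)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)

    lower = math.exp(math.log(prr) - z * se_log)
    upper = math.exp(math.log(prr) + z * se_log)
    return prr, lower, upper

# Hypothetical counts, NOT the original table:
# 30 sick out of 80 exposed, 35 sick out of 100 unexposed.
prr, lower, upper = prevalence_rate_ratio(30, 80, 35, 100)

# The result is judged significant only if the interval excludes 1.
significant = not (lower <= 1.0 <= upper)
print(f"PRR = {prr:.2f}, 95% interval = {lower:.2f} to {upper:.2f}, "
      f"significant: {significant}")
```

With these made-up counts the interval includes 1, so, as in the example in the text, the slightly higher prevalence rate in the exposed group would not be judged significant.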
The above sounds simple, but there is a booby trap: the two populations must be exactly the same except for the exposure being examined. This is hard to do. All manner of things vary: ages, personal habits, home environment, smoking, and so forth. These are called confounders, and they make an epidemiologist's life difficult indeed. Careful design of a study can help reduce the effects of confounders.
Confounders - A confounder is any circumstance, other than the desired exposure, that makes one population different from another. Confounders that are related to the exposure rate are particularly bad. Some confounders are known and corrections can be made, but if the effect of the confounder, such as smoking, is larger than the effect under study, then it still reduces the validity of the study. Unknown confounders, and they can be very subtle, can often completely distort the results of a study. Many epidemiologists view any single study with a PRR less than 2 to be automatically suspect until confirmed by independent means. The range of 2 to 4 is regarded as possibly significant but in need of confirmation. The range of 4 and up is usually regarded as significant if nothing looks suspicious and there is no conflicting evidence. At any level, confirmation is desirable before the panic button is pushed.
Environmental Clusters - An environmental cluster is a group of sick persons who live in close proximity to each other in one neighborhood. A great deal of effort has been directed toward finding an environmental cause for many environmental clusters. Very seldom has a cause been found. The reason is that while the cluster may appear large to the lay person, it can usually be accounted for as a local variation well within the predicted statistics. A disease like cancer, which causes about 33% of all deaths, can easily show up as a cluster in a neighborhood even if there is no local cause. The same is not true for occupational clusters, where doses are bigger and causes are found.
Dose Reconstruction
Dose reconstruction is the process of estimating the past dose of a contaminant by examining the old data and postulating an exposure model. A dose reconstruction should present the average dose as well as the upper and lower confidence limits of the reconstruction. The estimated doses will contain all the known natural variability of the input data and also the variability due to all the assumptions that need to be made, as well as the unknown aspects of the dose reconstruction model itself. For these reasons a dose reconstruction usually contains a wide margin of error, and its results should be judged accordingly. A dose reconstruction that makes unnecessarily conservative assumptions, or does not properly account for current levels of the contaminant or for other current observations, should not be taken too seriously.

The above graph shows the normal distribution curve, which describes the data used in the example. The data used was:
Student Weights: 66, 75, 80, 85, 90, 96 pounds
Sample average = 82
Standard Deviation = 10.75
90-th Percentile = 95.8
90% Confidence Interval for the Average = 74.8 to 89.2
The curve was calculated from the average and the standard deviation of the data. The 90-th Percentile was calculated from the curve, and the 90% Confidence Interval for the Average was calculated from the data. Note that 90% of the area under the curve lies to the left of the 90-th Percentile line. If repeated samples were taken, one would expect about 90% of the confidence intervals calculated this way to contain the true average.