Statistics

Statistics tends to be an imposing topic for many medical students. While this page won't even go so far as to cover the material you'd expect in a basic statistics course, it will introduce you to the underlying concepts and terms that you need to know.

Be aware that an important and often discussed area of medicine and public health which involves statistics is epidemiology, which we cover on a different page.

When we analyze something, our assumption is that we are trying to determine if something is true. Unfortunately, proving something as absolutely true is almost impossible - what we settle for in science is the absence of falsehood. This, in fact, is what separates scientists from people who claim to know the absolute truth - what we claim has to be stated in such a way that there are available methods to prove we might be wrong. In other words, if someone makes a statement which has no way of being *invalidated*, or proven to be false, then it isn't science.

The culture of science is one that tends to be aggressive in searching for error. When something appears to be true, scientists' initial instinct is to expose faulty assumptions, to account for the apparent result in some way other than the obvious one, to assume that what appears to be true is, in fact, false.

If I take an aspirin and then perform well on a biochemistry test, it may appear "obvious" that the aspirin improved my performance. A scientist would ask if, in fact, one caused the other, or if it was more likely to be a matter of coincidence. What does it mean to perform "well" on an exam? How would I have performed without the aspirin? Will I consistently perform better *with* the aspirin than without it? Does it only help if I started off with a headache? Can something else account for my performance - increased studying, better sleep, or simply my belief that aspirin improves grades?

There are generally considered to be 4 sources of error. These need to be "ruled out" before a scientist, grudgingly, accepts that an apparent effect is *probably* not due to error, and *may*, therefore, be true.

These sources are:

- Noise: An error in measurement which fluctuates in an unpredictable or random way. It does not favor any particular result. Fortunately, with a large enough "*n*," or sample size, noise tends to cancel itself out. The more tests I take after aspirin, the more likely I am to also factor in poor performances, as well as more average performances. If I also take a large number of tests *without* aspirin, I may eventually notice that my average performance *without* aspirin is *about* the same as my average performance *with* aspirin. How many tests are enough? What's the best way to really test my theory of magical aspirin? The expert in this case is the statistician.
- Bias: An error in the measurement or interpretation of data that systematically favors one result. In our aspirin experiment, we might ask people to mail us reports of their scores. Subjects may be too embarrassed to report low scores, so we might be misled into thinking that people do better than they actually do.
- Confounding: This is a cross between signal (truth) and noise - in other words, being right for the wrong reasons. Sometimes it appears that A causes B. Confounding occurs when both are associated with another variable (C, the confounder), which is really the cause. It could be that I really do consistently perform better when I take an aspirin, because on mornings when I've been studying for days at a time, I have a headache, so I take an aspirin. In this case, the studying is probably the real reason for my good grades, although it appears that it's the aspirin - the studying is the confounder. How do you avoid confounding? The experts on this are statisticians and epidemiologists.
- Fraud: This occurs when someone intentionally distorts data to support their position. If I do the big aspirin experiment and it turns out to have no effect, but I only report the instances when I see an improvement, that's fraud. How can you be sure the papers you read aren't fraudulent? You can't be certain, but peer review and reproducibility are the best solutions we currently have.

"** p**" is the way we describe how likely it is that an effect is being caused by some sort of error. In other words: How often does random error produce an apparent effect as big as the one we are seeing? The answer is "p." The smaller the "p," the more likely that the effect is systematic and not random.

Regardless of the size of p, the possibility of error still remains - of which there are two major types:

- Type 1 **(False Positives)**: These lead you to believe that there is a true effect when there really isn't. The acceptable likelihood of this happening is set in advance and is referred to in this context as alpha.
- Type 2 **(False Negatives)**: These lead you to believe there is no effect, although there really is one. The likelihood of this occurring is represented by beta; a study's "power," its ability to detect a true effect, is 1 - beta.

The rule of thumb is that something is accepted to be "**statistically significant**" if the **p** value is less than 5% - written as p < .05. While this number is arbitrary (how comfortable would you be knowing that the odds were "less than 5%" that your plane would crash?), it tends to be a workable threshold for many purposes. Obviously, the lower the p value, the better.

Be certain not to confuse statistical significance with **clinical significance**. A drug shown to decrease the average length of a migraine by 30 seconds (a small effect) *consistently* - so consistently that the odds are less than 5% that the decrease is due to chance - may still have little to no real benefit for patients or their conditions.

**Mean**: Mathematical "average." Calculated by taking the sum of all the values and dividing by the number of values. The mean of 0, 1, 1, 1, 2, 3, 3, 4, 5, 9, 70 is 99/11, or 9. Can be highly influenced by "outliers," such as the 70 in our example.

Median | The "middle" number. Calculated by listing all values in numerical order and picking the one in the middle of the list. The median of 0, 1, 1, 1, 2, 3, 3, 4, 5, 9, 70 is 3. Is not influenced by outliers. |

Mode | The "most popular" number. Calculated by listing all values and seeing which value gets listed the most times. The mode of 0, 1, 1, 1, 2, 3, 3, 4, 5, 9, 70 is 1. |

Mean, median, and mode are all "measures of central tendency," meaning that they describe something common about a group of numbers. They are often combined with information about how those numbers are distributed (using either standard deviation or variance). Knowing just two numbers - a measure of central tendency and a measure of dispersion - can capture much that is meaningful about a large collection of numbers. For this reason, results are often reported in journals in the format 100(15), where 100 is the mean and 15 is the standard deviation.
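All three measures can be checked against the example list above with Python's standard `statistics` module:

```python
import statistics

# The example list used in the definitions above.
scores = [0, 1, 1, 1, 2, 3, 3, 4, 5, 9, 70]

print(statistics.mean(scores))    # 9 -- pulled upward by the outlier, 70
print(statistics.median(scores))  # 3 -- the middle of the sorted list
print(statistics.mode(scores))    # 1 -- the most frequent value
print(statistics.stdev(scores))   # sample standard deviation - large because of the outlier
```

Notice how far the mean (9) sits from the median (3): a single outlier is enough to pull it away from where most of the data actually lie.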

**Standard Deviation**: Assuming a "bell-shaped" (normal) distribution, what range of values is close enough to the mean as to be statistically indistinguishable? In other words, how far from the mean can a value be and still not be considered an "outlier?" In a bell-shaped curve, about 68% of all results lie within a range that extends from one standard deviation above the mean to one standard deviation below the mean, and about 95% fall within two standard deviations in either direction. 100(15) is the standard for IQ measurement, so about 68% of the population has an IQ between 85 and 115, and about 95% falls between 70 and 130 - this is where the definitions of mental retardation (IQ < 70) and genius (IQ > 130) originate - these scores are sufficiently unlike the rest of the population as to warrant attention. Think of standard deviation as simply a number that describes "how spread out" a set of numbers is. Scores with higher standard deviations (or higher variance, which is simply the standard deviation squared) are more dispersed. If the standard deviation of the IQ test were 20, then a score of 65 would no longer be all that abnormal, since it would be within the expected spread.
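Assuming a normal distribution, the one- and two-standard-deviation coverage figures for the IQ example can be verified with Python's `statistics.NormalDist`:

```python
from statistics import NormalDist

# The conventional IQ scale: mean 100, standard deviation 15.
iq = NormalDist(mu=100, sigma=15)

# Fraction of the population within one and two standard deviations of the mean:
within_1_sd = iq.cdf(115) - iq.cdf(85)
within_2_sd = iq.cdf(130) - iq.cdf(70)
print(f"within 1 SD (85-115): {within_1_sd:.1%}")  # about 68.3%
print(f"within 2 SD (70-130): {within_2_sd:.1%}")  # about 95.4%

# If the SD were 20 instead, a score of 65 would be less than two SDs below
# the mean, and so no longer particularly unusual:
wider = NormalDist(mu=100, sigma=20)
print(f"P(IQ < 65) with SD 20: {wider.cdf(65):.1%}")
```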

**t-Test**: This is probably the most basic statistical test you should know (other than mean, median, and mode). A t-test is used to compare one group or population to another - this form of comparison is referred to as "between groups." Usually one group is the experimental group (the one receiving treatment, for example) and one is the control (a similar group not receiving treatment). An example of an experiment that might use a t-test is a comparison of a new beta-blocker to labetalol in reducing heart attacks.
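The t statistic itself is just the difference between the group means divided by the standard error of that difference. Here is a minimal sketch (Welch's version, which does not assume equal variances) with invented data; a real analysis would also compute degrees of freedom and a p value, typically with a library such as `scipy.stats.ttest_ind`:

```python
import statistics
from math import sqrt

def welch_t(a, b):
    """Welch's two-sample t statistic: the difference in group means
    divided by the standard error of that difference."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    return (statistics.mean(a) - statistics.mean(b)) / sqrt(va / len(a) + vb / len(b))

# Hypothetical data: days of symptoms in a treated vs. an untreated group.
treated   = [4, 5, 3, 6, 4, 5, 4]
untreated = [7, 6, 8, 7, 9, 6, 8]

# A large |t| means the gap between the means is big relative to the noise
# within each group - evidence that the difference is not just chance.
print(round(welch_t(treated, untreated), 2))  # -5.11
```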

**Z-Test**: This test is a variation on the t-test - in the Z-test, more is known about the statistical properties (such as the means and standard deviations) of each population. Again, this is a "between groups" statistic. t-tests and Z-tests are used to compare exactly two groups - when more than two groups need to be compared, ANOVA is used.

**ANOVA**: The same principles as the t-test (and Z-test) apply to this test, which is employed when more than two groups need to be compared. ANOVA (ANalysis Of VAriance) is a more complicated application of between-groups testing. For example, it might be used to compare the efficacy of 4 beta-blockers in preventing heart attacks.

**Chi-Square**: This test is used to analyze data organized in terms of frequency - how often does an event occur? It answers the question of whether an observed frequency differs from what is expected. It can be used as either a within-groups or a between-groups test.
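The chi-square statistic is simple enough to compute by hand: for each category, take the squared gap between the observed and expected counts, divide by the expected count, and sum. A sketch with invented frequencies (a complete test would then compare the statistic against a chi-square distribution, e.g. via `scipy.stats.chisquare`):

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: sum over categories of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical frequencies: headaches reported per season among 100 patients.
# If season made no difference, we would expect 25 in each of the 4 seasons.
observed = [35, 20, 15, 30]
expected = [25, 25, 25, 25]

stat = chi_square_stat(observed, expected)
print(stat)  # 10.0 -- looked up against a chi-square table with 3 degrees of freedom
```

The larger the statistic, the more the observed frequencies deviate from what chance alone would predict.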