## Statistics can lie, but Jack Good never did — a personal essay in tribute to I J Good, 1916-2009

Posted by Henry Bauer on 2009/04/12

A few years ago, as I was looking into HIV/AIDS data, I came across a claim that certain sets of “HIV” and “AIDS” numbers were correlated. The claim was presented by an image with shadings for the respective numbers, and the shadings do look similar. However, using the actual numbers that were also in the image, I calculated each ratio of “HIV” to “AIDS” and found that the ratios looked more like a random set than like a constant. So I stuck the numbers into an EXCEL worksheet and used the inbuilt CORREL function to derive the correlation coefficient. Lo and behold, it came out as 0.88, which represents — or should represent — a very respectable degree of correlation. Despite that, the set of ratios doesn’t look like “HIV” is correlated with “AIDS”.

For some three decades, I had the privilege and pleasure of regular visits with Jack Good. The next time I saw him after this conundrum about correlation, I told him about it. Drawing on his prodigious memory, he pointed to something he had written, which the index to his publications revealed as #792 (out of more than 2000): “Correlation for power functions”, *Biometrics*, 28 (Dec. 1972) 1127-1129. This remarks that it’s rather well known that the usual (product-moment) correlation coefficient is an inappropriate measure when the relation between two variables is not linear; and that the presumption is common, that “inappropriate” means that if the calculation is nevertheless done, the result will be a small value for the correlation coefficient, indicating lack of correlation. To the contrary, Good showed that the (usual, product-moment) correlation coefficient between x and x² (x squared), or x³ (x cubed), etc., is close to 1 (typically >0.95).
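Good's point is easy to check numerically. The following is a minimal sketch, not Good's own calculation: the data (the integers 1 to 100) and the helper name `pearson` are assumptions chosen purely for illustration, and the exact coefficients depend on the range of x used.

```python
import math

def pearson(xs, ys):
    """Product-moment (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data (an assumption, not taken from Good's paper): x = 1..100.
xs = list(range(1, 101))
squares = [x ** 2 for x in xs]
cubes = [x ** 3 for x in xs]

r_square = pearson(xs, squares)
r_cube = pearson(xs, cubes)

# The ratio x^2 / x is just x, so it runs from 1 to 100: nowhere near constant.
ratios = [y / x for x, y in zip(xs, squares)]

print(f"r(x, x^2) = {r_square:.3f}")
print(f"r(x, x^3) = {r_cube:.3f}")
print(f"x^2 / x ranges from {min(ratios):.0f} to {max(ratios):.0f}")
```

On this particular range both coefficients come out above 0.9 even though neither relation is linear and the ratio of the two variables varies a hundredfold.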

In common parlance, by two things being “correlated” we mean that when one changes, the other changes in the same direction *and in the same proportion*; in other words, that the correlation is linear and the ratio of the two variables is a constant. But “the” correlation coefficient that is most commonly used, and included in such packages as EXCEL as the primary correlation coefficient, measures only whether two variables change in the same direction.

Consequently, statements about “correlation” are very likely to be misunderstood, misinterpreted, by the media, by the public — and by an unfortunately large proportion of doctors and scientists who “do their own statistics” using software packages and standard formulas.

This example is merely one isolated case of the abiding, deep, highly important problem of interpreting what “statistics” is supposed to tell us. Many years of informal education by Jack Good have taught me about some basic pitfalls that seem unsuspected by far too many people who quote and use “statistical data” like the ubiquitous “p” values.

The HIV/AIDS literature, like so many others, is full of articles in which particular relationships are said to be so at “p < 0.05”, or “p < 0.01”, or even “p < 0.001” or less. The unwary reader sees “p < 0.001” and accepts that there’s only one chance in 1000 that the claimed relationship is spurious. That’s an incorrect and misleading interpretation.

In the social sciences, the typical cut-off for “statistically significant” is “p < 0.05”, which is commonly given the interpretation that there’s less than 5 chances in 100, less than 1 in 20, that the claimed relationship is spurious, wrong, doesn’t exist. An apparently better interpretation is to emphasize that 1 in every 20 of such claimed relationships doesn’t really exist; of every 20 such claims made, 1 is wrong. But in truth, the chance is far greater than 1/20 that the claimed relationship at “p < 0.05” is wrong, simply doesn’t exist.

Those “p” values stem from an approach credited to, created by, the great early statistician R A Fisher, and it’s sometimes called “Fisherian”, though more often “frequentist”. That “p” stands for “probability”, and one of the things I learned from Jack Good is that the seemingly obvious meaning of “probability” is anything but clear, or obvious, or unambiguous. The frequentist meaning: toss a coin umpteen times, the probability that it comes up “heads” can be measured by counting the number of “heads” or “tails”. But that approach doesn’t cover such questions as, “What is the probability that God exists?”, which is a perfectly possible question with a perfectly intuitive meaning of “probability”.

Jack Good was one of the foremost pioneers in bringing into modern statistical applications the approach credited to the 18th-century Reverend Thomas Bayes. At any given moment, available evidence allows a judgment to be made about how probable the thing of interest is; that’s the “prior probability”, and it’s unashamedly subjective: different individuals can differ over what its value is, between 0 and 1. However, as experiments are done or observations made, evidence accumulates, and the prior probability is modified by a “Bayes Factor” that expresses the “weight of evidence” regarding the thing of interest. When sufficient evidence can be amassed, it doesn’t matter what prior probability one began with; the calculated values will converge to whatever the “true” probability is.
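A toy illustration of that convergence, as a sketch under stated assumptions rather than anything Jack Good wrote: there are only two hypotheses (a coin that lands heads 70% of the time versus a fair coin), the 200 tosses are made up, and the names `BF_HEAD`, `BF_TAIL` and `posterior_after` are hypothetical. Each toss multiplies the odds by its Bayes factor, so posterior odds = prior odds × Bayes factor.

```python
# Hypothetical setup: H = "coin lands heads 70% of the time", alternative = "coin is fair".
BF_HEAD = 0.7 / 0.5  # Bayes factor in favour of H contributed by one head
BF_TAIL = 0.3 / 0.5  # Bayes factor in favour of H contributed by one tail

# Made-up evidence: 200 tosses with 140 heads (7 heads then 3 tails, repeated 20 times).
tosses = ("H" * 7 + "T" * 3) * 20

def posterior_after(prior_prob, data):
    """Posterior probability of H after updating the prior odds toss by toss."""
    odds = prior_prob / (1.0 - prior_prob)
    for toss in data:
        odds *= BF_HEAD if toss == "H" else BF_TAIL
    return odds / (1.0 + odds)

sceptic = posterior_after(0.1, tosses)   # started out doubting H
believer = posterior_after(0.9, tosses)  # started out favouring H

print(f"sceptic:  prior 0.1 -> posterior {sceptic:.6f}")
print(f"believer: prior 0.9 -> posterior {believer:.6f}")
```

The two observers begin far apart and end up in near agreement, because the accumulated Bayes factor swamps the difference between their priors.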

The Bayesian approach is more technically demanding than the Fisherian, and the latter offers those ready-made software packages and formulas. But it’s not sufficiently recognized just how fallible the Fisherian approach really is and how misleading it can be.

The citing of “p” values is typically done to establish a particular hypothesis as “statistically significant”. But that’s not what the calculation means. It actually measures the probability that the “null hypothesis” is true, the null hypothesis being that the claimed relationship does not exist. If the “p” value is small, the null hypothesis is unlikely to be true, so you claim that your hypothesis is correspondingly likely to be correct.

There are several problems with this. The most obvious, and trivial, is that the choice of cut-off for “statistical significance” is arbitrary. Social science typically uses “p < 0.05”. The harder sciences demand <0.01 or less, depending on the particular situation. What must *NEVER* be done is to equate “statistically significant” with “true”; but, of course, that’s exactly what **is** done in just about every public dissemination of “statistical facts”; and it’s implied as well in all too many research publications. One of Jack Good’s persistent campaigns was against the assignment of “p = 0” or “p = 1” to anything within the ken of human beings.

A worse problem with “p” values is their reliance on the null hypothesis: testing what one isn’t interested in instead of what one *is* interested in. Why is this done? Because it’s easy. The “normal distribution”, the “bell curve”, the Gaussian distribution, expresses the distribution of some measure around its average value when deviations from the average are owing **purely to chance**, when they occur randomly. For example, toss a penny 100 times, and most often you will NOT get 50 heads and 50 tails; but you can calculate exactly how likely you are “by chance” to get 49 H and 51 T, or 100 H, or whatever (provided, of course, the coin is perfectly balanced and not biased).

So an unspoken assumption is that the quantitative measure of the thing being investigated has a distribution like the normal curve. Some other distributions have also been studied, in particular the asymmetrical Poisson distribution, but this doesn’t help with the basic problem: If you want to use a frequentist or Fisherian approach, you need to know beforehand how the possible values of the variable you want to measure are distributed; and you can’t know that. Thereby there’s an inherent uncertainty, an unreliability, built in, *but that’s not commonly recognized, let alone openly acknowledged*.
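The penny-tossing probabilities mentioned above can indeed be calculated exactly. A minimal sketch, assuming a perfectly fair coin and independent tosses (the function name `prob_heads` is just an illustrative choice):

```python
from math import comb

def prob_heads(k, n=100):
    """Probability of exactly k heads in n tosses of a fair coin."""
    # comb(n, k) outcomes have exactly k heads, out of 2**n equally likely outcomes.
    return comb(n, k) / 2 ** n

for k in (50, 49, 100):
    print(f"P({k} heads in 100 tosses) = {prob_heads(k):.3g}")
```

Even the single most likely outcome, 50 heads and 50 tails, turns up only about 8% of the time, which is why most runs of 100 tosses do not split evenly.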

Nor is that all. When the null hypothesis is taken to be disproved to some (subjective!) level of significance, the commonly drawn conclusion is that the hypothesis “being tested” is confirmed. *But that presumes this hypothesis to be the only alternative to the null hypothesis*; and there’s absolutely no warrant to assume that you thought of the only possible hypothesis capable of explaining the phenomenon you’re interested in, or the best one, the one most likely to be true.

So the Fisherian approach is beset by uncertainties, of which the most troubling are occult, hidden, not revealed when “p” values are cited. By contrast, the Bayesian approach places its subjective aspects in the open and up front, in the choice of a prior probability. Moreover, in calculating the Bayes Factor one is gauging probabilities *relative to one’s hypothesis*, and thereby one may be continually reminded that there might be other hypotheses equally or even better able to accommodate the data.

****************************

There are nowadays many expositions of Bayesian statistics and its superiority over the Fisherian because of the latter’s weaknesses. One exposition I found particularly readable is by R A J Matthews — “Facts versus Factions: The use and abuse of subjectivity in scientific research”, *European Science and Environment Forum Working Paper* (1998), reprinted (pp. 247–282) in J. Morris (ed.), *Rethinking Risk and the Precautionary Principle*, Butterworth, 2000. Matthews has also provided a concise overview of how misleading “p” values can be, increasingly so as the inherent (prior) probability of the thing you’re interested in differs from 50:50:

For example, if your initial belief is that there’s only 1 chance in 100 that something is true (moderate skepticism: an observation of it is a fluke 99 times out of 100), to establish it as real you need not a “p” value of 0.01 but of 0.000013 (1.3 × 10⁻⁵); in other words, p-values are a vastly insufficient criterion for estimating “statistical significance”.
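The dependence on the prior can be explored with a rough sketch. This is not Matthews’ own calculation and is not expected to reproduce his figure; it assumes a different, well-known calibration, the lower bound −e·p·ln(p) on the Bayes factor in favour of the null (Sellke, Bayarri and Berger), which holds only for p < 1/e and only under the assumptions of that bound. The function names are hypothetical.

```python
import math

def min_bayes_factor(p_value):
    """Lower bound -e * p * ln(p) on the Bayes factor in favour of the null (p < 1/e only)."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("bound applies only for 0 < p < 1/e")
    return -math.e * p_value * math.log(p_value)

def min_posterior_null(p_value, prior_null):
    """Smallest posterior probability of the null that is consistent with the bound."""
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = prior_odds * min_bayes_factor(p_value)
    return posterior_odds / (1.0 + posterior_odds)

# prior_null = 0.5 is the 50:50 case; 0.99 is "only 1 chance in 100 that it is real".
for prior_null in (0.5, 0.99):
    for p_value in (0.05, 0.01, 0.001):
        post = min_posterior_null(p_value, prior_null)
        print(f"prior P(null) = {prior_null}, p = {p_value}: P(null | data) >= {post:.3f}")
```

Under this bound, a result at p = 0.05 with 50:50 prior odds still leaves the null with a probability of at least roughly 0.29, far above 1 in 20, and a sceptical prior pushes that floor much higher.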

***************************

I began this essay because Jack Good had just died, and I’m going to miss him enormously. I learned so much from him, not only about matters of probability and statistics. But I’ll mention one more example of the latter. “The birthday problem” is a commonly used demonstration of how wrong are our untrained estimates of probability: how many people do you need to gather in order to have a 50:50 chance that two of them will have the same birthday (day and month, of course, not year)? The answer, which surprises most people, is 23. A usefully simple explanation is that *the number of pairs of people*, which is what matters, increases much more rapidly than the number of people. Jack once extended that to the perhaps even more surprising numbers needed to have a 50:50 chance of 3 people with the same birthday (83), or 4 (170), or more; he gave a formula for calculating the result for any given situation (Good’s publication #2323, “Individual and global coincidences and a generalized birthday problem”, *J. Statist. Comp. & Simul.*, 72 [2002] 18-21).
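The two-person case is simple to verify. The sketch below covers only pairs, not Good’s generalized formula, and assumes 365 equally likely birthdays with no leap day; the function names are illustrative.

```python
def prob_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday."""
    prob_all_distinct = 1.0
    for i in range(n):
        # Person i+1 must avoid the i birthdays already taken.
        prob_all_distinct *= (days - i) / days
    return 1.0 - prob_all_distinct

def smallest_group(target=0.5, days=365):
    """Smallest group size whose chance of a shared birthday reaches the target."""
    n = 1
    while prob_shared_birthday(n, days) < target:
        n += 1
    return n

n = smallest_group()
pairs = n * (n - 1) // 2  # number of distinct pairs among n people

print(f"{n} people form {pairs} pairs; P(shared birthday) = {prob_shared_birthday(n):.3f}")
```

With 23 people there are already 253 distinct pairs, which is what makes a coincidence more likely than not.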

A concise but accurate, though understated, bio-obituary of Jack Good is at the Virginia Tech website. What you won’t find there is that this genius, utterly devoted to the life of the mind, was the best possible company, interested in and fascinating about everything under the sun, endlessly witty, able to find humor everywhere and to express everything humorously; a most cultured, erudite person, and also as gentle and civilized and courteous and without malice as one could ever find. It is an extraordinary blessing and gift to have known him.

## Darin said

“It actually measures the probability that the “null hypothesis” is true, the null hypothesis being that the claimed relationship does not exist.”

In frequentist statistics, the parameters are always considered fixed, so the “probability” the null hypothesis is true is either 0 or 1. (Only Bayesians consider parameters as random quantities.) The p-value is the probability that a result at least as extreme as the one actually observed would be observed, under the assumption that the null hypothesis is true.

This is why Bayesians consider frequentist hypothesis testing dubious: We really want to know P(Null|Data observed), when frequentist hypothesis testing actually gives us P(Data at least as extreme as observed|Null), something quite different. There are even cases where the former approaches unity as the latter approaches vanishing.
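To make that definition concrete, here is a minimal sketch of an exact p-value for a coin-tossing experiment. The observation (60 heads in 100 tosses) is hypothetical, the null hypothesis is a fair coin, and `p_value_at_least` is an illustrative name.

```python
from math import comb

def p_value_at_least(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2): the one-sided p-value under a fair-coin null."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

# Hypothetical observation: 60 heads in 100 tosses.
p_one_sided = p_value_at_least(60, 100)
# The fair-coin null is symmetric, so "at least as extreme" in either direction doubles the tail.
p_two_sided = min(1.0, 2 * p_one_sided)

print(f"P(at least 60 heads | fair coin) = {p_one_sided:.4f}")
print(f"two-sided p-value                = {p_two_sided:.4f}")
```

Both numbers are P(Data at least as extreme as observed|Null). Neither says anything directly about P(Null|Data observed); getting there requires a prior.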

“If you want to use a frequentist or Fisherian approach, you need to know beforehand how the possible values of the variable you want to measure are distributed; and you can’t know that.”

This is only true for so-called “parametric” hypothesis tests. But there are non-parametric frequentist tests as well, that don’t assume a type of distribution.
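One example of such a test is an exact permutation test, sketched here with made-up measurements and a hypothetical function name. It assumes no particular distribution shape, only that under the null hypothesis the group labels are interchangeable.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    at_least_as_extreme = 0
    n_splits = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
        n_splits += 1
    return at_least_as_extreme / n_splits

# Made-up measurements for two small groups.
treated = [12, 15, 14, 16]
control = [8, 9, 11, 10]
p = permutation_p_value(treated, control)
print(f"exact permutation p-value = {p:.4f}")
```

Only 2 of the 70 possible relabellings give a difference as large as the one observed, so the p-value is 2/70. Full enumeration is practical only for small samples; larger ones are usually handled by random resampling.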

## Henry Bauer said

Darin:

Thanks for the corrections and extensions.