I was (willingly) dragged into a discussion about “proving the null hypothesis” that I have to include here. It will end up being three posts, since there are three different issues to address (basic theory, power, and equivalence). The first step is to discuss the theory of hypothesis testing, what it is and what it isn’t, because it’s fundamental to understanding the problem of providing evidence to support the null.
Hypothesis testing is confusing in part because the logical basis on which it rests is rarely spelled out: it’s a probabilistic form of proof by contradiction. For example, if you want to show that a treatment has an effect, you start by assuming there is no treatment effect; this is the null hypothesis. You assume the null and use it to calculate a p-value (the probability of measuring a treatment effect at least as strong as the one observed, given that there is no treatment effect). A small p-value says the observed data would be unlikely if the null were true, and that plays the role of the contradiction. “Proof”, here, is used loosely: it means evidence strong enough to cast doubt on the null.
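To make “assume the null, then compute a p-value” concrete, here is a minimal sketch using a permutation test. It is one way to carry out the calculation, not the only one. The data are made up for illustration, and the function name `permutation_p_value` is hypothetical. The key assumption is the null itself: if the treatment does nothing, the group labels are arbitrary, so shuffling them shows what differences would arise by chance alone.

```python
import random


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Estimate P(difference in means >= observed | no treatment effect)."""
    rng = random.Random(seed)
    n_treated = len(treated)
    n_control = len(control)
    observed = sum(treated) / n_treated - sum(control) / n_control

    # Under the null, labels are exchangeable, so pool the measurements.
    pooled = list(treated) + list(control)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        fake_treated = pooled[:n_treated]
        fake_control = pooled[n_treated:]
        diff = sum(fake_treated) / n_treated - sum(fake_control) / n_control
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_shuffles + 1)


# Made-up measurements, for illustration only.
treated = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7, 6.4, 6.0]
control = [4.9, 5.2, 4.6, 5.5, 5.0, 4.8, 5.3, 5.1]

p = permutation_p_value(treated, control)
print(f"one-sided p-value under the null: {p:.4f}")
```

Every step inside the function takes the null as given. A small result casts doubt on that assumption, while a large one only means the assumption survived this particular attempt to contradict it.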
The p-value is calculated on the assumption that the null hypothesis is true. Trying to prove the null with a p-value therefore means trying to prove it’s true based on the assumption that it’s true. That is circular: we can’t prove the null when we have already assumed it. The idea of a hypothesis test is to assume the null is true, then use that assumption to build a contradiction against it being true.
Absence of evidence
No conclusion can be drawn if you fail to build a contradiction. Another way to think of this is to remember that the p-value measures evidence against the null, not for it. A lack of evidence to reject the null therefore does not amount to evidence in support of it. Absence of evidence is not evidence of absence. Some would like to believe that the inability to reject the null suggests the null may be true (and they try to support this claim with large sample sizes, or high power, which I’ll address in a subsequent post).
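A small simulation can illustrate why failing to reject says so little. The sketch below builds experiments in which the null is false by construction, then counts how often a test still fails to reject it. Everything here is an assumption chosen for illustration: normally distributed outcomes with a known standard deviation of 1, a two-sided z-test at the 0.05 level, a true effect of 0.3, and 10 subjects per group. The function name `fraction_not_rejected` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def fraction_not_rejected(true_effect, n_per_group, n_experiments=2000,
                          alpha=0.05, seed=1):
    """Share of simulated experiments in which the null is NOT rejected."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of a difference in means when each group has sd = 1.
    std_error = sqrt(2 / n_per_group)

    not_rejected = 0
    for _ in range(n_experiments):
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        z = diff / std_error
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided z-test
        if p_value >= alpha:
            not_rejected += 1
    return not_rejected / n_experiments


# The treatment really works here (effect = 0.3), yet most runs fail to reject.
share = fraction_not_rejected(true_effect=0.3, n_per_group=10)
print(f"null is false, but not rejected in {share:.0%} of experiments")
```

Under these assumptions the test fails to reject in most of the simulated experiments, even though every one of them has a real effect. A non-significant result from a single such experiment therefore cannot distinguish “no effect” from “an effect this test did not detect”.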
Failing to reject the null is a weak outcome, and that’s the point. It’s no better than failing to reject the innumerable models that were not tested. Although the null and alternative hypotheses form a dichotomy (exactly one of them is true), they sit on top of a parameter space and divide it between them. The alternative is the complement of the region defined by the null, that is, the parameter space minus the null.
In the context of treatment effects, the null is no treatment effect, which is a single point in the parameter space. The alternative, some degree of treatment effect, is the complement: every point in the parameter space except the null. If you wanted to use the theory of hypothesis testing in this way to “prove” the null, you would have to reject the probability model at every point in the alternative, and there are infinitely many of them! Even if you could justify restricting attention to a finite number of probability models, each of some practical significance, it should be clear that it’s not just a matter of failing to reject the null.
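The parameter-space argument can be sketched in a few lines: take one dataset and test it not only against “effect = 0” but against a grid of other candidate effect sizes. The observations below are made up, the standard deviation is assumed known and equal to 1, and the helper `z_test_p_value` is a hypothetical name. A z-test is used only to keep the arithmetic short.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test_p_value(sample, hypothesized_mean, sd):
    """Two-sided p-value for H0: true mean == hypothesized_mean (sd known)."""
    std_error = sd / sqrt(len(sample))
    z = (mean(sample) - hypothesized_mean) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up treatment-minus-control differences, for illustration only.
differences = [0.4, -0.9, 1.3, 0.2, -0.5, 0.8, -1.1, 0.6, 0.1, -0.3]
assumed_sd = 1.0  # assumption: known standard deviation

# Candidate effect sizes from -1.0 to 1.0 in steps of 0.1.
candidates = [i / 10 for i in range(-10, 11)]
not_rejected = [
    effect for effect in candidates
    if z_test_p_value(differences, effect, assumed_sd) >= 0.05
]

print("effect sizes the data fail to reject:", not_rejected)
```

The null value 0.0 appears in the printed list, but so do several nonzero effects. Failing to reject the null puts it in the same position as every other value on that list, and the grid is only a coarse sample of the infinitely many points in the alternative.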
I would like to follow up with a discussion of tests of equivalence, but first I need to attack the notion of increasing power to prove the null. As convincing as the above arguments may be, I was told that it’s just theory and that in practice you could get away with a lot less, as though we could ignore theory and reverse the logic of a hypothesis test without demonstrating equivalence. But that position rests on the same faulty logic described above: if you can’t find a contradiction, then the assumption must be correct. Game on.