# Luk Arbuckle

## But you can show equivalence

In hypothesis testing on 7 November 2008 at 10:49 am

Hopefully it’s clear from previous posts that you can’t prove the null, and you can’t use power to build support for the null.  And this confusion is one reason I don’t like the term “accepting” the null hypothesis.  The question remains, however, of what you can do with a hypothesis that fits what you would normally consider a “null”, but that you would actually like to prove.

To flip the role you would normally attribute to a null hypothesis with that of an alternative hypothesis, you probably need to consider an equivalence test.  First you have to nail down an effect size, that is, the maximum amount the parameter can deviate by (positive or negative) in the experiment in order to conclude that it is of no practical or scientific importance.  Even if you’re not doing an equivalence test, this question is important in determining sample size because you want to be sure your results are both statistically and scientifically significant (but calculating sample size [PDF] is the subject for a future blog post).

### What’s the difference?
In an equivalence test you take your null hypothesis to be non-equivalence.  That is, that the absolute value of the parameter under consideration is greater than or equal to the effect size (the parameter is less than or equal to the negative of the effect size, or greater than or equal to the effect size).  The alternative is, therefore, that the absolute value of the parameter is less than the effect size.  Note that we don’t care if the parameter has a positive or negative effect—the goal is to reject the null hypothesis so that you can conclude that the effect is not of practical or scientific importance (although there are one-way equivalence tests as well).
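
In symbols, and only as a sketch: writing $\theta$ for the parameter under consideration and $\delta > 0$ for the effect size (both symbols are assumed shorthand for this illustration, not notation from a particular text), the hypotheses just described can be written as

```latex
H_0 : |\theta| \ge \delta
\quad \text{(that is, } \theta \le -\delta \text{ or } \theta \ge \delta \text{)}
\qquad \text{versus} \qquad
H_1 : |\theta| < \delta
\quad \text{(that is, } -\delta < \theta < \delta \text{)}
```

Rejecting $H_0$ here is what supports the conclusion that any effect is too small to matter.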

For example, consider a treatment that is believed to be no better or worse than a placebo.  The effect size should define the range of values within which the actual treatment effect can be considered to be of no scientific importance (equivalent to the placebo).  The null (that there is a scientifically important difference between treatment and placebo) will be rejected if the data show the treatment effect to be smaller in absolute value than the effect size.  Remember that we don’t care if the treatment has a positive or negative effect compared to the placebo in this example, since our goal is to reject the null of a scientifically important effect in either direction.

### Two for one
An equivalence test is essentially two one-tailed tests: one test to determine that there is no scientifically important positive effect (it’s no better), and a second test to determine that there is no scientifically important negative effect (it’s no worse).  And, as it turns out, the equivalence test is disjoint with a test of significance so that you can test both at the same significance level.  Just to be clear, the test of significance would have null equal to zero (no treatment effect), and alternative not equal to zero (some positive or negative treatment effect).
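
The two one-tailed tests can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a definitive implementation: it uses a large-sample normal approximation rather than a t distribution, and the function name `tost_z`, the argument `delta` (the effect size), and the sample data are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_z(treatment, placebo, delta, alpha=0.05):
    """Two one-sided z-tests for equivalence (large-sample approximation).

    Null: |difference in means| >= delta (non-equivalence).
    Alternative: |difference in means| < delta (equivalence).
    """
    diff = mean(treatment) - mean(placebo)
    # Standard error of the difference between two independent sample means.
    se = sqrt(stdev(treatment) ** 2 / len(treatment)
              + stdev(placebo) ** 2 / len(placebo))
    std_normal = NormalDist()

    # Test 1: null diff <= -delta, alternative diff > -delta ("no worse").
    p_lower = 1.0 - std_normal.cdf((diff + delta) / se)
    # Test 2: null diff >= +delta, alternative diff < +delta ("no better").
    p_upper = std_normal.cdf((diff - delta) / se)

    # Equivalence is concluded only if BOTH one-sided tests reject,
    # so the overall p-value is the larger of the two.
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha


# Hypothetical data: two groups with the same mean and similar spread.
treatment = [10.0, 10.2, 9.8, 10.1, 9.9] * 20
placebo = [10.0, 10.1, 9.9, 10.2, 9.8] * 20

diff, p_value, equivalent = tost_z(treatment, placebo, delta=0.1)
print(f"difference={diff:.3f}, p={p_value:.4g}, equivalent={equivalent}")
```

Note that the conclusion depends on the choice of `delta`: with the same data, a much tighter effect size (say `delta=0.01`) would not let you reject non-equivalence, which is why the effect size has to be nailed down before looking at the results.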

My focus in this and the last two posts was on hypothesis testing, even though confidence intervals are often preferred for making inferences. This is a reflection of the debate I was dragged into, not of personal preference.  If you’re interested, Nick Barrowman shared a link (in the comments to a previous post) to a website that discusses equivalence testing and confidence intervals (although I don’t agree with their comments that equivalence from the perspective of statistical significance is convoluted).  Regardless, the debate is over (at least for us).

1. If you want to prove the null, you have to abandon standard statistical theory, which is not a normative (mathematically correct) theory of probabilistic inference, and use a Bayesian analysis, which is the normative theory of probabilistic inference. If it were in fact impossible to prove the null, then the foundations of physics would crumble, since most of the basic laws/assumptions/hypotheses of physics are invariance laws, which are null hypotheses (e.g., the first law of thermodynamics: the energy in a closed physical system cannot be increased or decreased by any manipulation whatsoever).

In the standard statistical framework, there is only one horse in the race, namely, the null, and the rules specify that this horse–the only horse in the race–is not allowed to win. That nonsensical state of affairs tells you all you need to know about the logical foundations of the standard approach.

In a Bayesian analysis, there is at least one other horse in the race, that is, there is some SPECIFIED alternative to the null, and either horse can win, that is, the data may favor either hypothesis to an arbitrarily great extent. In a Bayesian analysis, it may be shown that, GIVEN THE AVAILABLE DATA, the null is literally unbeatable, that is, no vaguer alternative to it (e.g., the experimental manipulation had SOME effect) is as likely as is the null (that the experimental manipulation had no effect). For fuller explanations, see

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. J. (2009). Bayesian t-tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225-237.

and

Gallistel, C. R. (2009). The importance of proving the null. Psychological Review, 116(2), 439-453.

or standard texts on Bayesian inference, such as:

Jaynes, E. T. (2003). Probability theory: The logic of science. New York: Cambridge University Press.

or

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.

2. I have serious reservations with regard to the above comments. Frequentists can be described as defining probability in terms of objective properties; Bayesians can be described as defining probability in terms of subjective properties. But both views are well founded in probability theory, and therefore “mathematically correct”.

I may be wrong, but I believe the Bayes factor has been misinterpreted. The paper cited above [PDF] states that the Bayes factor is “the odds favoring one hypothesis over the other”, and that, “unlike traditional p values, these odds may favor the null”. Consider, however, the following, from Bayes Factors: What They Are and What They Are Not [CiteSeerX]:

Bayes factors are not coherent measures of support […]. What the Bayes factor actually measures is the change in the odds in favor of the hypothesis when going from the prior to the posterior. […] Just because the data increase the support for a hypothesis H relative to its complement does not necessarily make H more likely than its complement, it only makes H more likely than it was a priori.

I’m admittedly skeptical because of the source (psychology), references (mostly psychology), and history (psychologists promoted the idea of using retrospective power to prove the null, which is not correct: the one-to-one relationship between retrospective power and p-values leads to contradictory conclusions). Psychologists are very knowledgeable about statistical methodology, but statistical theory and methods should be peer reviewed in journals of statistics.

And before anyone thinks they might be able to prove the null using posterior p-values, I have another paper on hand, namely Posterior Predictive P-Values: What They Are and What They Are Not [subscription required].

3. Indeed, what the Bayes Factor measures is the change in the odds in going from the prior to the posterior in the light of the data. Equivalently, it measures the strength of the support that the data under consideration (the data from which the likelihood function was computed) provide to one hypothesis versus the other. These odds are the posterior odds only if the prior odds are taken to be even. However, when the prior odds are not taken to be even, it is trivial to obtain the posterior odds, because they are simply the product of the Bayes Factor and the prior odds.
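
A minimal sketch of that last point, assuming two simple binomial hypotheses. The function names, the hypothesised success probabilities, and the data are all illustrative assumptions, not taken from the papers cited above.

```python
from math import comb


def binomial_likelihood(successes, trials, p):
    """Probability of the observed count under a simple hypothesis p."""
    return comb(trials, successes) * p ** successes * (1 - p) ** (trials - successes)


def bayes_factor(successes, trials, p_null, p_alt):
    """Likelihood ratio of the null to the alternative (both simple)."""
    return (binomial_likelihood(successes, trials, p_null)
            / binomial_likelihood(successes, trials, p_alt))


def posterior_odds(bf, prior_odds):
    """Posterior odds are the product of the Bayes factor and the prior odds."""
    return bf * prior_odds


# Hypothetical data: 5 successes in 10 trials; null p = 0.5, alternative p = 0.8.
bf = bayes_factor(successes=5, trials=10, p_null=0.5, p_alt=0.8)

even = posterior_odds(bf, prior_odds=1.0)          # even prior odds
skeptical = posterior_odds(bf, prior_odds=1 / 20)  # prior odds 1 to 20 against the null

print(f"Bayes factor: {bf:.2f}")
print(f"Posterior odds (even prior): {even:.2f}")
print(f"Posterior odds (1:20 prior): {skeptical:.2f}")
```

With even prior odds the posterior odds equal the Bayes factor. With prior odds of 1 to 20 against the null, the same data still move the odds toward the null, yet the posterior odds stay below 1.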

The non-normativeness of null hypothesis significance testing follows directly from the fact–which is not in dispute–that it cannot provide support for the null hypothesis. The problem lies not with the mathematics involved but rather with the formulation of the inference problem. Inference must be between competing hypotheses. If there is only one hypothesis on the table, then what is there to infer? And if the rules of inference specify that one can never conclude in favor of the only hypothesis that enters into the inference procedure (the only hypothesis that is allowed on the table), how can that be a logically coherent inference procedure?

Edwin Thompson Jaynes was Wayman Crow Distinguished Professor of Physics at Washington University.

Harold Jeffreys, FRS, was a mathematician and Plumian Professor of Astronomy at Cambridge.

R.E. Kass is professor of statistics at Carnegie Mellon, with a PhD in statistics from the University of Chicago and a PhD thesis entitled: The Riemannian structure of model spaces: A geometrical approach to inference.

Adrian E. Raftery is Blumstein-Jordan Professor of Statistics at the University of Washington. His recent work focuses on weather forecasting, on cluster analysis, and on Bayesian model averaging

Their papers have been published almost entirely in peer reviewed journals of statistics. We poor psychologists are crouching on the shoulders of these giants

4. Only with two simple hypotheses are these equivalent, i.e., that the Bayes factor can measure the support for one simple hypothesis versus its complement, in light of the data. The P value in this case can provide the same measure, as it is a monotone function of the Bayes factor (therefore both Bayesians and non-Bayesians can agree). But, even then, “just because the data increase the support for a hypothesis H relative to its complement does not necessarily make H more likely than its complement, it only makes H more likely than it was a priori.” The Bayes factor measures the strength of the evidence, relative to the hypotheses. It cannot be interpreted independent of the prior odds.

For more complicated hypotheses (namely, where at least one of the hypotheses is composite), “interpreting the Bayes factor as a measure of support is incoherent”, because Bayes factors are not monotone in the hypothesis. And, if you prefer, the authors give a decision theoretic justification for requiring coherence in a measure of support. The point is that the authors have shown a logical flaw in using Bayes factors as measures of support, in response to papers such as that by Kass and Raftery.

The book by Jaynes and Bretthorst is not a “standard text”, and often reads more like an opinion piece. It’s a book about assumptions and philosophy, and some of it is controversial (being riddled with polemics). Not to mention errors that were missed, possibly since Jaynes (the principal author) died before its completion (although even then, books are notorious for errors, given their size and lack of detailed peer-review). In a way, this reminds me of the book by Cohen that, as I recall, first proposed using retrospective power to prove the null, without the idea having been peer-reviewed in a journal of statistics.

I was not suggesting that you can’t find reputable scientists that have claimed, and may still believe, that you can prove the null. But there is no consensus that I see in the statistical journals (except, perhaps, that you probably can’t prove the null). I’m not going to defend hypothesis testing (most statisticians, including myself, prefer confidence intervals), although I briefly discussed the logical framework in a previous post. It’s important to understand the assumptions, and limitations, but my post was with regard to a debate in which someone wished to use retrospective power to prove the null (but you can’t prove something you’ve assumed to be true).
