# Luk Arbuckle

## But you can show equivalence

In hypothesis testing on 7 November 2008 at 10:49 am

Hopefully it’s clear from previous posts that you can’t prove the null, and you can’t use power to build support for the null.  And this confusion is one reason I don’t like the term “accepting” the null hypothesis.  The question remains, however, of what you can do with a hypothesis that fits what you would normally consider a “null”, but that you would actually like to prove.

To flip the role you would normally attribute to a null hypothesis with that of an alternative hypothesis, you probably need to consider an equivalence test. First you have to nail down an effect size, that is, the maximum amount by which the parameter can deviate (in either direction) and still be considered of no practical or scientific importance. Even if you’re not doing an equivalence test, this question is important in determining sample size because you want to be sure your results are both statistically and scientifically significant (but calculating sample size [PDF] is the subject for a future blog post).

### What’s the difference?

In an equivalence test you take your null hypothesis to be non-equivalence. That is, the absolute value of the parameter under consideration is greater than or equal to the effect size (the parameter is less than or equal to the negative of the effect size, or greater than or equal to the effect size). The alternative is, therefore, that the absolute value of the parameter is less than the effect size. Note that we don’t care whether the parameter has a positive or negative effect; the goal is to reject the null hypothesis so that you can conclude that the effect is not of practical or scientific importance (although there are one-sided equivalence tests as well).
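
In symbols, the two hypotheses can be written as follows. The names are labels introduced here for illustration: θ stands for the parameter under consideration and δ > 0 for the chosen effect size.

```latex
H_0 : |\theta| \ge \delta
\qquad \text{versus} \qquad
H_1 : |\theta| < \delta
```

Rejecting this null supports the claim that θ lies strictly inside the interval (−δ, δ).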

For example, consider a treatment that is believed to be no better or worse than a placebo. The effect size should define the range of values within which the actual treatment effect can be considered to be of no scientific importance (equivalent to the placebo). The null (that there is a scientifically important difference between treatment and placebo) will be rejected if the data show that the treatment effect lies within that range, that is, if its absolute value is significantly smaller than the effect size. Remember that we don’t care if the treatment has a positive or negative effect compared to the placebo in this example, since our goal is to reject the null of a scientifically important effect in either direction.

### Two for one

An equivalence test is essentially two one-tailed tests: one test to determine that there is no scientifically important positive effect (it’s no better), and a second test to determine that there is no scientifically important negative effect (it’s no worse). And, as it turns out, the equivalence test is disjoint from a test of significance so that you can test both at the same significance level. Just to be clear, the test of significance would have null equal to zero (no treatment effect), and alternative not equal to zero (some positive or negative treatment effect).
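
The two one-tailed tests can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes the estimated treatment effect is approximately normal with a known standard error (a z-test), and the function name `tost_z` and the example numbers are made up for the purpose.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def tost_z(estimate, std_error, delta, alpha=0.05):
    """Two one-sided z-tests for equivalence within (-delta, delta).

    Assumes a normally distributed estimate with known standard error
    (an illustrative simplification). Returns the two one-sided
    p-values and whether equivalence is concluded at level alpha.
    """
    # Test 1: H0: effect <= -delta  vs  H1: effect > -delta (it's no worse)
    p_lower = 1 - STD_NORMAL.cdf((estimate + delta) / std_error)
    # Test 2: H0: effect >= +delta  vs  H1: effect < +delta (it's no better)
    p_upper = STD_NORMAL.cdf((estimate - delta) / std_error)
    # Equivalence is concluded only if BOTH one-sided nulls are rejected.
    equivalent = max(p_lower, p_upper) <= alpha
    return p_lower, p_upper, equivalent


# Hypothetical numbers: same estimate, two different standard errors.
print(tost_z(estimate=0.1, std_error=0.2, delta=0.5))
print(tost_z(estimate=0.1, std_error=0.4, delta=0.5))
```

With the smaller standard error both one-sided nulls are rejected and equivalence is concluded; with the larger one neither is rejected, even though the estimate itself is unchanged.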

My focus in this and the last two posts was on hypothesis testing, even though confidence intervals are often preferred for making inferences. This is a reflection of the debate I was dragged into, not of personal preference.  If you’re interested, Nick Barrowman shared a link (in the comments to a previous post) to a website that discusses equivalence testing and confidence intervals (although I don’t agree with their comments that equivalence from the perspective of statistical significance is convoluted).  Regardless, the debate is over (at least for us).

## You can’t increase power to prove the null

In hypothesis testing on 31 October 2008 at 5:01 pm

In my last post I discussed the theory of hypothesis testing, and specifically how it does not support the idea of “proving the null hypothesis”. But I was told that it was only theory and that in practice you could argue that failing to reject the null was, in fact, support for the null if you had high power. The idea of increasing power (by increasing the sample size) in order to increase support for the null was also thrown around. Of course, you can argue whatever you like, but that doesn’t make it so. And in this case we have statistical theory on our side.

We know that a test of statistical significance should have a high probability of rejecting the null hypothesis when it is false (with a fixed probability of rejecting the null, the significance level, when it is true).  This probability is called power, and it guards against false negatives (whereas the significance level guards against false positives).  The question is whether we can use high values of power to prove the null, within the context of hypothesis testing.  A great article on the subject (only six pages long, with references) is Abuse of Power [PDF], which I’ll use as my main reference.
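
To make the definition of power concrete, here is a small sketch for the simplest case: a one-sided z-test of no effect against a positive effect, assuming a normally distributed sample mean with known standard deviation. The function name and the numbers are illustrative assumptions, not taken from the article.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def power_one_sided_z(effect, sigma, n, alpha=0.05):
    """Probability of rejecting H0: effect = 0 in favour of H1: effect > 0.

    Assumes a one-sided z-test with known standard deviation `sigma`
    and sample size `n` (an illustrative simplification).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)  # rejection threshold on the z scale
    shift = effect * sqrt(n) / sigma        # mean of the z statistic under the true effect
    return 1 - STD_NORMAL.cdf(z_crit - shift)


# When the null is true, the rejection probability is the significance level.
print(power_one_sided_z(effect=0.0, sigma=1.0, n=25))
# When the null is false, the rejection probability is the power.
print(power_one_sided_z(effect=0.5, sigma=1.0, n=25))
```

Note that power here is computed from a hypothesised true effect chosen in advance, not from the effect observed in the data.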

### Observe this

Proponents of using power to build evidence in support of the null calculate power using the observed value of the test statistic, calling it the observed power (in the same way a p-value is called the observed significance).  High values of observed power are interpreted as strong support for the null; low values of observed power are interpreted as weak support for the null.  We’ll come back to this shortly to demonstrate the false logic behind this interpretation.

[Figure: Low p-value, high power; high p-value, low power. But what does this actually tell you?]

For every value of observed power there is a unique p-value, and vice versa. In other words, the observed power is a one-to-one function of the p-value; inferences drawn from one of these observed values must, therefore, coincide with the other. Also, observed power is a decreasing function of the p-value (the relationship is inverse, though not strictly proportional). That is, low p-values coincide with high values of observed power; high p-values coincide with low values of observed power.
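
The one-to-one relationship can be checked numerically. The sketch below assumes a one-sided z-test, where observed power is obtained by treating the observed z statistic as if it were the true effect; the function name `observed_power` is a label chosen here for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def observed_power(p_value, alpha=0.05):
    """Observed power of a one-sided z-test, computed from the p-value alone.

    Assumes the observed z statistic is plugged in as the true effect,
    which is how observed power is defined in this simplified setting.
    """
    z_obs = STD_NORMAL.inv_cdf(1 - p_value)  # z statistic implied by the p-value
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)   # rejection threshold
    return 1 - STD_NORMAL.cdf(z_crit - z_obs)


p_values = [0.01, 0.05, 0.20, 0.50, 0.80]
for p in p_values:
    print(f"p-value = {p:.2f}  observed power = {observed_power(p):.3f}")
```

Nothing but the p-value (and the fixed significance level) goes into the calculation, so the observed power cannot carry any information that the p-value does not already carry. In this setting a p-value equal to the significance level corresponds to an observed power of exactly one half.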

Now let’s compare the interpretation of the observed power from those hoping to support the null against the interpretation of the p-value (provided by frequentist statistics).  A high value of observed power is interpreted as strong support for the null, which coincides with a low p-value interpreted as strong support against the null (strong yet contradictory statements); a low value of observed power is interpreted as weak support for the null, which coincides with a high p-value interpreted as weak support against the null (weak yet also contradictory statements).

### Say that again

Consider two experiments in which you failed to reject the null of no treatment effects, but in which the first experiment achieved a higher value of observed power than the second. Using the interpretation of observed power above, you would conclude that the first experiment with higher observed power provided stronger evidence in support of the null than the second experiment. But higher observed power means a lower p-value, and therefore you would conclude the first experiment provided stronger evidence against the null. These are contradictory conclusions, and only the interpretation of p-values can be called a hypothesis test (supported by frequentist statistics).

There are variants on this idea of observed power, such as detectable or significant effect size, but they’re logically flawed in the same way described above. And we could compare power analysis to confidence intervals, but the point is that nothing is gained from considering power calculations once you have a confidence interval. Power calculations should be reserved for planning the sample size of future studies, and not for making inferences about studies that have already taken place.
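
As an example of the legitimate use of power, the following sketch computes the sample size needed to reach a target power before a study is run. It assumes a one-sided z-test with known standard deviation; the function name, the effect size of 0.5 and the 80% power target are illustrative choices, not values from the article.

```python
from math import ceil
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def sample_size_one_sided_z(effect, sigma, alpha=0.05, power=0.80):
    """Smallest n for which a one-sided z-test reaches the target power.

    Assumes known standard deviation `sigma` and a hypothesised true
    effect `effect` fixed in advance (an illustrative simplification).
    """
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha)  # rejection threshold
    z_power = STD_NORMAL.inv_cdf(power)      # quantile for the target power
    return ceil(((z_alpha + z_power) * sigma / effect) ** 2)


print(sample_size_one_sided_z(effect=0.5, sigma=1.0))
print(sample_size_one_sided_z(effect=0.25, sigma=1.0))
```

Halving the effect you want to detect roughly quadruples the required sample size, which is why the effect size has to be pinned down on scientific grounds before the data are collected.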

## You can’t prove the null by not rejecting it

In hypothesis testing on 25 October 2008 at 10:53 am

I was (willingly) dragged into a discussion about “proving the null hypothesis” that I have to include here. But it will end up being three posts since there are different issues to address (basic theory, power, and equivalence). First step is to discuss the theory of hypothesis testing, what it is and what it isn’t, as it’s fundamental to understanding the problem of providing evidence to support the null.

Hypothesis testing is confusing in part because the logical basis on which the concept rests is not usually described: it’s a proof by contradiction. For example, if you want to prove that a treatment has an effect, you start by assuming there are no treatment effects—this is the null hypothesis. You assume the null and use it to calculate a p-value (the probability of measuring a treatment effect at least as strong as what was observed, given that there are no treatment effects). A small p-value is a contradiction to the assumption that the null is true. “Proof”, here, is used loosely—it’s strong enough evidence to cast doubt on the null.
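
The calculation described above can be sketched for the simplest case. The example assumes the estimated treatment effect is normally distributed with a known standard error (a one-sided z-test); the function name and the numbers are hypothetical.

```python
from statistics import NormalDist


def one_sided_p_value(estimate, std_error):
    """P(effect at least as large as observed | no treatment effect).

    Assumes a normally distributed estimate with known standard error,
    so that under the null the z statistic is standard normal.
    """
    z = estimate / std_error        # observed effect in standard-error units
    return 1 - NormalDist().cdf(z)  # upper-tail probability under the null


# Hypothetical study: estimated effect 0.5 with standard error 0.2.
print(one_sided_p_value(estimate=0.5, std_error=0.2))
# An estimate of exactly zero is as unsurprising as possible under the null.
print(one_sided_p_value(estimate=0.0, std_error=0.2))
```

The null enters the calculation as an assumption: the standard normal distribution used for the tail probability is only correct if there is no treatment effect.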

The p-value is based on the assumption that the null hypothesis is true. Trying to prove the null using a p-value is, therefore, trying to prove it’s true based on the assumption that it’s true. But we can’t prove the assumption that the null is true as we have already assumed it. The idea of a hypothesis test is to assume the null is true, then use that assumption to build a contradiction against it being true.

### Absence of evidence

No conclusion can be drawn if you fail to build a contradiction. Another way to think of this is to remember that the p-value measures evidence against the null, not for it. And therefore lack of evidence to reject the null does not imply sufficient evidence to support it. Absence of evidence is not evidence of absence. Some would like to believe that the inability to reject the null suggests the null may be true (and they try to support this claim with large sample sizes, or high power, which I’ll address in a subsequent post).

[Figure: Rejecting the null leaves you with a lot of alternatives. One down, an infinite number to go!]

Failing to reject the null is a weak outcome, and that’s the point. It’s no better than failing to reject the innumerable models that were not tested. Although the null and alternative hypotheses represent a dichotomy (either one is true or the other), they underlie a parameter space. The alternative represents the complement of the space defined by the null, that is, the parameter space minus the null.

In the context of treatment effects, the null is no treatment effects, which represents a single point in the parameter space. But the alternative—some degree of treatment effects—is the complement, which is every point in the parameter space minus the null. If you want to use the theory of hypothesis testing in this way to “prove” the null, you would have to reject probability models for every point in the alternative, which is infinite! Even if you could justify taking a finite number of probability models, with some practical significance to each, it should be clear that it’s not just a matter of failing to reject the null.

I would like to follow up with a discussion of tests of equivalence, but first I need to attack the notion of increasing power to prove the null. As convincing as the above arguments may be, I was told that it’s just theory and that in practice you could get away with a lot less. As though we can ignore theory and reverse the notion of a hypothesis test without demonstrating equivalence. But they use the same faulty logic described above to justify it: if you can’t find a contradiction, then it must be correct.  Game on.

## Accept the null hypothesis, or fail to reject it?

In hypothesis testing on 8 September 2008 at 1:17 pm

I’m doing a review of basic statistics since I’ll be helping undergrad students, in one-on-one consultation and teaching labs, understand math and stats concepts introduced in their classes.  I also find it useful to step outside the realm of mathematics to interpret and understand the material from a more general perspective.  As such, I’ll likely post on several topics from the perspective of understanding and applying basic statistics.

In my review I’ve started reading The Little Handbook of Statistical Practice by Dallal.  I jumped to Significance Tests to sample the handbook and because, quite frankly, I felt there was something I was conceptually missing about hypothesis testing as an undergrad.  I could churn out the answers, as required, but never felt it was well absorbed.   Dallal’s discussion turned on a light bulb in my head:

Null hypotheses are never accepted. We either reject them or fail to reject them. The distinction between “acceptance” and “failure to reject” is best understood in terms of confidence intervals. Failing to reject a hypothesis means a confidence interval contains a value of “no difference”. However, the data may also be consistent with differences of practical importance. Hence, failing to reject H0 does not mean that we have shown that there is no difference (accept H0).
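
A small numerical sketch of this point, assuming a normally distributed estimate with known standard error. The numbers, and the value 0.5 standing in for a “difference of practical importance”, are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, level=0.95):
    """Two-sided normal-theory confidence interval for an estimate.

    Assumes a normally distributed estimate with known standard error
    (an illustrative simplification).
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    return estimate - z * std_error, estimate + z * std_error


low, high = confidence_interval(estimate=0.3, std_error=0.25)
important_difference = 0.5  # hypothetical threshold of practical importance

print(f"95% CI: ({low:.2f}, {high:.2f})")
print("contains no difference:", low <= 0 <= high)
print("contains an important difference:", low <= important_difference <= high)
```

The interval contains zero, so the null is not rejected, but it also contains the hypothetical important difference, so the data give no grounds for claiming there is no difference.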

I like Dallal’s discussion of the topic because of the emphasis on confidence intervals and the distinction between accepting the null and failing to reject it.  It seems odd that I would never have heard of this in my previous studies.  I turned to my intermediate undergrad-level text (by Miller and Miller) to see if I had simply forgotten, and they state the problem as being “to accept the null hypothesis or to reject it in favor of the alternative hypothesis.”  They take the (possibly common) approach of considering a hypothesis test to be a problem in which one of the null hypothesis or the alternative hypothesis will be asserted.  This approach leaves me wholly unsatisfied.

I instead turned to my intermediate grad-level text (by Casella and Berger) for more insight: “On a philosophical level, some people worry about the distinction […] between “accepting” H0 and “not rejecting” H0.”  This sounds promising.  The authors continue with some details and finally state that “for the most part, we will not be concerned with these issues.”  Ugh.  What a disappointing end to what could (or should) have been an interesting discussion.

If we don’t reject the null hypothesis, we don’t conclude that it’s true.  We simply recognize that the null hypothesis is a possibility (it’s something that we could observe).  I believe this is what is meant by “accepting” the null hypothesis—we accept that it is a possibility (the term “accept” is far from precise, after all).   An older text (by Crow, Davis, and Maxfield) reminded me, as did Dallal, that Fisher did not use an alternative hypothesis, and therefore there was no concept of “accepting” an alternative in his construction of significance tests.  Maybe this has something to do with the use of this imprecise term for both H0 and H1 (and somehow involving the “Neyman-Pearson school of frequentist statistics”, which puts an emphasis on the alternative hypothesis, as Dallal points out).

Many texts, and perhaps analysts, discuss “accepting” the null hypothesis as though they were stating that the null hypothesis were in fact true. Showing that the null hypothesis is true is not the same thing as failing to reject it. There is a relatively low probability (by construction) of rejecting the null hypothesis when it is in fact true (Type I error). But if we fail to reject the null hypothesis, what’s the probability of it being true? Dallal provides an interesting discussion of how “failing to find an effect is different from showing there is no effect!” Until I find a good counterargument, I’m going to be irked when I hear or read the use of “accepting the null”.