Luk Arbuckle

Posts Tagged ‘proving the null’

But you can show equivalence

In hypothesis testing on 7 November 2008 at 10:49 am

Hopefully it’s clear from previous posts that you can’t prove the null, and you can’t use power to build support for the null.  And this confusion is one reason I don’t like the term “accepting” the null hypothesis.  The question remains, however, of what you can do with a hypothesis that fits what you would normally consider a “null”, but that you would actually like to prove.

To flip the role you would normally attribute to a null hypothesis with that of an alternative hypothesis, you probably need to consider an equivalence test.  First you have to nail down an effect size, that is, the maximum amount, positive or negative, by which the parameter can deviate and still be considered of no practical or scientific importance.  Even if you’re not doing an equivalence test, this question matters in determining sample size, because you want to be sure your results are both statistically and scientifically significant (but calculating sample size [PDF] is the subject of a future blog post).

What’s the difference?
In an equivalence test you take your null hypothesis to be non-equivalence.  That is, the absolute value of the parameter under consideration is greater than or equal to the effect size (the parameter is less than or equal to the negative of the effect size, or greater than or equal to the effect size).  The alternative is, therefore, that the absolute value of the parameter is less than the effect size.  Note that we don’t care whether the effect is positive or negative: the goal is to reject the null hypothesis so that you can conclude the effect is of no practical or scientific importance (although one-sided equivalence tests, such as non-inferiority tests, exist as well).
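To make this concrete, here is one way to write the two hypotheses, where θ is the parameter of interest and Δ > 0 is the effect size (the symbols are my own shorthand, just for illustration):

```latex
H_0\colon\ |\theta| \ge \Delta
\qquad \text{versus} \qquad
H_1\colon\ |\theta| < \Delta
```

Rejecting H_0 lets you conclude that the parameter lies inside (−Δ, Δ), that is, that any effect is too small to matter.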

For example, consider a treatment that is believed to be no better or worse than a placebo.  The effect size should define the range of values within which the actual treatment effect can be considered to be of no scientific importance (equivalent to the placebo).  The null, that there is a scientifically important difference between treatment and placebo, will be rejected if the treatment effect is shown to be smaller than the effect size, that is, to fall within the equivalence range.  Remember that we don’t care whether the treatment has a positive or negative effect compared to the placebo in this example, since our goal is to reject the null of a scientifically important effect in either direction.

Two for one
An equivalence test is essentially two one-tailed tests: one test to determine that there is no scientifically important positive effect (it’s no better), and a second test to determine that there is no scientifically important negative effect (it’s no worse).  And, as it turns out, the null hypotheses of the equivalence test and of the usual test of significance are disjoint, so you can carry out both tests at the same significance level.  Just to be clear, the test of significance would have a null of zero (no treatment effect) and a two-sided alternative (some treatment effect, positive or negative).
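As a rough illustration, here is a minimal sketch of the two one-sided tests (often abbreviated TOST) for a difference in means, assuming approximately normal data and a pre-specified equivalence margin.  The function name, the use of a t distribution, and the simple choice of degrees of freedom are my own, not from the original discussion:

```python
import numpy as np
from scipy import stats

def tost_equivalence(x, y, margin, alpha=0.05):
    """Two one-sided tests of H0: |mean(x) - mean(y)| >= margin."""
    diff = np.mean(x) - np.mean(y)
    se = np.sqrt(np.var(x, ddof=1) / len(x) + np.var(y, ddof=1) / len(y))
    df = len(x) + len(y) - 2  # simple choice; Welch's approximation would be more careful

    # Test 1: reject "difference <= -margin" (the treatment is no worse)
    p_lower = stats.t.sf((diff + margin) / se, df)
    # Test 2: reject "difference >= +margin" (the treatment is no better)
    p_upper = stats.t.cdf((diff - margin) / se, df)

    # Equivalence is concluded only if BOTH one-sided nulls are rejected,
    # so the overall p-value is the larger of the two.
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha
```

Both one-sided tests are run at level α, and equivalence is declared only when both reject, which is what keeps the overall test at level α without any multiplicity adjustment.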

My focus in this and the last two posts was on hypothesis testing, even though confidence intervals are often preferred for making inferences. This is a reflection of the debate I was dragged into, not of personal preference.  If you’re interested, Nick Barrowman shared a link (in the comments to a previous post) to a website that discusses equivalence testing and confidence intervals (although I don’t agree with their comments that equivalence from the perspective of statistical significance is convoluted).  Regardless, the debate is over (at least for us).

You can’t increase power to prove the null

In hypothesis testing on 31 October 2008 at 5:01 pm

In my last post I discussed the theory of hypothesis testing, and specifically how it does not support the idea of “proving the null hypothesis”.  But I was told that it was only theory and that in practice you could argue that failing to prove the null was, in fact, support for the null if you had high power.  The idea of increasing power (by increasing the sample size) in order to increase support for the null was also thrown around.  Of course, you can argue whatever you like, but that doesn’t make it so.  And in this case we have statistical theory on our side.

We know that a test of statistical significance should have a high probability of rejecting the null hypothesis when it is false (with a fixed probability of rejecting the null, the significance level, when it is true).  This probability is called power, and it guards against false negatives (whereas the significance level guards against false positives).  The question is whether we can use high values of power to prove the null, within the context of hypothesis testing.  A great article on the subject (only six pages long, with references) is Abuse of Power [PDF], which I’ll use as my main reference.
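For concreteness, here is a small toy calculation (my own, not from the article) of the power of a one-tailed one-sample z test, with the true effect expressed in standard-deviation units:

```python
from scipy.stats import norm

def power_one_tailed_z(effect, n, alpha=0.05):
    """Probability of rejecting H0: mu = 0 when the true standardized effect is `effect`."""
    z_alpha = norm.ppf(1 - alpha)              # critical value under the null
    return norm.sf(z_alpha - effect * n**0.5)  # probability the test statistic exceeds it

print(power_one_tailed_z(effect=0.2, n=50))   # roughly 0.41
print(power_one_tailed_z(effect=0.2, n=200))  # roughly 0.88
```

Increasing the sample size (or the true effect) pushes power toward one, and that sample-size knob is exactly what people hope to turn to “support” the null.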

Observe this
Proponents of using power to build evidence in support of the null calculate power using the observed value of the test statistic, calling it the observed power (in the same way a p-value is called the observed significance).  High values of observed power are interpreted as strong support for the null; low values of observed power are interpreted as weak support for the null.  We’ll come back to this shortly to demonstrate the false logic behind this interpretation.

[Figure: example of observed power vs p-value for a one-tailed Z test in which α is set to .05. Low p-value, high power; high p-value, low power. But what does this actually tell you?]

For every value of observed power there is a unique p-value, and vice versa.  In other words, the observed power is a one-to-one function of the p-value, so inferences drawn from one of these observed values must coincide with inferences drawn from the other.  Also, observed power moves inversely with the p-value: low p-values coincide with high values of observed power, and high p-values coincide with low values of observed power.
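Here is a small sketch of that relationship for a one-tailed z test with α = .05, where the observed power is computed by plugging the observed z statistic in as though it were the true effect (my own illustration of the figure above):

```python
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.ppf(1 - alpha)  # critical value for the one-tailed test

for p in [0.01, 0.05, 0.20, 0.50, 0.80]:
    z_obs = norm.ppf(1 - p)                    # the observed z statistic that gives this p-value
    observed_power = norm.sf(z_alpha - z_obs)  # "power" evaluated at the observed effect
    print(f"p = {p:.2f}  ->  observed power = {observed_power:.2f}")
```

The printed values decrease as the p-value increases, and at p = α the observed power is exactly 0.5.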

Now let’s compare the interpretation of the observed power from those hoping to support the null against the interpretation of the p-value (provided by frequentist statistics).  A high value of observed power is interpreted as strong support for the null, which coincides with a low p-value interpreted as strong support against the null (strong yet contradictory statements); a low value of observed power is interpreted as weak support for the null, which coincides with a high p-value interpreted as weak support against the null (weak yet also contradictory statements).

Say that again
Consider two experiments in which you failed to reject the null of no treatment effects, but in which the first experiment achieved a higher value of observed power than the second.  Using the interpretation of observed power above, you would conclude that the first experiment, with its higher observed power, provided stronger evidence in support of the null than the second experiment.  But higher observed power means a lower p-value, and therefore you would also conclude the first experiment provided stronger evidence against the null.  These are contradictory conclusions, and only the interpretation based on p-values is an actual hypothesis test supported by frequentist statistics.

There are variants on this idea of observed power, such as detectable or significant effect size, but they’re logically flawed in the same way described above.  And we could compare power analysis to confidence intervals, but the point is that nothing is gained from considering power calculations once you have a confidence interval.  Power calculations should be reserved for planning the sample size of future studies, not for making inferences about studies that have already taken place.

You can’t prove the null by not rejecting it

In hypothesis testing on 25 October 2008 at 10:53 am

I was (willingly) dragged into a discussion about “proving the null hypothesis” that I have to include here.  But it will end up being three posts, since there are different issues to address (basic theory, power, and equivalence).  The first step is to discuss the theory of hypothesis testing, what it is and what it isn’t, as it’s fundamental to understanding the problem of providing evidence to support the null.

Hypothesis testing is confusing in part because the logical basis on which the concept rests is not usually described: it’s a proof by contradiction. For example, if you want to prove that a treatment has an effect, you start by assuming there are no treatment effects—this is the null hypothesis. You assume the null and use it to calculate a p-value (the probability of measuring a treatment effect at least as strong as what was observed, given that there are no treatment effects). A small p-value is a contradiction to the assumption that the null is true. “Proof”, here, is used loosely—it’s strong enough evidence to cast doubt on the null.
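Written out, with T a generic test statistic, t_obs its observed value, and a one-sided alternative for simplicity, the p-value in that parenthesis is:

```latex
p = P\left(T \ge t_{\text{obs}} \mid H_0 \text{ is true}\right)
```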

The p-value is calculated under the assumption that the null hypothesis is true.  Trying to prove the null using a p-value is, therefore, trying to prove an assumption using a calculation that presupposes that very assumption.  The idea of a hypothesis test is to assume the null is true, and then to use that assumption to build a contradiction against it being true.

Absence of evidence
No conclusion can be drawn if you fail to build a contradiction. Another way to think of this is to remember that the p-value measures evidence against the null, not for it.  And therefore lack of evidence to reject the null does not imply sufficient evidence to support it.  Absence of evidence is not evidence of absence. Some would like to believe that the inability to reject the null suggests the null may be true (and they try to support this claim with high sample sizes, or high power, which I’ll address in a subsequent post).

Rejecting the null leaves you with a lot of alternatives. One down, an infinite number to go!

Failing to reject the null is a weak outcome, and that’s the point.  It’s no better than failing to reject the innumerable models that were never tested.  Although the null and alternative hypotheses represent a dichotomy (one or the other is true), they are statements about an underlying parameter space.  The alternative is the complement of the region defined by the null, that is, the parameter space minus the null.

In the context of treatment effects, the null of no treatment effect represents a single point in the parameter space.  But the alternative, some degree of treatment effect, is its complement: every point in the parameter space except the null.  If you wanted to use the theory of hypothesis testing this way to “prove” the null, you would have to reject probability models for every point in the alternative, and there are infinitely many of them!  Even if you could justify testing only a finite number of probability models, each with some practical significance, it should be clear that it’s not just a matter of failing to reject the null.
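In symbols (my own notation), with Θ the parameter space and θ = 0 the point of no treatment effect, the two hypotheses are:

```latex
H_0\colon\ \theta = 0
\qquad \text{versus} \qquad
H_1\colon\ \theta \in \Theta \setminus \{0\}
```

“Proving” the null by elimination would mean ruling out every point of Θ ∖ {0}, which is why failing to reject a single model gets you nowhere.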

I would like to follow up with a discussion of tests of equivalence, but first I need to attack the notion of increasing power to prove the null. As convincing as the above arguments may be, I was told that it’s just theory and that in practice you could get away with a lot less. As though we can ignore theory and reverse the notion of a hypothesis test without demonstrating equivalence. But they use the same faulty logic described above to justify it: if you can’t find a contradiction, then it must be correct.  Game on.