Luk Arbuckle

You can’t increase power to prove the null

In hypothesis testing on 31 October 2008 at 5:01 pm

In my last post I discussed the theory of hypothesis testing, and specifically how it does not support the idea of “proving the null hypothesis”.  But I was told that this was only theory, and that in practice you could argue that failing to reject the null was, in fact, support for the null if you had high power.  The idea of increasing power (by increasing the sample size) in order to increase support for the null was also thrown around.  Of course, you can argue whatever you like, but that doesn’t make it so.  And in this case we have statistical theory on our side.

We know that a test of statistical significance should have a high probability of rejecting the null hypothesis when it is false (with a fixed probability of rejecting the null, the significance level, when it is true).  This probability is called power, and it guards against false negatives (whereas the significance level guards against false positives).  The question is whether we can use high values of power to prove the null, within the context of hypothesis testing.  A great article on the subject (only six pages long, with references) is Abuse of Power [PDF], which I’ll use as my main reference.

Observe this
Proponents of using power to build evidence in support of the null calculate power using the observed value of the test statistic, calling it the observed power (in the same way a p-value is called the observed significance).  High values of observed power are interpreted as strong support for the null; low values of observed power are interpreted as weak support for the null.  We’ll come back to this shortly to demonstrate the false logic behind this interpretation.

Example of Observed Power vs P-Value for a One-Tailed Z Test in Which α is Set to .05.

Low p-value, high power; high p-value, low power. But what does this actually tell you?

For every value of observed power there is a unique p-value, and vice versa.  In other words, the observed power is a one-to-one function of the p-value, so inferences drawn from one of these observed values must coincide with inferences drawn from the other.  Also, observed power is a strictly decreasing function of the p-value.  That is, low p-values coincide with high values of observed power; high p-values coincide with low values of observed power.
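To make the one-to-one relationship concrete, here is a minimal sketch for the one-tailed z test in the figure, with α set to .05.  It assumes the usual definition of observed power, in which the observed z statistic is plugged in as if it were the true standardized effect; the function name `observed_power` is my own choice for illustration and does not come from the paper.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def observed_power(p_value: float, alpha: float = 0.05) -> float:
    """Observed power of a one-tailed z test, computed from its p-value.

    Assumption: the observed z statistic is treated as the true
    standardized effect, which is what "observed power" means here.
    """
    z_obs = STD_NORMAL.inv_cdf(1.0 - p_value)  # observed test statistic
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)   # rejection threshold
    # Probability that a new z drawn from N(z_obs, 1) exceeds the threshold.
    return STD_NORMAL.cdf(z_obs - z_crit)


if __name__ == "__main__":
    # Observed power falls steadily as the p-value rises.
    for p in (0.01, 0.05, 0.10, 0.25, 0.50):
        print(f"p = {p:.2f}  observed power = {observed_power(p):.3f}")
```

Nothing goes into the function except the p-value and α, so observed power cannot carry any information that the p-value does not already carry.  Under this definition a p-value exactly equal to α always gives an observed power of one half.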

Now let’s compare the interpretation of the observed power from those hoping to support the null against the interpretation of the p-value (provided by frequentist statistics).  A high value of observed power is interpreted as strong support for the null, which coincides with a low p-value interpreted as strong support against the null (strong yet contradictory statements); a low value of observed power is interpreted as weak support for the null, which coincides with a high p-value interpreted as weak support against the null (weak yet also contradictory statements).

Say that again
Consider two experiments in which you failed to reject the null of no treatment effects, but in which the first experiment achieved a higher value of observed power than the second.  Using the interpretation of observed power above, you would conclude that the first experiment with higher observed power provided stronger evidence in support of the null than the second experiment.  But higher power means a lower p-value, and therefore you would conclude the first experiment provided stronger evidence against the null.  These are contradictory conclusions, and only the interpretation of p-values can be called a hypothesis test (supported by frequentist statistics).
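The two-experiment comparison can be run with numbers.  This is only a sketch: the z statistics of 1.2 and 0.3 are made up for illustration, and the test is assumed to be the same one-tailed z test with α set to .05 used in the figure.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.05
Z_CRIT = STD_NORMAL.inv_cdf(1.0 - ALPHA)  # about 1.645 for a one-tailed test


def p_value(z_obs: float) -> float:
    """One-tailed p-value for an observed z statistic."""
    return 1.0 - STD_NORMAL.cdf(z_obs)


def observed_power(z_obs: float) -> float:
    """Power computed as if the observed z were the true standardized effect."""
    return STD_NORMAL.cdf(z_obs - Z_CRIT)


# Hypothetical z statistics; both fall short of Z_CRIT, so neither rejects.
experiments = {"first": 1.2, "second": 0.3}

for name, z in experiments.items():
    print(f"{name}: z = {z:.2f}, p = {p_value(z):.3f}, "
          f"observed power = {observed_power(z):.3f}")
```

The first experiment comes out with the higher observed power and the lower p-value, so it cannot be both the stronger case for the null and the stronger case against it.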

There are variants on this idea of observed power, such as detectable or significant effect size, but they’re logically flawed in the same way described above.  And we could compare power analysis to confidence intervals, but the point is that nothing is gained from considering power calculations once you have a confidence interval.  Power calculations should be reserved for planning the sample size of future studies, and not for making inferences about studies that have already taken place.
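For contrast, here is what a legitimate power calculation looks like when planning a future study.  It is a rough sketch under stated assumptions: a one-sample, one-tailed z test with known standard deviation, and an effect size chosen before any data are collected.  The function names and the numbers in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def planned_power(n: int, effect: float, sigma: float = 1.0,
                  alpha: float = 0.05) -> float:
    """Power of a one-tailed z test against a prespecified effect size."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(effect * math.sqrt(n) / sigma - z_crit)


def required_sample_size(effect: float, sigma: float = 1.0,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Smallest n whose planned power reaches the target."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / effect) ** 2)


if __name__ == "__main__":
    # Hypothetical planning inputs: detect a shift of half a standard deviation.
    n = required_sample_size(effect=0.5)
    print(f"n = {n}, planned power = {planned_power(n, effect=0.5):.3f}")
```

The effect size here is fixed in advance rather than estimated from the observed test statistic, which is what separates planning from the observed power calculation criticised above.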

  1. This is incorrect. It is possible to have both low alpha (determined your p-value) and a strong power. Your sample size must increase to reflect this. Check out paper on sample size. My statistics prof, who taught George Casella (author of a stats book you mentioned) clarified this common mistake for me last year.

  2. I’m not sure what you are trying to say, as it sounds like we agree: low p-value and high power, as shown in the figure. The point is that you can’t claim high power supports the null, given that the low p-value rejects the null.

  3. To correctly appeal to “observed power” in appraising negative results, you need to take into account specific parametric discrepancies from the null that are ruled out, analogous to looking at the upper CI bound (values in excess of the bound being ruled out at the given level). I’m referring here to a case of one-sided testing in the positive direction. You speak of power, but it has to be power against a specific alternative parameter value. You cannot leave these other features vague, else you get the confusions you mention. Sorry, I don’t have time to spell this out here, but came across this today in passing. Possibly see where I’ve written much more about this:


  4. The point is that there is a one-to-one relationship between p-value and power. The rest just sounds like special pleading, but that may just be that I can’t find anything on your blog that explains further.
