In my last post I discussed the theory of hypothesis testing, and specifically how it does not support the idea of “proving the null hypothesis”. But I was told that this was only theory, and that in practice you could argue that failing to reject the null was, in fact, support for the null if you had high power. The idea of increasing power (by increasing the sample size) in order to increase support for the null was also thrown around. Of course, you can argue whatever you like, but that doesn’t make it so. And in this case we have statistical theory on our side.

We know that a test of statistical significance should have a high probability of rejecting the null hypothesis when it is false (while holding fixed the probability of rejecting the null when it is true, the significance level). This probability is called power, and it guards against false negatives (whereas the significance level guards against false positives). The question is whether we can use high values of power to prove the null, within the context of hypothesis testing. A great article on the subject (only six pages long, with references) is Abuse of Power [PDF], which I’ll use as my main reference.

**Observe this**

Proponents of using power to build evidence in support of the null calculate power at the observed value of the test statistic—that is, treating the observed effect size as if it were the true effect size—calling it the observed power (in the same way a p-value is called the observed significance). High values of observed power are interpreted as strong support for the null; low values of observed power are interpreted as weak support for the null. We’ll come back to this shortly to demonstrate the false logic behind this interpretation.

For every value of observed power there is a unique p-value, and vice versa. In other words the observed power is a one-to-one function of the p-value—inferences drawn from one of these observed values must, therefore, coincide with the other. Also, observed power is a strictly decreasing function of the p-value (inversely related, though not literally proportional). That is, low p-values coincide with high values of observed power; high p-values coincide with low values of observed power.
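To make the one-to-one relationship concrete, here is a minimal sketch for a one-sided, one-sample z-test. The choice of test, the critical value 1.6449, and the z values are my assumptions for illustration, not anything from the article; the point is only that the p-value and the observed power are both monotone in the observed statistic, in opposite directions:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

Z_ALPHA = 1.6449  # approximate one-sided 5% critical value (assumed setup)

def p_value(z_obs):
    """Observed significance for a one-sided z-test."""
    return 1.0 - phi(z_obs)

def observed_power(z_obs):
    """Power evaluated at the observed statistic: P(Z > z_alpha - z_obs)."""
    return 1.0 - phi(Z_ALPHA - z_obs)

# As the observed statistic grows, the p-value falls and observed power rises.
for z in (0.5, 1.0, 1.5):
    print(f"z = {z}: p-value = {p_value(z):.3f}, observed power = {observed_power(z):.3f}")
```

A side effect of this relationship worth noticing: when the p-value exactly equals the significance level (here z = 1.6449), the observed power is exactly 0.5—hardly the “high power” that would be needed to claim support for the null.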

Now let’s compare the interpretation of the observed power from those hoping to support the null against the interpretation of the p-value (provided by frequentist statistics). A high value of observed power is interpreted as strong support for the null, which coincides with a low p-value interpreted as strong support *against* the null (strong yet contradictory statements); a low value of observed power is interpreted as weak support for the null, which coincides with a high p-value interpreted as weak support *against* the null (weak yet also contradictory statements).

**Say that again**

Consider two experiments in which you failed to reject the null of no treatment effects, but in which the first experiment achieved a higher value of observed power than the second. Using the interpretation of observed power above, you would conclude that the first experiment, with higher observed power, provided stronger evidence in support of the null than the second experiment. But higher observed power means a lower p-value, and therefore you would conclude the first experiment provided stronger evidence *against* the null. These are contradictory conclusions, and only the p-value interpretation is a hypothesis test supported by frequentist statistics.
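The two-experiment contradiction can be checked numerically. A minimal sketch, again assuming a one-sided z-test at the 5% level with hypothetical observed statistics of my choosing (both non-significant):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

Z_ALPHA = 1.6449  # approximate one-sided 5% critical value (assumed setup)

# Two hypothetical experiments; both fail to reject (z below the critical value).
results = {}
for label, z_obs in [("experiment 1", 1.3), ("experiment 2", 0.4)]:
    p = 1.0 - phi(z_obs)                  # p-value
    pow_obs = 1.0 - phi(Z_ALPHA - z_obs)  # observed power
    results[label] = (p, pow_obs)
    print(f"{label}: p-value = {p:.3f}, observed power = {pow_obs:.3f}")
```

Experiment 1 comes out with the higher observed power (read as stronger support *for* the null) and simultaneously the lower p-value (read as stronger evidence *against* the null)—the same data cannot sensibly support both readings.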

There are variants on this idea of observed power, such as the detectable or significant effect size, but they’re logically flawed in the same way described above. And we could compare power analysis to confidence intervals, but the point is that nothing is gained from considering power calculations once you have a confidence interval. Power calculations should be reserved for planning the sample size of future studies, and not for making inferences about studies that have already taken place.
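For completeness, this is the legitimate, prospective use of power. A minimal sketch of a sample-size calculation for a one-sided, one-sample z-test at the 5% level targeting 80% power—the test, levels, and the standard formula n = ((z_α + z_β)·σ / d)² are my assumptions, not from the article:

```python
from math import ceil

Z_ALPHA = 1.6449  # approximate one-sided 5% critical value
Z_BETA = 0.8416   # approximate z for 80% power (beta = 0.20)

def sample_size(d, sigma):
    """Smallest n giving roughly 80% power to detect a true mean of d
    when observations have standard deviation sigma (assumed known)."""
    return ceil(((Z_ALPHA + Z_BETA) * sigma / d) ** 2)

# Planning before the study: smaller anticipated effects demand larger samples.
print(sample_size(0.5, 1.0))   # moderate effect
print(sample_size(0.25, 1.0))  # smaller effect, much larger n
```

This is the calculation done *before* data are collected, with a hypothesized effect size—not a post hoc number computed from the very data under test.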