Luk Arbuckle

Posts Tagged ‘hypothesis testing’

Absence of evidence is evidence of absence?

In hypothesis testing on 25 January 2009 at 5:15 pm

In the context of logical reasoning, and using Bayesian probability, you can argue that absence of evidence is, in fact, evidence of absence. Namely, not being able to find evidence for something changes your thinking and can even lead you to reverse your original hypothesis entirely. For example, after failing to find evidence that some medical treatment works, you may begin to think that it doesn’t work. Maybe it’s a placebo. You could, therefore, decide to change your hypothesis and look to design an experiment disproving its effectiveness. Of course, there are no “priors”, in the Bayesian sense, in the frequentist interpretation of hypothesis testing. But, just the same, what does this say about the maxim used in statistical hypothesis testing, that absence of evidence is not evidence of absence? Nick Barrowman has an interesting post on the topic, and I wanted to participate in the discussion:

I interpret “absence of evidence is not evidence of absence” (in the context of hypothesis testing) to mean “failing to reject the null is not equivalent to accepting the null.” I’m thinking of the null hypothesis of “no treatment effects”. You don’t have significant evidence to reject the null, and therefore an absence of evidence of treatment effects, but this is not the same thing as saying you have evidence of no treatment effects (because of the formulation of hypothesis testing, flawed as it may be).

One point, which I believe you are alluding to, is that an equivalence test would be more appropriate. But I’ve heard some statisticians and researchers try to argue that they could use retrospective power to “prove the null” when they are faced with non-significant results. See Abuse of Power [PDF] (this paper was the nail in the coffin, if you will, in a previous discussion I was having with a group of statisticians).

I believe the maxim is simply trying to emphasize that the p-value is calculated having assumed the null, and therefore can’t be used as evidence for the null (as it would be a circular argument). Trying to make more out of the maxim than this may be the sticking point. It’s too simple, and therefore flawed when taken out of this limited context.

I agree with your previous post. If I’m not mistaken, one point was that failing to reject the null means the confidence interval contains a value of “no effect”. But there could still be differences of practical importance, and so failing to reject the null is not the same as showing there’s no effect. The “statistical note” from the BMJ, Absence of evidence is not evidence of absence, seems to be saying the same thing: absence of evidence of a difference is not evidence that there is no difference. Or, absence of evidence of an effect is not evidence of no effect. Because you can’t prove the null using a hypothesis test (you instead need an equivalence test).

I entirely agree with Nick that confidence intervals are clearer. We can’t forget that hypothesis testing, although constructed like a proof by contradiction, has uncertainty (in the form of Type I errors, rejecting the null when it is true, and Type II errors, failing to reject the null when it is false). Its interpretation is, therefore, muddied by uncertainty and inductive reasoning (I had actually forgotten what Nick had written with regards to Popper and Fisher when I was commenting). To be honest, my head is still spinning trying to make sense of all this, but it certainly is an interesting topic.
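To make the confidence interval point concrete, here is a minimal sketch in Python with made-up numbers: a 95% confidence interval that contains the value of “no effect” yet also contains effects large enough to matter in practice. The data, sample size, and effect scale are all hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical small trial: noisy measurements of a treatment effect
# (zero means no effect).
rng = np.random.default_rng(2)
effect = rng.normal(loc=1.0, scale=5.0, size=10)   # made-up data

mean = np.mean(effect)
margin = stats.sem(effect) * stats.t.ppf(0.975, len(effect) - 1)
print(f"95% CI for the mean effect: ({mean - margin:.2f}, {mean + margin:.2f})")

# If the interval runs from, say, -2.5 to 4.5 it contains zero (we fail to
# reject the null) but also contains effects large enough to matter in
# practice, so it is not evidence of "no effect".
```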

But you can show equivalence

In hypothesis testing on 7 November 2008 at 10:49 am

Hopefully it’s clear from previous posts that you can’t prove the null, and you can’t use power to build support for the null.  And this confusion is one reason I don’t like the term “accepting” the null hypothesis.  The question remains, however, of what you can do with a hypothesis that fits what you would normally consider a “null”, but that you would actually like to prove.

To flip the role you would normally attribute to a null hypothesis with that of an alternative hypothesis, you probably need to consider an equivalence test. First you have to nail down an effect size, that is, the maximum amount by which the parameter can deviate (in either direction) while still being considered of no practical or scientific importance. Even if you’re not doing an equivalence test, this question is important in determining sample size, because you want to be sure your results are both statistically and scientifically significant (but calculating sample size [PDF] is the subject of a future blog post).

What’s the difference?
In an equivalence test you take your null hypothesis to be non-equivalence. That is, that the absolute value of the parameter under consideration is greater than or equal to the effect size (the parameter is less than or equal to the negative of the effect size, or greater than or equal to the effect size). The alternative is, therefore, that the absolute value of the parameter is less than the effect size. Note that we don’t care if the parameter has a positive or negative effect—the goal is to reject the null hypothesis so that you can conclude that the effect is not of practical or scientific importance (although there are one-sided equivalence tests as well).

For example, consider a treatment that is believed to be no better or worse than a placebo. The effect size should define the range of values within which the actual treatment effect can be considered to be of no scientific importance (equivalent to the placebo). The null—that there is a scientifically important difference between treatment and placebo—will be rejected if the treatment effect is shown to be smaller than the effect size, that is, to fall within the equivalence margin. Remember that we don’t care whether the treatment effect is positive or negative compared to the placebo in this example, since our goal is to reject the null of a scientifically important difference in either direction.

Two for one
An equivalence test is essentially two one-tailed tests—one test to determine that there is no scientifically important positive effect (it’s no better), and a second test to determine that there is no scientifically important negative effect (it’s no worse). And, as it turns out, the null of the equivalence test is disjoint from the null of a test of significance, so you can carry out both tests at the same significance level. Just to be clear, the test of significance would have a null equal to zero (no treatment effect), and a two-sided alternative not equal to zero (some positive or negative treatment effect).
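As a rough sketch of how the two one-tailed tests fit together, here is a minimal one-sample version in Python. The data, the equivalence margin, and the helper name tost_one_sample are all hypothetical; a real analysis would use a purpose-built routine and a margin justified on scientific grounds.

```python
import numpy as np
from scipy import stats

def tost_one_sample(x, margin, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of a mean to zero.

    Null: |mean| >= margin (a scientifically important effect).
    Alternative: |mean| < margin (no important effect, i.e. equivalence).
    """
    n = len(x)
    mean = np.mean(x)
    se = np.std(x, ddof=1) / np.sqrt(n)

    # Test 1: H0: mean <= -margin  vs  H1: mean > -margin
    p_lower = stats.t.sf((mean + margin) / se, df=n - 1)
    # Test 2: H0: mean >= +margin  vs  H1: mean < +margin
    p_upper = stats.t.cdf((mean - margin) / se, df=n - 1)

    # Equivalence is declared only if BOTH one-sided nulls are rejected,
    # so the overall p-value is the larger of the two.
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha

# Made-up data: small, noisy effects well inside a margin of 1.0.
rng = np.random.default_rng(0)
print(tost_one_sample(rng.normal(0.1, 0.5, 50), margin=1.0))
```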

My focus in this and the last two posts was on hypothesis testing, even though confidence intervals are often preferred for making inferences. This is a reflection of the debate I was dragged into, not of personal preference.  If you’re interested, Nick Barrowman shared a link (in the comments to a previous post) to a website that discusses equivalence testing and confidence intervals (although I don’t agree with their comments that equivalence from the perspective of statistical significance is convoluted).  Regardless, the debate is over (at least for us).

You can’t increase power to prove the null

In hypothesis testing on 31 October 2008 at 5:01 pm

In my last post I discussed the theory of hypothesis testing, and specifically how it does not support the idea of “proving the null hypothesis“.  But I was told that it was only theory and that in practice you could argue that failing to prove the null was, in fact, support for the null if you had high power.  The idea of increasing power (by increasing the sample size) in order to increase support for the null was also thrown around.  Of course, you can argue whatever you like, but that doesn’t make it so.  And in this case we have statistical theory on our side.

We know that a test of statistical significance should have a high probability of rejecting the null hypothesis when it is false (with a fixed probability of rejecting the null, the significance level, when it is true).  This probability is called power, and it guards against false negatives (whereas the significance level guards against false positives).  The question is whether we can use high values of power to prove the null, within the context of hypothesis testing.  A great article on the subject (only six pages long, with references) is Abuse of Power [PDF], which I’ll use as my main reference.
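Power is normally computed before the experiment, for a hypothesized true effect. For reference, here is what that looks like in the simplest case, a one-tailed z test with known standard deviation; the effect, sigma, and sample size below are arbitrary illustrations, not anything from the paper.

```python
from math import sqrt
from scipy import stats

def power_one_tailed_z(effect, sigma, n, alpha=0.05):
    """Probability of rejecting the null (mean = 0) when the true mean
    equals `effect`, for an upper-tailed z test with known sigma."""
    se = sigma / sqrt(n)
    z_alpha = stats.norm.ppf(1 - alpha)   # critical value under the null
    return stats.norm.sf(z_alpha - effect / se)

# A true effect of 0.5 standard deviations with n = 30 gives power of about 0.86.
print(power_one_tailed_z(effect=0.5, sigma=1.0, n=30))
```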

Observe this
Proponents of using power to build evidence in support of the null calculate power using the observed value of the test statistic, calling it the observed power (in the same way a p-value is called the observed significance).  High values of observed power are interpreted as strong support for the null; low values of observed power are interpreted as weak support for the null.  We’ll come back to this shortly to demonstrate the false logic behind this interpretation.

[Figure: example of observed power vs p-value for a one-tailed z test in which α is set to .05. Low p-value, high power; high p-value, low power. But what does this actually tell you?]

For every value of observed power there is a unique p-value, and vice versa. In other words, the observed power is a one-to-one function of the p-value—inferences drawn from one of these observed values must, therefore, coincide with the other. Also, observed power is a decreasing function of the p-value: low p-values coincide with high values of observed power, and high p-values coincide with low values of observed power.
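To see why, note that for a one-tailed z test the observed power can be written entirely in terms of the p-value. A quick sketch, assuming the known-variance z test from the figure above:

```python
from scipy import stats

def observed_power(p_value, alpha=0.05):
    """Observed power of a one-tailed z test, computed from the p-value alone.

    The observed effect is treated as if it were the true effect, which is
    exactly why observed power carries no information beyond the p-value.
    """
    z_alpha = stats.norm.ppf(1 - alpha)
    z_obs = stats.norm.ppf(1 - p_value)     # observed test statistic
    return stats.norm.cdf(z_obs - z_alpha)

# A decreasing one-to-one mapping: low p-value, high observed power.
for p in (0.01, 0.05, 0.20, 0.50):
    print(p, round(observed_power(p), 3))
```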

Now let’s compare the interpretation of the observed power from those hoping to support the null against the interpretation of the p-value (provided by frequentist statistics).  A high value of observed power is interpreted as strong support for the null, which coincides with a low p-value interpreted as strong support against the null (strong yet contradictory statements); a low value of observed power is interpreted as weak support for the null, which coincides with a high p-value interpreted as weak support against the null (weak yet also contradictory statements).

Say that again
Consider two experiments in which you failed to reject the null of no treatment effects, but in which the first experiment achieved a higher value of observed power than the second.  Using the interpretation of observed power above, you would conclude that the first experiment with higher observed power provided stronger evidence in support of the null than the second experiment.  But higher power means a lower p-value, and therefore you would conclude the first experiment provided stronger evidence against the null.  These are contradictory conclusions, and only the interpretation of p-values can be called a hypothesis test (supported by frequentist statistics).

There are variants on this idea of observed power, such as detectable or significant effect size, but they’re logically flawed in the same way described above. And we could compare power analysis to confidence intervals, but the point is that nothing is gained from considering power calculations once you have a confidence interval. Power calculations should be reserved for planning the sample size of future studies, not for making inferences about studies that have already taken place.

You can’t prove the null by not rejecting it

In hypothesis testing on 25 October 2008 at 10:53 am

I was (willingly) dragged into a discussion about “proving the null hypothesis” that I have to include here. But it will end up being three posts since there are different issues to address (basic theory, power, and equivalence). First step is to discuss the theory of hypothesis testing, what it is and what it isn’t, as it’s fundamental to understanding the problem of providing evidence to support the null.

Hypothesis testing is confusing in part because the logical basis on which the concept rests is not usually described: it’s a proof by contradiction. For example, if you want to prove that a treatment has an effect, you start by assuming there are no treatment effects—this is the null hypothesis. You assume the null and use it to calculate a p-value (the probability of measuring a treatment effect at least as strong as what was observed, given that there are no treatment effects). A small p-value is a contradiction to the assumption that the null is true. “Proof”, here, is used loosely—it’s strong enough evidence to cast doubt on the null.
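A minimal sketch of that logic in Python (the data are made up, and a one-sample t test stands in for whatever analysis the actual experiment would call for):

```python
import numpy as np
from scipy import stats

# Hypothetical measurements of a treatment effect (zero means no effect).
rng = np.random.default_rng(0)
effect = rng.normal(loc=0.3, scale=1.0, size=30)   # made-up data

# Assume the null (mean effect = 0) and ask how probable an effect at least
# this extreme would be under that assumption.
t_stat, p_value = stats.ttest_1samp(effect, popmean=0.0)
print(t_stat, p_value)   # a small p-value contradicts the assumed null;
                         # a large one proves nothing either way
```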

The p-value is based on the assumption that the null hypothesis is true. Trying to prove the null using a p-value is, therefore, trying to prove it’s true based on the assumption that it’s true. But we can’t prove the assumption that the null is true as we have already assumed it. The idea of a hypothesis test is to assume the null is true, then use that assumption to build a contradiction against it being true.

Absence of evidence
No conclusion can be drawn if you fail to build a contradiction. Another way to think of this is to remember that the p-value measures evidence against the null, not for it.  And therefore lack of evidence to reject the null does not imply sufficient evidence to support it.  Absence of evidence is not evidence of absence. Some would like to believe that the inability to reject the null suggests the null may be true (and they try to support this claim with high sample sizes, or high power, which I’ll address in a subsequent post).

[Figure: Rejecting the null leaves you with a lot of alternatives. One down, an infinite number to go!]

Failing to reject the null is a weak outcome, and that’s the point. It’s no better than failing to reject the innumerable models that were not tested. Although the null and alternative hypotheses represent a dichotomy (either one is true or the other), together they cover a parameter space. The alternative represents the complement of the space defined by the null, that is, the parameter space minus the null.

In the context of treatment effects, the null is no treatment effects, which represents a single point in the parameter space. But the alternative—some degree of treatment effects—is the complement, which is every point in the parameter space minus the null. If you want to use the theory of hypothesis testing in this way to “prove” the null, you would have to reject probability models for every point in the alternative, which is infinite! Even if you could justify taking a finite number of probability models, with some practical significance to each, it should be clear that it’s not just a matter of failing to reject the null.

I would like to follow up with a discussion of tests of equivalence, but first I need to attack the notion of increasing power to prove the null. As convincing as the above arguments may be, I was told that it’s just theory and that in practice you could get away with a lot less. As though we can ignore theory and reverse the notion of a hypothesis test without demonstrating equivalence. But they use the same faulty logic described above to justify it: if you can’t find a contradiction, then it must be correct.  Game on.

No one understands error bars

In estimation on 26 September 2008 at 12:04 pm

There’s a common misconception regarding error bars: overlap means no statistical significance. Statistical significance is not the only relevant piece of information you can get from error bars (otherwise what would be the point), but it’s the first thing people look for when they see them in a graph. Another common misconception is that error bars are always relevant, and should therefore always be present in a graph of experimental results. If only it were that simple.

Who’s laughing now
A professor of psychology was criticized recently when he posted an article online with a graph that did not include error bars. He followed up with a poll to see if readers understood error bars (most didn’t), and then posted an article about how most researchers don’t understand error bars. He based his post on a relatively large study (of almost 500 participants) that tested researchers who had published in psychology, neuroscience, and medical journals.

One of the articles cited in the study is Inference by Eye: Confidence Intervals and How to Read Pictures of Data [PDF] by Cumming and Finch.  In it the authors describe some pitfalls relating to making inferences from error bars (for both confidence intervals and standard errors).  And they describe rules of thumb (what the authors call rules of eye, since they are rules for making visual inferences).  But note the fine-print: the rules are for two-sided confidence intervals on the mean, with a normally distributed population, used for making single inferences.

Pitfalls
Before you can judge error bars, you need to know what they represent: a confidence interval (and at what level), a standard error, or a standard deviation. Then you need to worry about whether the data is independent (for between-subject comparisons) or paired (such as repeated tests, for within-subject comparisons), and the reason error bars are being reported (for between-subject comparisons, a meta-analysis in which results are pooled, or just to confuse). And these points are not always made clear in figure captions.

For paired or repeated data, you probably don’t care about the error bars on the individual means. For example, confidence intervals on the individual means are of little value for visual inspection—you want to look at the confidence interval on the mean of the differences (which depends on the correlation between the paired measurements, something that can’t be determined visually). In other words, error bars on the individual means probably shouldn’t be there, since they’re misleading.
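Here is a small Python illustration (with invented numbers): the intervals on the two individual means can be wide and overlapping, while the interval on the mean of the differences is narrow, because the pairing induces correlation you can’t see from the separate error bars.

```python
import numpy as np
from scipy import stats

def ci_mean(x, conf=0.95):
    """Two-sided t confidence interval for a mean."""
    m, se = np.mean(x), stats.sem(x)
    h = se * stats.t.ppf((1 + conf) / 2, len(x) - 1)
    return m - h, m + h

# Made-up paired data: 'after' is strongly correlated with 'before'.
rng = np.random.default_rng(1)
before = rng.normal(10.0, 2.0, 25)
after = before + rng.normal(0.5, 0.5, 25)

print(ci_mean(before))          # wide interval on one individual mean...
print(ci_mean(after))           # ...overlapping the other individual mean
print(ci_mean(after - before))  # narrow interval on the mean difference
```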

Rules of thumb
For independent means, error bars representing 95% confidence intervals can overlap and still be statistically significant at the 5% level. Assuming normality, the overlap can be as much as one quarter of the average length of the two intervals. For statistical significance at the 1% level the intervals should not overlap. However, these general rules only apply to sample sizes greater than 10, and only when the confidence intervals don’t differ in length by more than a factor of two.

For independent means and error bars representing standard errors, there should be a gap between the error bars that is at least equal to the average of the two standard errors for statistical significance at the 5% level. This gap has to be at least double for statistical significance at the 1% level. But it’s probably easier to remember that doubling the length of the standard error bars will give you about a 95% confidence interval (from which you can then apply the rules from the previous paragraph). Again, these rules only apply for sample sizes greater than 10.
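Here is a rough numeric check of the 95% confidence interval rule in Python, assuming normality and equal standard errors (a simplified version of the rules of eye; the numbers are purely illustrative):

```python
from math import sqrt
from scipy import stats

def p_from_overlap(overlap_fraction_of_length, se=1.0, conf=0.95):
    """Approximate two-sided p-value for the difference between two
    independent means whose confidence intervals (with equal standard
    errors) overlap by a given fraction of their common length."""
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)        # ~1.96 for 95% CIs
    half_length = z_crit * se                          # CI half-length
    overlap = overlap_fraction_of_length * 2 * half_length
    diff = 2 * half_length - overlap                   # distance between means
    return 2 * stats.norm.sf(diff / (sqrt(2) * se))    # z test on the difference

print(p_from_overlap(0.25))  # overlap of 1/4 the interval length: p ~ 0.04
print(p_from_overlap(0.00))  # intervals just touching:            p ~ 0.006
```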

Constant vigilance
It’s suggested that some researchers may prefer to use standard error bars because they are shorter, and that the researchers are therefore “capitalizing on their readers’ presumed lack of understanding” of error bars. And recall that there is no standard for error bars (even the confidence level can vary). So the responsibility is yours, as the reader, to be vigilant and check the details. Of course, if you’re the one reporting on the data, you should be clear and honest about your results and their implications (directly in the figure captions).

A final note about other information you can get from error bars. The JMP blog posted an article about what you can use error bars for (where I first learned of the discussion, actually), using different types of error bars depending on the purpose (namely, representing variation in a sample of data, uncertainty in a sample statistic, and uncertainty in several sample statistics). It’s a topic unto itself, but it’s interesting to see the different ways you can display (more or less) the same information to get specific ideas across. And that’s the point: error bars are useful when they convey useful information (in a relatively simple way).

Accept the null hypothesis, or fail to reject it?

In hypothesis testing on 8 September 2008 at 1:17 pm

I’m doing a review of basic statistics since I’ll be helping undergrad students, in one-on-one consultation and teaching labs, understand math and stats concepts introduced in their classes.  I also find it useful to step outside the realm of mathematics to interpret and understand the material from a more general perspective.  As such, I’ll likely post on several topics from the perspective of understanding and applying basic statistics.

In my review I’ve started reading The Little Handbook of Statistical Practice by Dallal.  I jumped to Significance Tests to sample the handbook and because, quite frankly, I felt there was something I was conceptually missing about hypothesis testing as an undergrad.  I could churn out the answers, as required, but never felt it was well absorbed.   Dallal’s discussion turned on a light bulb in my head:

Null hypotheses are never accepted. We either reject them or fail to reject them. The distinction between “acceptance” and “failure to reject” is best understood in terms of confidence intervals. Failing to reject a hypothesis means a confidence interval contains a value of “no difference”. However, the data may also be consistent with differences of practical importance. Hence, failing to reject H0 does not mean that we have shown that there is no difference (accept H0).

I like Dallal’s discussion of the topic because of the emphasis on confidence intervals and the distinction between accepting the null and failing to reject it.  It seems odd that I would never have heard of this in my previous studies.  I turned to my intermediate undergrad-level text (by Miller and Miller) to see if I had simply forgotten, and they state the problem as being “to accept the null hypothesis or to reject it in favor of the alternative hypothesis.”  They take the (possibly common) approach of considering a hypothesis test to be a problem in which one of the null hypothesis or the alternative hypothesis will be asserted.  This approach leaves me wholly unsatisfied.

I instead turned to my intermediate grad-level text (by Casella and Berger) for more insight: “On a philosophical level, some people worry about the distinction […] between “accepting” H0 and “not rejecting” H0.”  This sounds promising.  The authors continue with some details and finally state that “for the most part, we will not be concerned with these issues.”  Ugh.  What a disappointing end to what could (or should) have been an interesting discussion.

If we don’t reject the null hypothesis, we don’t conclude that it’s true.  We simply recognize that the null hypothesis is a possibility (it’s something that we could observe).  I believe this is what is meant by “accepting” the null hypothesis—we accept that it is a possibility (the term “accept” is far from precise, after all).   An older text (by Crow, Davis, and Maxfield) reminded me, as did Dallal, that Fisher did not use an alternative hypothesis, and therefore there was no concept of “accepting” an alternative in his construction of significance tests.  Maybe this has something to do with the use of this imprecise term for both H0 and H1 (and somehow involving the “Neyman-Pearson school of frequentist statistics”, which puts an emphasis on the alternative hypothesis, as Dallal points out).

Many texts, and perhaps analysts, discuss “accepting” the null hypothesis as though they were stating that the null hypothesis were in fact true.  Showing that the null hypothesis is true is not the same thing as failing to reject it.  There is a relatively low probability (by construction) of rejecting the null hypothesis when it is in fact true (Type I error).  But if we fail to reject the null hypothesis, what’s the probability of it being true?  Dallal provides an interesting discussion of how “failing to find an effect is different from showing there is no effect!”  Until I find a good counter argument, I’m going to be irked when I hear or read the use of “accepting the null”.

Statistically relevant or statistically significant?

In hypothesis testing on 31 July 2008 at 10:25 am

I came across the use of “statistically relevant” in something I was reading online and, since I had never heard of it before, decided to look it up. But its usage varies. Some use it to mean statistically significant, which seems wrong since we have a precise definition of that, and in other cases I’m not sure what they mean, exactly.

I asked a few people in applied statistics and they had never seen the use of statistically relevant, or come across a formal definition. A long conversation ensued as we attempted to figure out its precise meaning. The term practical significance came up, meaning something that is statistically significant and also of practical use. Medical or health scientists sometimes call this biological significance. The term practical (or biological) relevance also came up, for cases where something is not statistically significant but still of practical use.

Enter philosophy
As it happens, the definition of statistical relevance comes from philosophy (bear with me). The property C is statistically relevant to B within A if and only if P(B, A&C) does not equal P(B, A-C), where P(B, X) denotes the probability of B within the class X. The definition is then used in combination with a partitioning of A via the property C to create a model that states that if P(B, A&C) > P(B, A) then C explains B. It’s a model trying to define what constitutes a “good” explanation.

We can say that “copper (C) is statistically relevant to things that melt at 1083 degrees Celsius (B) within the class of metals (A)”. Considering the definition, we have that P(B, A&C) = 1 (a metal that is copper melts at 1083 degrees) and, given that no other metal melts at 1083 degrees, P(B, A-C) = 0 (a metal that is not copper does not melt at 1083 degrees), which implies statistical relevance.

Note that the property C in the above example partitions the reference set A into (A&C) and (A-C), and P(B, A&C) = 1 > P(B, A) (since copper is the only metal that melts at 1083 degrees, and there are currently 86 known metals, the probability that a metal melts at 1083 degrees is 1/86). Therefore, using this model of a good explanation, we can say that it melts at 1083 degrees because it is copper (or, following the language in the model, that being copper explains why it melts at 1083 degrees).
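For what it’s worth, here is the arithmetic of the copper example spelled out in a few lines of Python (the variable names simply stand in for the conditional probabilities in the definition):

```python
# Copper example: A = metals, B = melts at 1083 degrees Celsius, C = copper.
p_B_given_A_and_C = 1.0       # copper always melts at 1083
p_B_given_A_minus_C = 0.0     # no other metal melts at exactly 1083
p_B_given_A = 1.0 / 86.0      # one metal out of the 86 known metals

# C is statistically relevant to B within A:
print(p_B_given_A_and_C != p_B_given_A_minus_C)   # True

# And under the model of a good explanation, being copper explains
# the melting point:
print(p_B_given_A_and_C > p_B_given_A)            # True
```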

Correlation is not causation
What I’ve found is that people familiar with this definition from philosophy use “A is statistically relevant to B” to mean two things: (i) A is related to B (correlated), (ii) B is explained by A (causal). The definition supports (i), but I believe they’re using it incorrectly in (ii) with the model of a good explanation in mind (which, by the way, is by a researcher named Salmon).

I’m no philosophy major, but I think it’s safe to say that the term statistically relevant should not be confused with statistically significant. Extremely low probability events can be statistically relevant, and since the definition says nothing more than that the probabilities differ (that there is some association), it’s not really saying all that much in the context of statistics. Terms such as practical significance, or practical relevance, seem appropriate in the contexts described above, but avoid using statistically relevant unless you, and your readers, know the definition.