# Luk Arbuckle

## Absence of evidence is evidence of absence?

In hypothesis testing on 25 January 2009 at 5:15 pm

In the context of logical reasoning, and using Bayesian probability, you can argue that absence of evidence is, in fact, evidence of absence. Namely, not being able to find evidence for something changes your thinking and can result in you reversing your original hypothesis entirely. For example, if you fail to find evidence that some medical treatment works, you may begin to think that it doesn’t work. Maybe it’s a placebo. You could, therefore, decide to change your hypothesis and look to create an experiment disproving its effectiveness. Of course, there are no “priors”, in the Bayesian sense, in the frequentist interpretation of hypothesis testing. But, just the same, what does this say about the maxim used in statistical hypothesis testing, that absence of evidence is not evidence of absence? Nick Barrowman has an interesting post on the topic, and I wanted to participate in the discussion:

I interpret “absence of evidence is not evidence of absence” (in the context of hypothesis testing) to mean “failing to reject the null is not equivalent to accepting the null.” I’m thinking of the null hypothesis of “no treatment effects”. You don’t have significant evidence to reject the null, and therefore an absence of evidence of treatment effects, but this is not the same thing as saying you have evidence of no treatment effects (because of the formulation of hypothesis testing, flawed as it may be).

One point, which I believe you are alluding to, is that an equivalence test would be more appropriate. But I’ve heard some statisticians and researchers try to argue that they could use retrospective power to “prove the null” when they are faced with non-significant results. See Abuse of Power [PDF] (this paper was the nail in the coffin, if you will, in a previous discussion I was having with a group of statisticians).

I believe the maxim is simply trying to emphasize that the p-value is calculated having assumed the null, and therefore can’t be used as evidence for the null (as it would be a circular argument). Trying to make more out of the maxim than this may be the sticking point. It’s too simple, and therefore flawed when taken out of this limited context.

I agree with your previous post. If I’m not mistaken, one point was that failing to reject the null means the confidence interval contains a value of “no effect”. But there could still be differences of practical importance, and so failing to reject the null is not the same as showing there’s no effect. The “statistical note” from the BMJ, Absence of evidence is not evidence of absence, seems to be saying the same thing: absence of evidence of a difference is not evidence that there is no difference. Or, absence of evidence of an effect is not evidence of no effect. Because you can’t prove the null using a hypothesis test (you instead need an equivalence test).
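The confidence-interval point can be made concrete with a small sketch. The summary statistics, the "practically important" difference, and the equivalence margin below are all hypothetical numbers chosen for illustration, and the interval uses a simple normal approximation. Checking the 95% interval against a margin is only a simplified stand-in for a formal equivalence test.

```python
from statistics import NormalDist


def normal_ci(estimate, std_error, level=0.95):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical summary statistics for a treatment-minus-control difference.
observed_diff = 4.0
std_error = 3.0

low, high = normal_ci(observed_diff, std_error)

# The interval contains zero, so a 5% two-sided test fails to reject the null.
contains_no_effect = low <= 0.0 <= high

# It also contains a (hypothetical) difference of practical importance.
important_diff = 8.0
contains_important_diff = low <= important_diff <= high

# An equivalence claim would need the whole interval inside a pre-specified
# margin (hypothetical value; a simplified stand-in for an equivalence test).
equivalence_margin = 2.0
shows_equivalence = -equivalence_margin < low and high < equivalence_margin

print(f"95% CI: ({low:.2f}, {high:.2f})")
print(contains_no_effect, contains_important_diff, shows_equivalence)
```

With these numbers the interval runs from about -1.88 to 9.88: the test fails to reject "no effect", yet the data are just as consistent with a large effect, and nothing here supports a claim of equivalence.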

I entirely agree with Nick that confidence intervals are clearer. We can’t forget that hypothesis testing, although constructed like a proof by contradiction, has uncertainty (in the form of Type I errors, rejecting the null when it is true, and Type II errors, failing to reject the null when it is false). Its interpretation is, therefore, muddied by uncertainty and inductive reasoning (I had actually forgotten what Nick had written with regard to Popper and Fisher when I was commenting). To be honest, my head is still spinning trying to make sense of all this, but it certainly is an interesting topic.
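For the Bayesian side of the argument from the start of this post, here is a minimal sketch of how failing to find evidence lowers the probability of a hypothesis. The prior and the two detection probabilities are made-up values, and the function name is mine. The only assumption that matters is that evidence is more likely to turn up when the treatment works than when it doesn't.

```python
def posterior_given_no_evidence(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for P(hypothesis | no evidence was found)."""
    p_none_if_true = 1.0 - p_evidence_if_true
    p_none_if_false = 1.0 - p_evidence_if_false
    numerator = p_none_if_true * prior
    return numerator / (numerator + p_none_if_false * (1.0 - prior))


# Hypothetical numbers: we start undecided about whether the treatment works.
prior = 0.5
p_evidence_if_true = 0.8    # chance a study finds an effect if one exists
p_evidence_if_false = 0.05  # chance of a spurious finding if none exists

posterior = posterior_given_no_evidence(prior, p_evidence_if_true, p_evidence_if_false)
print(f"prior = {prior:.3f}, posterior after finding nothing = {posterior:.3f}")
```

Here the probability that the treatment works drops from 0.5 to roughly 0.17. How far it drops depends entirely on the assumed detection probabilities: a study with little chance of finding a real effect barely moves the prior at all.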

## That confidence interval is a random variable

In estimation on 29 September 2008 at 7:58 pm

People often confuse the meaning of the probability (or confidence) associated with a confidence interval: it is not the probability that the parameter is in a particular interval, but the proportion of intervals, over repeated experiments, that will contain the parameter. No wonder people get confused, as it sounds like the same thing if you’re not paying close attention to the wording. Even then I’m not sure that it’s clear.

Take polling data for elections. When it’s reported that a political party is currently getting a specified level of support (say 37%), with an accuracy of plus or minus some amount (say 2 percentage points), they normally state that the results are true 19 times out of 20 (that’s a 95% confidence level). This means that if they were to repeat the polling 20 times, we would expect about 19 of the 20 resulting intervals to contain the true level of support for that political party. It does not mean that there’s a 95% chance that the true level of support for that political party is within the range of support being quoted (35 to 39%) in that specific poll.

### The intervals, they are a changin’

The point is that the probability statement is about the interval, not the parameter.  Let’s say you’re building a confidence interval of the mean. The population mean is an unknown constant, not a random variable.  The random variables are the sample mean and sample variance used to build the interval, which vary between experiments.  In other words it is the interval that varies and which can be considered a “random variable” of sorts.  Once values for the sample mean and sample variance have been calculated for an interval, it’s not correct to make probability statements about the population mean—that would imply that it’s a random variable.  The population mean is a constant that either is or isn’t in the interval.
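A small simulation may help here. This is a sketch under assumptions of my own: the true support is fixed at 37%, each poll samples 2,400 people (which gives a margin of roughly 2 points), and each interval is the usual normal-approximation interval for a proportion. The parameter never changes; only the intervals do.

```python
import math
import random
from statistics import NormalDist

TRUE_SUPPORT = 0.37   # the fixed parameter (unknown in practice)
SAMPLE_SIZE = 2400    # hypothetical; gives a margin of roughly 2 points
NUM_POLLS = 1000      # hypothetical number of repeated polls
Z = NormalDist().inv_cdf(0.975)

rng = random.Random(2008)  # fixed seed so the run is repeatable


def poll_interval(rng, p, n):
    """Simulate one poll and return its 95% interval for the proportion."""
    supporters = sum(rng.random() < p for _ in range(n))
    p_hat = supporters / n
    margin = Z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


intervals = [poll_interval(rng, TRUE_SUPPORT, SAMPLE_SIZE) for _ in range(NUM_POLLS)]
covered = sum(low <= TRUE_SUPPORT <= high for low, high in intervals)
coverage = covered / NUM_POLLS

print(f"{covered} of {NUM_POLLS} intervals contain the true support ({coverage:.1%})")
```

The share of intervals that contain 37% should land close to 95%. Any single interval from the list either contains 37% or it doesn't, which is exactly why the probability statement belongs to the procedure and not to one computed interval.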

Another point to keep in mind is that all values in a confidence interval are plausible. So although support for a political party may be at 37% (as in the previous example), with an accuracy of plus or minus 2 percentage points the true level of support at the time of polling could be anything from 35 to 39% (with a confidence of 95%). And if you want to compare the level of support between two parties, in general you don’t want much overlap between their intervals (for statistical significance at the 5% level the overlap should be no more than about 1 percentage point of support in our example).
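The overlap figure can be checked with a rough calculation. This sketch assumes the two estimates are independent and each has a margin of 2 points, which is a simplification: within a single poll, support for two parties is negatively correlated, so the real threshold would differ somewhat.

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf(0.975)   # about 1.96 for a 95% interval
margin = 2.0                      # margin of each interval, in percentage points

# Standard error of each estimate, recovered from its margin.
se_each = margin / z

# Standard error of the difference, ASSUMING independent estimates.
se_diff = math.sqrt(2) * se_each

# Smallest gap between the two estimates that is significant at the 5% level.
min_significant_gap = z * se_diff

# Two intervals of half-width `margin` whose centres are `gap` apart
# overlap by 2 * margin - gap.
max_overlap = 2 * margin - min_significant_gap

print(f"minimum significant gap: {min_significant_gap:.2f} points")
print(f"maximum overlap: {max_overlap:.2f} points")
```

Under these assumptions the two parties need to be about 2.83 points apart, which leaves room for roughly 1.17 points of overlap. So intervals that overlap slightly can still reflect a significant difference, while non-overlapping intervals are a stricter requirement than the 5% level demands.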

## Accept the null hypothesis, or fail to reject it?

In hypothesis testing on 8 September 2008 at 1:17 pm

I’m doing a review of basic statistics since I’ll be helping undergrad students, in one-on-one consultation and teaching labs, understand math and stats concepts introduced in their classes.  I also find it useful to step outside the realm of mathematics to interpret and understand the material from a more general perspective.  As such, I’ll likely post on several topics from the perspective of understanding and applying basic statistics.

In my review I’ve started reading The Little Handbook of Statistical Practice by Dallal.  I jumped to Significance Tests to sample the handbook and because, quite frankly, I felt there was something I was conceptually missing about hypothesis testing as an undergrad.  I could churn out the answers, as required, but never felt it was well absorbed.   Dallal’s discussion turned on a light bulb in my head:

Null hypotheses are never accepted. We either reject them or fail to reject them. The distinction between “acceptance” and “failure to reject” is best understood in terms of confidence intervals. Failing to reject a hypothesis means a confidence interval contains a value of “no difference”. However, the data may also be consistent with differences of practical importance. Hence, failing to reject H0 does not mean that we have shown that there is no difference (accept H0).

I like Dallal’s discussion of the topic because of the emphasis on confidence intervals and the distinction between accepting the null and failing to reject it.  It seems odd that I would never have heard of this in my previous studies.  I turned to my intermediate undergrad-level text (by Miller and Miller) to see if I had simply forgotten, and they state the problem as being “to accept the null hypothesis or to reject it in favor of the alternative hypothesis.”  They take the (possibly common) approach of considering a hypothesis test to be a problem in which one of the null hypothesis or the alternative hypothesis will be asserted.  This approach leaves me wholly unsatisfied.

I instead turned to my intermediate grad-level text (by Casella and Berger) for more insight: “On a philosophical level, some people worry about the distinction […] between “accepting” H0 and “not rejecting” H0.”  This sounds promising.  The authors continue with some details and finally state that “for the most part, we will not be concerned with these issues.”  Ugh.  What a disappointing end to what could (or should) have been an interesting discussion.

If we don’t reject the null hypothesis, we don’t conclude that it’s true.  We simply recognize that the null hypothesis is a possibility (it’s something that we could observe).  I believe this is what is meant by “accepting” the null hypothesis—we accept that it is a possibility (the term “accept” is far from precise, after all).   An older text (by Crow, Davis, and Maxfield) reminded me, as did Dallal, that Fisher did not use an alternative hypothesis, and therefore there was no concept of “accepting” an alternative in his construction of significance tests.  Maybe this has something to do with the use of this imprecise term for both H0 and H1 (and somehow involving the “Neyman-Pearson school of frequentist statistics”, which puts an emphasis on the alternative hypothesis, as Dallal points out).

Many texts, and perhaps analysts, discuss “accepting” the null hypothesis as though they were stating that the null hypothesis were in fact true. Showing that the null hypothesis is true is not the same thing as failing to reject it. There is a relatively low probability (by construction) of rejecting the null hypothesis when it is in fact true (Type I error). But if we fail to reject the null hypothesis, what’s the probability of it being true? Dallal provides an interesting discussion of how “failing to find an effect is different from showing there is no effect!” Until I find a good counterargument, I’m going to be irked when I hear or read the use of “accepting the null”.
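To see how often a test fails to reject a null hypothesis that is actually false, here is a minimal simulation. Everything in it is an assumption for illustration: a true mean of 0.3, a known standard deviation of 1, samples of size 20, and a two-sided z-test of "mean = 0" at the 5% level.

```python
import math
import random
from statistics import NormalDist, mean

TRUE_EFFECT = 0.3       # hypothetical: the null (mean = 0) is false
SIGMA = 1.0             # assumed known standard deviation
SAMPLE_SIZE = 20
NUM_EXPERIMENTS = 5000
ALPHA = 0.05

rng = random.Random(1)  # fixed seed so the run is repeatable


def p_value_for_zero_mean(sample, sigma):
    """Two-sided z-test p-value for the null hypothesis 'mean = 0'."""
    z = mean(sample) / (sigma / math.sqrt(len(sample)))
    return 2 * (1 - NormalDist().cdf(abs(z)))


failures = 0
for _ in range(NUM_EXPERIMENTS):
    sample = [rng.gauss(TRUE_EFFECT, SIGMA) for _ in range(SAMPLE_SIZE)]
    if p_value_for_zero_mean(sample, SIGMA) >= ALPHA:
        failures += 1

fail_rate = failures / NUM_EXPERIMENTS
print(f"failed to reject a false null in {fail_rate:.0%} of experiments")
```

With these settings the test misses the effect in roughly three out of four experiments. Reading each of those non-significant results as "accepting the null" would mean asserting something false most of the time.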

## Statistically relevant or statistically significant?

In hypothesis testing on 31 July 2008 at 10:25 am

I came across the use of “statistically relevant” in something I was reading online and, since I had never heard of it before, decided to look it up. But its usage varies. Some use it to mean statistically significant, which seems wrong since we have a precise definition of that, and in other cases I’m not sure what they mean, exactly.

I asked a few people in applied statistics and they had never seen the use of statistically relevant, or come across a formal definition. A long conversation ensued as we attempted to figure out its precise meaning. The term practical significance came up, meaning something that is statistically significant and also of practical use. Medical or health scientists sometimes call this biological significance. The term practical (or biological) relevance also came up for the case that something is not statistically significant but still practical.

### Enter philosophy

As it happens, the definition of statistical relevance is from philosophy (bear with me). The property C is statistically relevant to B within A if and only if P(B, A&C) does not equal P(B, A-C), where P(B, A&C) is the probability of B among the members of A that have property C, and P(B, A-C) is the probability of B among the members of A that do not. The definition is then used in combination with a partitioning of A via a property C to create a model that states that if P(B, A&C) > P(B, A) then C explains B. It’s a model trying to define what constitutes a “good” explanation.

We can say that “copper (C) is statistically relevant to things that melt at 1083 degrees Celsius (B) within the class of metals (A)”. Considering the definition, we have that P(B, A&C) = 1 (it melts at 1083 and is copper) and, given that no other metal melts at 1083 degrees, P(B, A-C) = 0 (it melts at 1083 and is a metal that is not copper), which implies statistical relevance.

Note that property C in the above example partitions the reference set A into (A&C) and (A-C), and P(B, A&C) = 1 > P(B, A) (since copper is the only metal that melts at 1083, and there are currently 86 known metals, the probability that a metal melts at 1083 is 1/86). Therefore, using this model of a good explanation, we can say that it melts at 1083 degrees because it is copper (or, following the language in the model, that it is copper explains why it melts at 1083).
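The copper example can be worked through by counting. In this sketch the list of 86 metals is a placeholder (copper plus 85 made-up names), every metal is treated as equally likely, and the helper `prob` is my own name for "the proportion of a reference class that has a property".

```python
from fractions import Fraction

# Reference class A: 86 metals. Only copper is named; the rest are placeholders.
metals = ["copper"] + [f"metal_{i}" for i in range(1, 86)]

# Property B: melts at 1083 degrees Celsius (only copper, per the example).
melts_at_1083 = {"copper"}


def has_b(metal):
    return metal in melts_at_1083


def prob(has_property, reference_class):
    """P(property, reference_class) as a proportion of the reference class."""
    members = list(reference_class)
    return Fraction(sum(1 for m in members if has_property(m)), len(members))


# Property C (being copper) partitions A into A&C and A-C.
a_and_c = [m for m in metals if m == "copper"]
a_minus_c = [m for m in metals if m != "copper"]

p_b_in_a_and_c = prob(has_b, a_and_c)      # P(B, A&C)
p_b_in_a_minus_c = prob(has_b, a_minus_c)  # P(B, A-C)
p_b_in_a = prob(has_b, metals)             # P(B, A)

statistically_relevant = p_b_in_a_and_c != p_b_in_a_minus_c
c_explains_b = p_b_in_a_and_c > p_b_in_a

print(p_b_in_a_and_c, p_b_in_a_minus_c, p_b_in_a)
print(statistically_relevant, c_explains_b)
```

The three probabilities come out as 1, 0 and 1/86, so both conditions hold: C is statistically relevant to B within A, and under the model C explains B.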

### Correlation is not causation

What I’ve found is that people familiar with this definition from philosophy use “A is statistically relevant to B” to mean two things: (i) A is related to B (correlated), (ii) B is explained by A (causal). The definition supports (i), but I believe they’re using it incorrectly in (ii) with the model of a good explanation in mind (which, by the way, is by a researcher named Salmon).

I’m no philosophy major, but I think it’s safe to say that the term statistically relevant should not be confused with statistically significant. Extremely low-probability events can be statistically relevant, and since it’s not saying anything more than “there’s a slight correlation”, it’s not really saying all that much in the context of statistics. Terms such as practical significance, or practical relevance, seem appropriate in the contexts described above, but avoid using statistically relevant unless you, and your readers, know the definition.