# Luk Arbuckle

## Absence of evidence is evidence of absence?

In hypothesis testing on 25 January 2009 at 5:15 pm

In the context of logical reasoning, and using Bayesian probability, you can argue that absence of evidence is, in fact, evidence of absence.  Namely, not being able to find evidence for something changes your thinking and can lead you to reverse your original hypothesis entirely.  For example, failing to find evidence that some medical treatment works, you may begin to think that it doesn’t work.  Maybe it’s a placebo.  You could, therefore, decide to change your hypothesis and look to design an experiment disproving its effectiveness.  Of course, there are no “priors”, in the Bayesian sense, in the frequentist interpretation of hypothesis testing.  But, just the same, what does this say about the maxim used in statistical hypothesis testing, that absence of evidence is not evidence of absence?  Nick Barrowman has an interesting post on the topic, and I wanted to participate in the discussion:

I interpret “absence of evidence is not evidence of absence” (in the context of hypothesis testing) to mean “failing to reject the null is not equivalent to accepting the null.” I’m thinking of the null hypothesis of “no treatment effects”. You don’t have significant evidence to reject the null, and therefore an absence of evidence of treatment effects, but this is not the same thing as saying you have evidence of no treatment effects (because of the formulation of hypothesis testing, flawed as it may be).

One point, which I believe you are alluding to, is that an equivalence test would be more appropriate. But I’ve heard some statisticians and researchers try to argue that they could use retrospective power to “prove the null” when faced with non-significant results. See Abuse of Power [PDF] (this paper was the nail in the coffin, if you will, in a previous discussion I was having with a group of statisticians).

I believe the maxim is simply trying to emphasize that the p-value is calculated having assumed the null, and therefore can’t be used as evidence for the null (as it would be a circular argument). Trying to make more out of the maxim than this may be the sticking point. It’s too simple, and therefore flawed when taken out of this limited context.

I agree with your previous post. If I’m not mistaken, one point was that failing to reject the null means the confidence interval contains a value of “no effect”. But there could still be differences of practical importance, and so failing to reject the null is not the same as showing there’s no effect. The “statistical note” from the BMJ, Absence of evidence is not evidence of absence, seems to be saying the same thing: absence of evidence of a difference is not evidence that there is no difference. Or, absence of evidence of an effect is not evidence of no effect. Because you can’t prove the null using a hypothesis test (you instead need an equivalence test).
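To make the equivalence-test idea concrete, here is a minimal sketch of the two one-sided tests (TOST) approach: instead of testing against “no effect”, you test whether the mean falls within a stated margin of equivalence. The data, the margin, and the use of a normal approximation in place of the t distribution are all simplifying assumptions for illustration.

```python
import math
import statistics
from statistics import NormalDist

def tost_one_sample(data, margin):
    """Two one-sided tests (TOST) for equivalence: is the mean within
    +/- margin of zero?  Uses a normal approximation instead of the
    t distribution, so this is a sketch of the idea, not a complete
    equivalence procedure."""
    n = len(data)
    m = statistics.mean(data)
    se = statistics.stdev(data) / math.sqrt(n)
    # Test 1: reject "mean <= -margin" for large z
    p_lower = 1 - NormalDist().cdf((m + margin) / se)
    # Test 2: reject "mean >= +margin" for small z
    p_upper = NormalDist().cdf((m - margin) / se)
    return max(p_lower, p_upper)  # small value -> evidence of equivalence

# Hypothetical treatment-effect measurements centred near zero
effects = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.1]
print(tost_one_sample(effects, margin=0.5))   # small: equivalent within 0.5
print(tost_one_sample(effects, margin=0.01))  # large: equivalence not shown
```

Note the asymmetry with an ordinary hypothesis test: here a small p-value is evidence *for* “no meaningful effect”, which is exactly the claim a non-significant result cannot support on its own.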

I entirely agree with Nick that confidence intervals are clearer.  We can’t forget that hypothesis testing, although constructed like a proof by contradiction, has uncertainty (in the form of Type I errors, rejecting the null when it is true, and Type II errors, failing to reject the null when it is false).  Its interpretation is, therefore, muddied by uncertainty and inductive reasoning (I had actually forgotten what Nick had written with regards to Popper and Fisher when I was commenting).  To be honest, my head is still spinning trying to make sense of all this, but it certainly is an interesting topic.

## That confidence interval is a random variable

In estimation on 29 September 2008 at 7:58 pm

People often confuse the meaning of the probability (or confidence) associated with a confidence interval—the probability is not that the parameter is in a particular interval, but that the intervals in repeated experiments will contain the parameter.  No wonder people get confused, as it sounds like the same thing if you’re not paying close attention to the wording.  Even then I’m not sure that it’s clear.

Take polling data for elections.  When it’s reported that a political party is currently getting a specified level of support (say 37%), with an accuracy of plus or minus some amount (say 2%), they normally state that the results are true 19 times out of 20 (that’s a 95% confidence level).  This means that if they were to repeat the polling many times, about 19 intervals out of 20 would contain the true level of support for that political party.  It does not mean that there’s a 95% chance that the true level of support for that political party is within the range of support being quoted (35 to 39%) in that specific poll.

### The intervals, they are a changin’

The point is that the probability statement is about the interval, not the parameter.  Let’s say you’re building a confidence interval of the mean. The population mean is an unknown constant, not a random variable.  The random variables are the sample mean and sample variance used to build the interval, which vary between experiments.  In other words it is the interval that varies and which can be considered a “random variable” of sorts.  Once values for the sample mean and sample variance have been calculated for an interval, it’s not correct to make probability statements about the population mean—that would imply that it’s a random variable.  The population mean is a constant that either is or isn’t in the interval.
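A quick simulation makes the repeated-sampling interpretation concrete: the population mean stays fixed while the interval endpoints jump around from sample to sample. The true mean, standard deviation, sample size, and seed below are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)
TRUE_MEAN = 50.0          # the unknown constant (not a random variable)
N, TRIALS = 30, 2000
T = 2.045                 # 97.5th percentile of t with 29 degrees of freedom

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, 10.0) for _ in range(N)]
    m = statistics.mean(sample)                   # random
    se = statistics.stdev(sample) / N ** 0.5      # random
    if m - T * se <= TRUE_MEAN <= m + T * se:     # the interval moves, not the mean
        covered += 1

print(covered / TRIALS)   # close to 0.95
```

The 95% is a property of the interval-building procedure across repetitions, which is exactly the point of the paragraph above.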

Another point to keep in mind is that all values in a confidence interval are plausible.  So although support for a political party may be at 37% (as in the previous example), with an accuracy of plus or minus 2% the true level of support at the time of polling could be anything from 35 to 39% (with a confidence of 95%).  And if you want to compare the level of support between two parties, in general you don’t want much overlap (for statistical significance at the 5% level the overlap should be no more than 1% support in our example).
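The comparison between two parties can be sketched with a two-sample z-test for proportions. This treats the two estimates as independent, which is a simplification (two parties polled in the same survey are actually negatively correlated); the sample size below is chosen so a 95% interval around 37% is roughly plus or minus 2%, matching the running example.

```python
import math
from statistics import NormalDist

def support_diff_pvalue(p1, p2, n):
    """Two-sided p-value for the difference between two estimated
    support levels from samples of size n, treating the estimates
    as independent (a simplification for same-poll comparisons)."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

n = 2200  # gives roughly a +/- 2% margin at 95% confidence around 37%
print(support_diff_pvalue(0.37, 0.34, n))  # 1% overlap: significant at 5%
print(support_diff_pvalue(0.37, 0.36, n))  # heavy overlap: not significant
```

The numbers line up with the overlap guideline in the text: intervals of 35 to 39% and 32 to 36% overlap by one percentage point, and the difference is just significant at the 5% level.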

## No one understands error bars

In estimation on 26 September 2008 at 12:04 pm

There’s a common misconception regarding error bars: overlap means no statistical significance.  Checking statistical significance is not the only relevant piece of information that you can get from error bars (otherwise what would be the point) but it’s the first thing people look for when they see them in a graph.  Another common misconception is that error bars are always relevant, and should therefore always be present in a graph of experimental results.  If only it were that simple.

### Who’s laughing now

A professor of psychology was criticized recently when he posted an article online with a graph that did not include error bars.  He followed up with a poll to see if readers understood error bars (most didn’t), and then posted an article about how most researchers don’t understand error bars.  He based his post on a relatively large study (of almost 500 participants) that tested researchers who had published in psychology, neuroscience, and medical journals.

One of the articles cited in the study is Inference by Eye: Confidence Intervals and How to Read Pictures of Data [PDF] by Cumming and Finch.  In it the authors describe some pitfalls relating to making inferences from error bars (for both confidence intervals and standard errors).  And they describe rules of thumb (what the authors call rules of eye, since they are rules for making visual inferences).  But note the fine-print: the rules are for two-sided confidence intervals on the mean, with a normally distributed population, used for making single inferences.

### Pitfalls

Before you can judge error bars, you need to know what they represent: a confidence interval (and at what level), the standard error, or the standard deviation.  Then you need to worry about whether the data are independent (for between-subject comparisons) or paired (such as repeated tests, for within-subject comparisons), and the reason error bars are being reported (for between-subject comparisons, a meta-analysis in which results are pooled, or just to confuse).  And these points are not always made clear in figure captions.

For paired or repeated data, you probably don’t care about the error bars on the individual means.  Confidence intervals on the individual means are of little value for visual inspection: what you want is the confidence interval on the mean of the differences, which depends on the correlation between the paired measurements, and that correlation can’t be determined visually.  In other words, error bars on the individual measurements probably shouldn’t be there, since they’re misleading.
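A small numerical sketch of why individual error bars mislead for paired data. The before/after values are made up, and 2.262 is the t critical value for 9 degrees of freedom:

```python
import math
import statistics

# Hypothetical paired before/after measurements for ten subjects
before = [12.1, 11.4, 13.0, 12.8, 11.9, 12.5, 13.2, 11.7, 12.3, 12.9]
after  = [12.6, 12.0, 13.4, 13.5, 12.1, 13.1, 13.8, 12.2, 12.8, 13.3]
T = 2.262  # 97.5th percentile of t with 9 degrees of freedom

def ci95(xs):
    m = statistics.mean(xs)
    h = T * statistics.stdev(xs) / math.sqrt(len(xs))
    return (m - h, m + h)

diffs = [a - b for a, b in zip(after, before)]
print(ci95(before))  # the two individual intervals overlap heavily...
print(ci95(after))
print(ci95(diffs))   # ...yet the interval on the differences excludes zero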

### Rules of thumb

For independent means, error bars representing 95% confidence intervals can overlap and still be statistically significant at the 5% level.  Assuming normality, the overlap can be as much as one quarter of the average length of the two intervals.  For statistical significance at the 1% level the intervals should not overlap.  However, these general rules apply only to sample sizes greater than 10, and only when the confidence intervals don’t differ in length by more than a factor of two.

For independent means and error bars representing standard errors, there should be a gap between the error bars at least equal to the average of the two standard errors for statistical significance at the 5% level.  The gap has to be at least double that for statistical significance at the 1% level.  But it’s probably easier to remember that doubling the length of the standard error bars gives you about a 95% confidence interval (from which you can then apply the rules from the previous paragraph).  Again, these rules only apply for sample sizes greater than 10.
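The rules of eye for confidence intervals can be wrapped in a small helper. This is a screening heuristic under the stated conditions (independent means, n greater than 10, lengths within a factor of two), not a substitute for a test; for standard error bars, double their length first to get approximate 95% intervals:

```python
def rule_of_eye(ci1, ci2):
    """Cumming and Finch's rules of eye for two independent 95%
    confidence intervals, each given as a (low, high) pair."""
    lo = max(ci1[0], ci2[0])
    hi = min(ci1[1], ci2[1])
    overlap = max(0.0, hi - lo)
    avg_length = ((ci1[1] - ci1[0]) + (ci2[1] - ci2[0])) / 2
    if overlap == 0:
        return "suggests p < .01"
    if overlap <= avg_length / 4:
        return "suggests p < .05"
    return "not significant by eye"

print(rule_of_eye((0.0, 4.0), (5.0, 9.0)))  # no overlap
print(rule_of_eye((0.0, 4.0), (3.5, 7.5)))  # overlap of a quarter length
print(rule_of_eye((0.0, 4.0), (2.0, 6.0)))  # heavy overlap
```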

### Constant vigilance

It’s been suggested that some researchers may prefer standard error bars because they are shorter, and that those researchers are therefore “capitalizing on their readers’ presumed lack of understanding” of error bars.  And recall that there is no standard for error bars (even the confidence level of a confidence interval can vary).  So the responsibility is yours, as the reader, to be vigilant and check the details.  Of course, if you’re the one reporting the data, you should be clear and honest about your results and their implications (directly in the figure captions).

A final note about other information you can get from error bars.  The JMP blog posted an article about what you can use error bars for (where I first learned of the discussion, actually), using different types of error bars depending on the purpose (namely, representing variation in a sample of data, uncertainty in a sample statistic, and uncertainty in several sample statistics).  It’s a topic unto itself, but it’s interesting to see the different ways you can display (more or less) the same information to get specific ideas across.  And that’s the point: error bars are useful when they convey useful information (in a relatively simple way).

## Confidence, prediction, and tolerance intervals explained

In estimation on 25 May 2008 at 10:00 am

JMP, a business division of SAS, has a short seven-page white paper that describes the differences between confidence, prediction, and tolerance intervals using a simple manufacturing example. Formulas are provided, along with instructions for using JMP menus to calculate each interval type from a data set.

> Statistical intervals help us to quantify the uncertainty surrounding the estimates that we calculate from our data, such as the mean and standard deviation. The three types of intervals presented here—confidence, prediction and tolerance—are particularly relevant for applications found in science and engineering because they allow us to make very practical claims about our sampled data.


It’s not an eye-opening read per se, but it’s nonetheless important to understand the nuances between the different interval types. The table provided at the end, with an interpretation of each interval type for the example provided, is a good summary of the ideas presented.
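The three interval types for a normal sample can be sketched as follows. The data are made up; 2.262 is the t critical value for 9 degrees of freedom, and 3.379 is a tabulated two-sided 95%/95% tolerance factor for n = 10 (both taken here as assumptions rather than derived):

```python
import math
import statistics

# Hypothetical fill weights (grams) from a manufacturing line
data = [99.8, 100.2, 100.1, 99.9, 100.3, 99.7, 100.0, 100.4, 99.6, 100.0]
n = len(data)
m = statistics.mean(data)
s = statistics.stdev(data)
T = 2.262   # t critical value, df = 9
K = 3.379   # two-sided 95%/95% tolerance factor for n = 10 (table value)

confidence = (m - T * s / math.sqrt(n),
              m + T * s / math.sqrt(n))          # where the true mean lies
prediction = (m - T * s * math.sqrt(1 + 1 / n),
              m + T * s * math.sqrt(1 + 1 / n))  # a single future observation
tolerance  = (m - K * s, m + K * s)              # bounds 95% of the population

print(confidence, prediction, tolerance)
```

Note how the three get progressively wider: pinning down the mean is easier than predicting one new observation, which in turn is easier than bounding most of the population.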