Luk Arbuckle

Posts Tagged ‘statistically significant’

No one understands error bars

In estimation on 26 September 2008 at 12:04 pm

There’s a common misconception regarding error bars: overlap means no statistical significance.  Checking statistical significance is not the only relevant piece of information that you can get from error bars (otherwise what would be the point) but it’s the first thing people look for when they see them in a graph.  Another common misconception is that error bars are always relevant, and should therefore always be present in a graph of experimental results.  If only it were that simple.  

Who’s laughing now
A professor of psychology was criticized recently when he posted an article online with a graph that did not include error bars. He followed up with a poll to see if readers understood error bars (most didn’t), and then posted an article about how most researchers don’t understand error bars. He based his post on a relatively large study (of almost 500 participants) that tested researchers who had published in psychology, neuroscience, and medical journals.

One of the articles cited in the study is Inference by Eye: Confidence Intervals and How to Read Pictures of Data [PDF] by Cumming and Finch. In it the authors describe some pitfalls relating to making inferences from error bars (for both confidence intervals and standard errors). And they describe rules of thumb (what the authors call rules of eye, since they are rules for making visual inferences). But note the fine print: the rules are for two-sided confidence intervals on the mean, with a normally distributed population, used for making single inferences.

Before you can judge error bars, you need to know what they represent: a percent confidence interval, standard error, or standard deviation.   Then you need to worry about whether the data is independent (for between-subject comparisons), or paired (such as repeated tests, for within-subject comparisons), and the reason error bars are being reported (for between-subject comparisons, a meta-analysis in which results are pooled, or just to confuse). And these points are not always made clear in figure captions.

For paired or repeated data, you probably don’t care about the error bars on the individual means. Confidence intervals on those means are of little value for visual inspection; you want to look at the confidence interval on the mean of the differences (which depends on the correlation between the paired measurements, and that can’t be determined visually from the individual intervals). In other words, error bars on the individual measurements probably shouldn’t be there, since they’re misleading.
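To make the paired case concrete, here is a minimal sketch in Python. The function name and the data are made up for illustration (they are not from the study); the point is only that the standard error of the mean difference is computed from the differences themselves, and can be far smaller than what you would get by treating the two sets of measurements as independent.

```python
from math import sqrt
from statistics import mean, stdev


def paired_vs_independent_se(x, y):
    """Return (mean difference, SE from paired differences, SE assuming independence)."""
    n = len(x)
    diffs = [b - a for a, b in zip(x, y)]
    # Paired analysis: work directly with the per-subject differences.
    se_paired = stdev(diffs) / sqrt(n)
    # Treating the samples as independent ignores the correlation between them.
    se_independent = sqrt(stdev(x) ** 2 / n + stdev(y) ** 2 / n)
    return mean(diffs), se_paired, se_independent


# Hypothetical repeated measurements on 12 subjects (illustrative only).
before = list(range(10, 34, 2))
after = [b + (1 if i % 2 == 0 else 2) for i, b in enumerate(before)]

mean_diff, se_paired, se_independent = paired_vs_independent_se(before, after)
print(mean_diff, se_paired, se_independent)
```

With strongly correlated measurements like these, the individual error bars would be wide and overlap heavily, even though the interval on the mean difference is narrow.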

Rules of thumb
For independent means, error bars representing 95% confidence intervals can overlap while the difference between the means is still statistically significant at the 5% level. Assuming normality, the overlap can be as much as one quarter of the average length of the two intervals. For statistical significance at the 1% level the intervals should not overlap. However, these general rules only apply to sample sizes greater than 10, and the confidence intervals can’t differ in length by more than a factor of two.
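The overlap rule for 95% confidence intervals can be written down as a small Python sketch. The function name and return strings are my own, and the inputs are assumed to be the two means with their margins of error (half the full interval length). It is only the rule of thumb, not a replacement for an actual test, and it can't check the normality or sample-size conditions for you.

```python
def ci_rule_of_eye(mean_a, moe_a, mean_b, moe_b):
    """Rule of thumb for two independent means with 95% confidence intervals.

    moe_a and moe_b are margins of error (half the full interval length).
    Assumes normality and sample sizes above 10; neither is checked here.
    """
    if max(moe_a, moe_b) > 2 * min(moe_a, moe_b):
        raise ValueError("interval lengths differ by more than a factor of two")

    # Average full length of the two intervals: (2*moe_a + 2*moe_b) / 2.
    avg_length = moe_a + moe_b

    # Put the interval with the smaller mean first.
    (low_mean, low_moe), (high_mean, high_moe) = sorted(
        [(mean_a, moe_a), (mean_b, moe_b)]
    )
    # Positive: the intervals overlap. Zero or negative: they touch or leave a gap.
    overlap = (low_mean + low_moe) - (high_mean - high_moe)

    if overlap <= 0:
        return "significant at the 1% level"
    if overlap <= avg_length / 4:
        return "significant at the 5% level"
    return "not significant by this rule"


print(ci_rule_of_eye(10, 2, 13, 2))  # significant at the 5% level
print(ci_rule_of_eye(10, 2, 15, 2))  # significant at the 1% level
print(ci_rule_of_eye(10, 2, 11, 2))  # not significant by this rule
```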

For independent means and error bars representing standard errors, there should be a gap between the error bars that is at least equal to the average of the two standard errors for statistical significance at the 5% level. This gap has to be at least double that for statistical significance at the 1% level. But it’s probably easier to remember that doubling the length of the standard error bars will give you about a 95% confidence interval (from which you can then apply the rules from the previous paragraph). Again, these rules only apply for sample sizes greater than 10.
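The standard error version of the rule looks like this as a Python sketch. Again the function name and return strings are my own, and the normality and sample-size conditions are assumed rather than checked.

```python
def se_rule_of_eye(mean_a, se_a, mean_b, se_b):
    """Rule of thumb for two independent means with standard error bars.

    Assumes normality and sample sizes above 10; neither is checked here.
    """
    avg_se = (se_a + se_b) / 2
    # Distance between the facing ends of the two bars (negative if they overlap).
    gap = abs(mean_a - mean_b) - (se_a + se_b)

    if gap >= 2 * avg_se:
        return "significant at the 1% level"
    if gap >= avg_se:
        return "significant at the 5% level"
    return "not significant by this rule"


print(se_rule_of_eye(10, 1, 13, 1))  # significant at the 5% level
print(se_rule_of_eye(10, 1, 14, 1))  # significant at the 1% level
print(se_rule_of_eye(10, 1, 12, 1))  # not significant by this rule
```

Doubling each standard error to get an approximate 95% margin of error and applying the overlap rule for confidence intervals leads to the same cut-offs, which is a useful sanity check on both rules.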

Constant vigilance
It’s suggested that some researchers may prefer to use standard error bars because they are shorter, and that the researchers are therefore “capitalizing on their readers’ presumed lack of understanding” of error bars. And recall that there is no standard for error bars (even the percent confidence interval can vary). So the responsibility is yours, as the reader, to be vigilant and check the details. Of course, if you’re the one reporting on the data, you should be clear and honest about your results and their implications (directly in the figure captions).

A final note about other information you can get from error bars. The JMP blog posted an article about what you can use error bars for (where I first learned of the discussion, actually), using different types of error bars depending on the purpose (namely, representing variation in a sample of data, uncertainty in a sample statistic, and uncertainty in several sample statistics). It’s a topic unto itself but it’s interesting to see the different ways you can display the (more or less) same information to get specific ideas across. And that’s the point: error bars are useful when they convey useful information (in a relatively simple way).

Statistically relevant or statistically significant?

In hypothesis testing on 31 July 2008 at 10:25 am

I came across the use of “statistically relevant” in something I was reading online and, since I had never heard of it before, decided to look it up. But its usage varies. Some use it to mean statistically significant, which seems wrong since we have a precise definition of that, and in other cases I’m not sure what they mean, exactly.

I asked a few people in applied statistics and they had never seen the use of statistically relevant, or come across a formal definition. A long conversation ensued as we attempted to figure out its precise meaning. The term practical significance came up, meaning something that is statistically significant and also of practical use. Medical or health scientists sometimes call this biological significance. The term practical (or biological) relevance also came up for the case that something is not statistically significant but still practical.

Enter philosophy
As it happens, the definition of statistical relevance is from philosophy (bear with me). The property C is statistically relevant to B within A if and only if P(B, A&C) does not equal P(B, A-C). The definition is then used in combination with a partitioning of A via a property C to create a model that states that if P(B, A&C) > P(B, A) then C explains B. It’s a model trying to define what constitutes a “good” explanation.

We can say that “copper (C) is statistically relevant to things that melt at 1083 degrees Celsius (B) within the class of metals (A)”. Considering the definition, we have that P(B, A&C) = 1 (it melts at 1083 and is copper) and, given that no other metal melts at 1083 degrees, P(B, A-C) = 0 (it melts at 1083 and is a metal that is not copper), which implies statistical relevance.

Note that property C in the above example partitions the reference set A into (A&C) and (A-C), and P(B, A&C) = 1 > P(B, A) (since copper is the only metal that melts at 1083, and there are currently 86 known metals, the probability that a metal melts at 1083 is 1/86). Therefore, using this model of a good explanation, we can say that it melts at 1083 degrees because it is copper (or, following the language in the model, that it is copper explains why it melts at 1083).
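The copper example can be checked with a few lines of Python. This is only a sketch: the list of metals is a stand-in with placeholder names (only the count of 86 and copper's melting point come from the example above), and `prob` is a helper I've made up that reads P(B, X) as the fraction of X that has property B.

```python
from fractions import Fraction

# Stand-in for the class of metals (A): 86 entries, of which only copper
# melts at 1083 degrees Celsius. The other names are placeholders.
metals = [{"name": "copper", "melts_at_1083": True}] + [
    {"name": f"metal_{i}", "melts_at_1083": False} for i in range(85)
]


def is_copper(metal):
    return metal["name"] == "copper"


def melts_at_1083(metal):
    return metal["melts_at_1083"]


def prob(has_property, reference_set):
    """P(B, X): fraction of the reference set X that has property B."""
    reference_set = list(reference_set)
    hits = sum(1 for item in reference_set if has_property(item))
    return Fraction(hits, len(reference_set))


p_b_in_a_and_c = prob(melts_at_1083, [m for m in metals if is_copper(m)])
p_b_in_a_minus_c = prob(melts_at_1083, [m for m in metals if not is_copper(m)])
p_b_in_a = prob(melts_at_1083, metals)

# Statistical relevance: P(B, A&C) differs from P(B, A-C).
statistically_relevant = p_b_in_a_and_c != p_b_in_a_minus_c
# The model of a good explanation: P(B, A&C) > P(B, A).
c_explains_b = p_b_in_a_and_c > p_b_in_a

print(p_b_in_a_and_c, p_b_in_a_minus_c, p_b_in_a)  # 1 0 1/86
print(statistically_relevant, c_explains_b)  # True True
```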

Correlation is not causation
What I’ve found is that people familiar with this definition from philosophy use “A is statistically relevant to B” to mean two things: (i) A is related to B (correlated), (ii) B is explained by A (causal). The definition supports (i), but I believe they’re using it incorrectly in (ii) with the model of a good explanation in mind (which, by the way, is by a researcher named Salmon).

I’m no philosophy major, but I think it’s safe to say that the term statistically relevant should not be confused with statistically significant. Extremely low probability events can be statistically relevant, and since it’s not saying anything more than “there’s a slight correlation”, it’s not really saying all that much in the context of statistics. Terms such as practical significance, or practical relevance, seem appropriate in the contexts described above, but avoid using statistically relevant unless you, and your readers, know the definition.