Finally, someone has written a text something like Tufte’s Visual Display of Quantitative Information, but specifically for statistics. Rafe M. J. Donahue, of Biomimetic Therapeutics and Vanderbilt University Medical Center, gave a seminar course on presenting statistical data at a meeting of the American Statistical Association (ASA) in June 2008, and will be giving a similar course in April 2009 (as part of a continuing education program of the ASA). I learned of his course in a recent blog post at Statistical Modeling, Causal Inference, and Social Science.
The current version of Donahue’s text is 100 pages [PDF], but well worth a casual read (it’s not as bad as it sounds, since many of those pages are devoted to visual displays of the ideas he is describing). If you enjoy reading Tufte’s opinions on displaying data, and you have to create charts and diagrams of statistical data, then you should enjoy Donahue’s writing as well. Reading Tufte a couple of years ago had a tremendous impact on my view of visual displays, but the focus here is on statistical data.
The two fundamental acts of science, description and comparison, are facilitated via models. By models, we refer to ideas and explanations that do two things: describe past observations and predict future outcomes. […] Statistical models, then, allow us to describe past observation and predict future within the confines of our understanding of probability and randomness. Statistical models become tools for understanding sources of variation.
A summary of some principles presented by Donahue:
- The exposition of the distribution is paramount.
- Show the atoms; show the data.
- Each datum gets one glob of ink.
- Erase non-data ink; eliminate redundant ink.
- Take time to document and explain.
- The data display is the model.
- Avoid arbitrary summarization, particularly across sources of variation.
- Reward the viewer’s investment in the data display.
- In viewing CDFs, steepness equals dataness.
- Plot cause versus effect.
- Typically, color ought be used for response variables, not design variables—but not always.
- We understand the individual responses by comparing them to a distribution of like individuals.
- Data presentation layouts and designs should be driven by intended use.
- Time series make fine accounting but poor scientific models.
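The CDF principle in the list above can be made concrete with a small calculation. The following is only a minimal sketch, not taken from Donahue’s text: the helper names `ecdf` and `steepness` and the sample values are made up for illustration. It shows that the average slope of an empirical CDF over an interval is the share of observations in that interval per unit width, so the curve is steep exactly where the data are dense.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(ordered, x) / n

    return F


def steepness(F, lo, hi):
    """Average slope of F over (lo, hi]: share of the data per unit of width."""
    return (F(hi) - F(lo)) / (hi - lo)


# Hypothetical sample: seven values clustered near 5, three spread out above.
sample = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.3, 7.0, 9.0, 11.0]
F = ecdf(sample)

dense = steepness(F, 4.5, 5.5)    # 7 of 10 points fall in an interval of width 1
sparse = steepness(F, 6.0, 12.0)  # 3 of 10 points fall in an interval of width 6

print(f"steepness near the cluster: {dense:.2f}")
print(f"steepness in the tail:      {sparse:.2f}")
```

With this made-up sample the curve climbs far faster through the cluster near 5 than through the sparse upper tail, which is one way to read “steepness equals dataness.”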
Naturally Donahue was also influenced by Tufte. As he says, “the idea of analysis is to understand the whole by decomposing into component parts.” And he therefore reminds the reader of Tufte’s principles of analytical design:
- Show comparisons, contrasts, differences.
- Show causality, mechanism, structure, explanation.
- Show multivariate data; that is, show more than 1 or 2 variables.
- Completely integrate words, numbers, images, diagrams.
- Thoroughly describe the evidence.
- Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.