Luk Arbuckle

Posts Tagged ‘models’

Irrational fear of non-normality

In models on 6 July 2008 at 9:42 pm

What do you do if your model errors are not normally distributed? If you intend to use statistical procedures that assume normally distributed residuals, you may find yourself “agonizing over normal probability plots and tests of residuals”. Some leaders at the JMP division of SAS, however, think that might be a waste of time.

The central limit theorem assures us that even if the data are not normal, mean-like statistics still approach normal distributions as the sample size increases. With small samples, these statistics may not be nearly normal, but we don’t have a big enough sample to tell. 
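That convergence is easy to see in a quick simulation of my own (not from their article), assuming NumPy and SciPy are available: sample means of heavily skewed data look much more nearly normal as the sample size grows.

```python
# Sample means of skewed (exponential) data: the skewness of the sampling
# distribution of the mean shrinks as the sample size n grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (5, 30, 200):
    # 2000 sample means, each computed from n exponential draws (skewness ~ 2)
    means = rng.exponential(scale=1.0, size=(2000, n)).mean(axis=1)
    print(f"n={n:3d}  skewness of the sample means = {stats.skew(means):.2f}")
```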

They don’t say to drop tests of normality and normal probability plots of residuals altogether, as they have their place. But their simulations suggest that these tests are unnecessary in most cases (see their article on page 9 of the SPES/Q&P Newsletter for the details). In general, they

recommend plotting residual values versus predicted values, by case order, or versus other variables.  Rather than distributional testing, look for graphical anomalies, especially outliers or patterns that might be a clue to some hidden structure. 
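Those graphical checks are easy to put together. Here is a minimal sketch, assuming NumPy and Matplotlib and using made-up data with a simple least-squares fit (my own illustration, not code from the article):

```python
# Plot residuals versus predicted values and by case order, and look for
# funnels, curvature, drift, or outliers rather than testing for normality.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.standard_t(df=5, size=100)  # mildly heavy-tailed errors

slope, intercept = np.polyfit(x, y, 1)               # simple straight-line fit
predicted = intercept + slope * x
resid = y - predicted

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
axes[0].scatter(predicted, resid)
axes[0].axhline(0, color="grey")
axes[0].set(xlabel="predicted", ylabel="residual", title="residuals vs predicted")
axes[1].plot(resid, marker="o", linewidth=0.5)
axes[1].axhline(0, color="grey")
axes[1].set(xlabel="case order", ylabel="residual", title="residuals by case order")
plt.tight_layout()
plt.show()
```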

Although this is good advice, you may not get buy-in from everyone you work with. No analyst wants to be in the position of having their work questioned when assumptions are found to have been violated. Sometimes it’s easier to just do what is expected, or demanded, although that’s never been my style. Better to keep this one in my back pocket, just in case.

The end of theory?

In models on 30 June 2008 at 10:15 pm

The growth in the amount of data we can store and access for scientific analysis is creating new opportunities for discovery. It is also creating opportunities for the development of new statistical methods and techniques. And, as Chris Anderson professes in Wired Magazine, it will make the scientific method obsolete.

Anderson starts with a quote from statistician George Box to focus on the idea that “all models are wrong”. Prior to the deluge of data, we were limited to the idea that “some are useful”. But now we can take the example set forth by Google and mine vast amounts of data to look for patterns in science. “Correlation supersedes causation”, states Anderson in his concluding remarks—science can move forward without theory or models.

He proposes we consider a revised quote, put forth by a research director at Google: “All models are wrong, and increasingly you can succeed without them.” This quote, however, is about success in business, not science. It’s the difference between engineering and science. Models in science represent our understanding of a process or system. They’re not just there to get us an answer. The goal of science is to understand; the goal of engineering is to solve a problem.

The devil is in the details
Then there are the technical arguments. The algorithms Anderson describes in pushing his ideas forward rely on constraints in the underlying statistical models, and those constraints come from scientific theory. And data mining is no panacea for all research problems: the relevant probability theory requires constraints that cannot be overcome by blind faith in all things Bayesian.
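To see why correlation alone is a shaky foundation, here is a toy sketch of my own (not from Anderson’s article, and assuming only NumPy): mine enough completely unrelated variables and impressive-looking correlations turn up by chance.

```python
# Mining pure noise: with many candidate variables and few observations,
# some "strong" correlations with the target appear by chance alone.
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_vars = 50, 1000
X = rng.standard_normal((n_obs, n_vars))   # unrelated noise variables
target = rng.standard_normal(n_obs)        # a target with no real signal

corr = np.array([np.corrcoef(target, X[:, j])[0, 1] for j in range(n_vars)])
print(f"strongest |correlation| found: {np.abs(corr).max():.2f}")
print(f"variables with |correlation| > 0.3: {int((np.abs(corr) > 0.3).sum())}")
```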

Even if you accept that data mining algorithms require some constraints based on assumptions of some kind, I’m not convinced that they could achieve the level of accuracy required to kill the scientific method. In the domain of military intelligence, to which I’ve had some exposure over the last couple of years, a good model can pick out a needle in a haystack. And a good model depends on accurate theory.

Although Anderson loves to highlight the success of Google, the truth is that search is far from perfect. The “semantic web” is touted as the next big thing to advance the science of search, among other things, but it is based on theory as well as data. Intelligent search is not about crunching more data; it’s about understanding information and reasoning.

Statistical algorithms that are being used to “find patterns where science cannot” are actually part of the scientific method. They are tools to help advance science so we develop a better understanding of our world. They will be used to develop and refine theory. And at least one thing is certain: Anderson has succeeded in creating a lot of buzz by putting forward a controversial idea.

All models are wrong, but some are useful?

In models on 14 June 2008 at 11:02 am

Industrial statistician George Box is credited with the saying that “all models are wrong, but some are useful.” Andrew Gelman, professor of statistics at Columbia, shares some thoughts on the saying:

With a small sample size, you won’t be able to reject even a silly model, and with a huge sample size, you’ll be able to reject any statistical model you might possibly want to use.

I wasn’t familiar with this quote, but the discussion is interesting. I think the following comment made by Gelman captures the point of the quote well (admittedly, Gelman is talking specifically about model checking, but this distilled version works):

[The point is] to understand what aspects of the data are captured by the model and what aspects are not.
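Gelman’s sample-size point is easy to see in a toy simulation (my own sketch, assuming NumPy and SciPy): the same slightly wrong model, “the data are normal”, fit to mildly skewed gamma data, typically survives a goodness-of-fit test at small samples and is soundly rejected at huge ones.

```python
# The same mildly misspecified model: typically not rejected at small n,
# decisively rejected at very large n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
for n in (20, 200, 20_000):
    data = rng.gamma(shape=100.0, scale=1.0, size=n)  # close to, but not, normal
    stat, p = stats.normaltest(data)                  # D'Agostino-Pearson test
    print(f"n={n:6d}  p-value = {p:.3f}")
```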

Gelman wrote his comments after reading a rant (or personal essay) by J. Michael Steele, professor of statistics at Wharton. I particularly like Steele’s discussion of “fitness for purpose”. As he says:

The majority of published statistical methods hunger for one honest example.

Steele highlights the shortcomings of model adequacy and links to a couple of short notes that take the discussion further: Does the model make sense? and Does the model make sense? Part II: Exploiting sufficiency. From there the discussion can explode into articles and chapters from Gelman and Buja (which Steele refers to directly), among others.

I find Steele to be a very practical, down-to-earth sort. I’m going to have to keep an eye on his writing. I only wish he had a blog we could follow.

Econometrics lit review in video

In mixed on 27 May 2008 at 12:45 am

The National Bureau of Economic Research—a private, nonprofit, nonpartisan research organization—has made public an eighteen-hour workshop from its Summer Institute 2007: What’s New in Econometrics? Included are lecture videos, notes, and slides from the series.

The lectures cover recent advances in econometrics and statistics.   The topics include (in the order presented):

  • Estimation of Average Treatment Effects Under Unconfoundedness 
  • Linear Panel Data Models
  • Regression Discontinuity Designs
  • Nonlinear Panel Data Models
  • Instrumental Variables with Treatment Effect Heterogeneity: Local Average Treatment Effects
  • Control Function and Related Methods
  • Bayesian Inference
  • Cluster and Stratified Sampling
  • Partial Identification
  • Difference-in-Differences Estimation (a minimal sketch of this one follows the list)
  • Discrete Choice Models
  • Missing Data
  • Weak Instruments and Many Instruments
  • Quantile Methods
  • Generalized Method of Moments and Empirical Likelihood
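As flagged in the list above, here is a minimal sketch of one of the topics, difference-in-differences estimation, on made-up data (my own illustration, assuming NumPy; not material from the lectures): the estimate is simply the coefficient on the treated-by-post interaction in an ordinary least-squares fit.

```python
# Difference-in-differences on simulated data: the interaction coefficient
# recovers the treatment effect under the usual parallel-trends setup.
import numpy as np

rng = np.random.default_rng(3)
n = 400
treated = rng.integers(0, 2, n)      # group indicator (treated vs control)
post = rng.integers(0, 2, n)         # time indicator (after vs before)
effect = 2.0                         # true treatment effect
y = (1.0 + 0.5 * treated + 1.5 * post
     + effect * treated * post + rng.standard_normal(n))

# OLS with intercept, group, time, and interaction terms
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"difference-in-differences estimate: {beta[3]:.2f}  (true effect = 2.0)")
```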

The speakers explain the material well, including some practical pros and cons of the methods presented. The slides are, however, typically academic: packed with content and equations, with little to support the speaker. In a way that’s to be expected, but it’s surprising given that separate lecture notes are provided.

It takes a bit of time to get into the talks, but once you do there’s lots to learn.  I suggest two open browser windows: one for the videos, one for the slides.  But avoid the temptation to read the slides—the speakers explain the material well and you’ll pick up quite a bit if you can focus on what they’re saying while you stare lovingly at the equations.

Special thanks to John Graves at the Social Science Statistics Blog for posting a notice about the series.