Luk Arbuckle

Posts Tagged ‘media’

The end of theory?

In models on 30 June 2008 at 10:15 pm

The growth in the amount of data we can store and access for scientific analysis is creating new opportunities for discovery. It is also creating opportunities for the development of new statistical methods and techniques. And, as Chris Anderson of Wired Magazine professes, it will make the scientific method obsolete.

Anderson starts with a quote from statistician George Box to focus on the idea that “all models are wrong”. Prior to the deluge of data, we were limited to the idea that “some are useful”. But now we can take the example set forth by Google and mine vast amounts of data to look for patterns in science. “Correlation supersedes causation”, states Anderson in his concluding remarks—science can move forward without theory or models.

He proposes we consider a revised quote, put forth by a research director at Google: “All models are wrong, and increasingly you can succeed without them.” This quote, however, is about success in business, not science. It’s the difference between engineering and science. Models in science represent our understanding of a process or system. They’re not just there to get us an answer. The goal of science is to understand; the goal of engineering is to solve a problem.

The devil is in the details
Then there are the technical arguments. The algorithms that Anderson describes in pushing forward his ideas have constraints in the underlying statistical models that come from scientific theory. And data mining is no panacea for all research problems—the relevant probability theory requires constraints that cannot be overcome by blind faith in all things Bayesian.

Even if you accept that data mining algorithms require some constraints based on assumptions of some kind, I’m not convinced that they could achieve the level of accuracy required to kill the scientific method. In the domain of military intelligence, to which I’ve had some exposure over the last couple of years, a good model can pick out a needle in a haystack. And a good model depends on accurate theory.

Although Anderson loves to highlight the success of Google, the truth is that search is far from perfect. The “semantic web” is touted as the next big thing to advance the science of search, among other things, but it is based on theory as well as data. Intelligent search is not about crunching more data; it’s about understanding information and reasoning.

Statistical algorithms that are being used to “find patterns where science cannot” are actually part of the scientific method. They are tools to help advance science so we develop a better understanding of our world. They will be used to develop and refine theory. And at least one thing is certain: Anderson has succeeded in creating a lot of buzz by putting forward a controversial idea.

How to lie with statistics

In lies on 22 June 2008 at 11:15 pm

How to lie with statistics—I couldn’t ask for a better title to a post on my blog. It is, however, the title of a book that came long before me or my blog. And it is one of the few books about statistics that does not use equations. At a slim 142 pages it goes a long way towards educating the reader about the tricks that are used to “sensationalize, inflate, confuse, and oversimplify”, as the author writes in the introduction.

A few years ago professor J. Michael Steele published a short commemorative article, “Darrell Huff and Fifty Years of How to Lie with Statistics”, as the introduction to a special section of the journal Statistical Science.

Many statisticians are uncomfortable with Huff’s title. We spend much of our lives trying to persuade others of the importance and integrity of statistical analysis, and we are naturally uncomfortable with the suggestion that statistics can be used to craft an intentional lie. Nevertheless, the suggestion is valid.

Steele gives a short biography of Huff, including some of his published works, before jumping into a detailed discussion of the book. He describes the reasons he believes the book has been successful all these years—although he never mentions the obvious lack of equations—and the contents:

  • The first four chapters cover introductory stats (that no one remembers a year after their first stats course),
  • the next three cover graphs (“the most original in the book”, says Steele),
  • a chapter on cause-and-effect,
  • another that argues that if a person seems to be lying with stats then they probably are,
  • and lastly a chapter on critical thinking.

For me the important discoveries are that there exists a book about stats without equations and with a provocative name (sounds like a fun read), and that there is a bunch of related articles I can write about in future posts (also fun reads, but more technical).