# Luk Arbuckle

## Least squares explained simply, and in video

In regression on 13 November 2008 at 5:16 pm

A short five-minute video has been created explaining least squares with JMP.  The author, Lee Creighton, uses a very simple example of fitting a line to data, and considers different measures of a “best” fit (not to ruin the punch line, but least squares has a global minimum).  On the right-hand side of the applet is an error bar (for the particular measure being considered) that changes as the line is changed for the set of data points.

Least squares considered with, well, squares!

What I like about the visualization is that the squared errors are displayed as actual squares.  It seems silly, but I had never thought of squared errors in a literal, geometric way.  I always thought of them as scalars, and nothing more.  But showing actual squares makes for a much better visual representation.
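To make the geometric picture concrete, here is a minimal sketch in Python (not the JMP applet from the video). The data points are made up for illustration, and the function names are my own. Each squared error is the area of a square whose side is the vertical distance from a point to the line, and the least squares line is the one that minimizes the total area.

```python
# Hypothetical data points, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]


def squared_errors(slope, intercept, xs, ys):
    """Area of each error square for the line y = slope * x + intercept."""
    return [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]


def least_squares_line(xs, ys):
    """Closed-form slope and intercept that minimize the total squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(xs, ys)

# Total area of the squares for the fitted line, and for a slightly tilted line.
best_total = sum(squared_errors(slope, intercept, xs, ys))
nudged_total = sum(squared_errors(slope + 0.1, intercept, xs, ys))

print(f"fitted line: y = {slope:.3f} * x + {intercept:.3f}")
print(f"total squared error (fitted): {best_total:.4f}")
print(f"total squared error (nudged): {nudged_total:.4f}")
```

Nudging the slope or intercept away from the fitted values in either direction makes the total area grow, which is the "global minimum" punch line.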

It’s likely we’ll see more video tutorials from Lee Creighton on Vimeo, as he’s added a few new items already (although there’s been no mention of these at the JMP blog, at least not yet).  Of course, this assumes people find them useful and he receives some positive feedback (and I hope he does, as video tutorials in statistics are rare).

Note that I chose not to embed the video into my blog post because I have no control over the formatting, and to ensure Creighton gets clicks to his blog post (give credit where credit is due).  Click on the image to get to the video.

## You have to learn statistical programming

In programming on 17 September 2008 at 2:43 pm

A friend told me that the most valuable job skill he had acquired (at school and on the job) was statistical programming.  I’m not surprised.  Proving theorems is mostly left to academics.  In industry, you need to be able to put that statistical knowledge to practical use, and that means you need to know how to code.  What follows is a list of three main statistical programming environments (in alphabetical order) along with reasons for learning them and other comments:

• R and S-PLUS:  R is a favourite in university stats departments.  Besides being freely available, it is also a powerful statistical programming language based on S.  The “professional” version of S, which includes database connectivity, optimized code, advanced graphics, customer support, etc., is S-PLUS.  The advantage and appeal of R and S-PLUS is the base programming language (which is clean and modern, but requires more programming knowledge than the languages that follow).
• SAS: Type “SAS” into any job search engine and you quickly realize the importance of learning this statistical programming language.  SAS is heavy on procedures that hide the technical details of what is going on.  It’s strong on the day-to-day work of data prep and analysis, but less strong on the coding of new functions.  The language feels old compared to R and S-PLUS, but it is the dominant statistical programming environment in industry.
• SPSS: You’ll find one job listing that mentions SPSS for every five that mention SAS.  Its strength is in the graphical user interface.  Lots of drag and drop.  In other words, the focus is much less on coding than it is on ease of use (for someone who isn’t coding every day, but still needs a statistical environment for running analyses).  My first impression was that the programming language for SPSS is even older and more cumbersome than that of SAS.  But an extension allows you to run Python and R.

All of the above have a wealth of functions and tools.  Budding statisticians need to learn SAS given the market penetration, but there’s no excuse for not learning R as well (since it’s free and very powerful).  I’m using R for time series analysis (since my research will be in the area of frequency-domain methods of time series analysis).  S-PLUS and SPSS have tools for bridging, running, and even compiling R code for use in their respective environments, although I can’t find anything like that in SAS.

I need to advance my knowledge of SAS but for now I’m learning SPSS since it runs on my Mac (unlike SAS), and because I live in a government town—an area where SPSS has good market penetration (at least in some disciplines).   I’m excited to learn that Python and R can be used with SPSS (in part because the programming language bundled with SPSS does not appeal to me, but also because I like the Python and R programming languages).  I’m not sure when I’ll have a chance to dig into this further, but it’s encouraging to know that this functionality is available.

## Irrational fear of non-normality

In models on 6 July 2008 at 9:42 pm

What do you do if your model errors are not normally distributed?  If you intend to use statistical procedures that assume normally distributed residuals, you may think of “agonizing over normal probability plots and tests of residuals”.  Some leaders at the JMP division of SAS, however, think it might be a waste of time.

The central limit theorem assures us that even if the data are not normal, mean-like statistics still approach normal distributions as the sample size increases. With small samples, these statistics may not be nearly normal, but we don’t have a big enough sample to tell.
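The claim in that quote is easy to check with a small simulation. What follows is a sketch of my own, not the simulation from the newsletter article: it draws samples from a strongly skewed exponential distribution and measures the skewness of the resulting sample means. The sample sizes, repetition count, and seed are arbitrary choices.

```python
import random
import statistics


def skewness(values):
    """Sample skewness; close to 0 for a symmetric distribution such as the normal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def simulate_means(sample_size, repetitions, rng):
    """Means of repeated samples drawn from a skewed exponential distribution."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(repetitions)
    ]


rng = random.Random(2008)  # fixed seed so the run is repeatable

skew_small = skewness(simulate_means(2, 5000, rng))
skew_large = skewness(simulate_means(50, 5000, rng))

print(f"skewness of means, samples of size 2:  {skew_small:.2f}")
print(f"skewness of means, samples of size 50: {skew_large:.2f}")
```

The means of the larger samples should come out far less skewed than the means of the tiny ones, even though the underlying data are nowhere near normal.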

They don’t say to drop the use of tests of normality and normal probability plots of residuals, as they have their place.  But their simulations suggest that these tests are unnecessary in most cases (see their article on page 9 of the SPES/Q&P Newsletter for the details).  In general, they

recommend plotting residual values versus predicted values, by case order, or versus other variables.  Rather than distributional testing, look for graphical anomalies, especially outliers or patterns that might be a clue to some hidden structure.
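The recommendation is graphical, but the same idea can be roughed out in a few lines of Python. This is a crude, text-only stand-in for a residual plot: the data are hypothetical (roughly y = 2x + 1 with one suspicious case), and the cutoff of two standard deviations is an arbitrary choice of mine, not something from the article.

```python
import statistics

# Hypothetical data: roughly y = 2x + 1, with one suspicious case at x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.0, 9.1, 16.0, 13.0, 14.9, 17.1]


def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for a single predictor."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(xs, ys)
predicted = [slope * x + intercept for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Scale residuals by their standard deviation; the cutoff of 2 is arbitrary.
scale = statistics.pstdev(residuals)
flagged = [x for x, r in zip(xs, residuals) if abs(r) / scale > 2]

# Residuals listed against predicted values, the pairs one would normally plot.
for p, r in zip(predicted, residuals):
    print(f"predicted {p:6.2f}   residual {r:6.2f}")
print("cases worth a closer look (x values):", flagged)
```

A real plot is still better for spotting curvature or funnel shapes, which a simple cutoff like this one will miss.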

Although this is good advice, you may not get buy-in with everyone you work with.  No analyst wants to be in a position of having their work questioned when assumptions are found to have been violated.  Sometimes it’s easier to just do what is expected, or demanded, although that’s never been my style—better keep this one in my back pocket, just in case.

## Confidence, prediction, and tolerance intervals explained

In estimation on 25 May 2008 at 10:00 am

JMP, a business division of SAS, has a short seven-page white paper that describes the differences between confidence, prediction, and tolerance intervals using a simple manufacturing example. Formulas are provided along with instructions for using JMP menus to calculate the interval types from a data set.

Statistical intervals help us to quantify the uncertainty surrounding the estimates that we calculate from our data, such as the mean and standard deviation. The three types of intervals presented here—confidence, prediction and tolerance—are particularly relevant for applications found in science and engineering because they allow us to make very practical claims about our sampled data.


It’s not an eye-opening read per se, but it’s nonetheless important to understand the nuances between the different interval types. The table provided at the end, with an interpretation of each interval type for the example provided, is a good summary of the ideas presented.
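To see how the three interval types differ in practice, here is a minimal sketch using the standard textbook forms for normally distributed data. These are not necessarily the exact formulas from the white paper. The sample is made up, and the two constants are tabulated values for n = 10 at the 95% level that I am assuming; check them against a table before relying on them.

```python
import math
import statistics

# Hypothetical sample of 10 measurements, made up for illustration.
sample = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0, 10.1, 9.9]

n = len(sample)
mean = statistics.fmean(sample)
sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)

# Tabulated constants for n = 10 (assumed values; verify against a table).
T_CRIT = 2.262  # t quantile at 0.975 with 9 degrees of freedom
K_TOL = 3.379   # two-sided tolerance factor: 95% coverage, 95% confidence


def interval(center, half_width):
    """Symmetric interval around a center value."""
    return (center - half_width, center + half_width)


# Confidence interval: where the true mean plausibly lies.
ci = interval(mean, T_CRIT * sd / math.sqrt(n))

# Prediction interval: where a single future observation plausibly falls.
pi = interval(mean, T_CRIT * sd * math.sqrt(1 + 1 / n))

# Tolerance interval: a range meant to cover 95% of the whole population.
ti = interval(mean, K_TOL * sd)

for name, (low, high) in [("confidence", ci), ("prediction", pi), ("tolerance", ti)]:
    print(f"{name:>10}: ({low:.3f}, {high:.3f})")
```

All three are centered on the sample mean. They differ only in the multiplier applied to the standard deviation, which is why the confidence interval is the narrowest and the tolerance interval the widest.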