A friend told me that the most valuable job skill he had acquired (at school and on the job) was statistical programming. I’m not surprised. Proving theorems is mostly left to academics. In industry, you need to be able to put that statistical knowledge to practical use, and that means you need to know how to code. What follows is a list of three main statistical programming environments (in alphabetical order) along with reasons for learning them and other comments:

- R and S-PLUS: R is a favourite in university stats departments. Besides being freely available, it is also a powerful statistical programming language based on S. The “professional” version of S, which includes database connectivity, optimized code, advanced graphics, customer support, etc., is S-PLUS. The advantage and appeal of R and S-PLUS is the base programming language, which is clean and modern but requires more programming knowledge than the languages that follow.
- SAS: Type “SAS” into any job search engine and you quickly realize the importance of learning this statistical programming language. SAS is heavy on procedures that hide the technical details of what is going on. It’s strong on the day-to-day work of data prep and analysis, but less strong on the coding of new functions. The language feels old compared to R and S-PLUS, but it is the dominant statistical programming environment in industry.
- SPSS: You’ll find one job listing that mentions SPSS for every five that mention SAS. Its strength is in the graphical user interface, with lots of drag and drop. In other words, the focus is much less on coding than it is on ease of use (for someone who isn’t coding every day, but still needs a statistical environment for running analyses). My first impression was that the programming language for SPSS is even older and more cumbersome than that of SAS. But an extension allows you to run Python and R.

All of the above have a wealth of functions and tools. Budding statisticians need to learn SAS given the market penetration, but there’s no excuse for not learning R as well (since it’s free and very powerful). I’m using R for time series analysis (since my research will be in the area of frequency-domain methods of time series analysis). S-PLUS and SPSS have tools for bridging, running, and even compiling R code for use in their respective environments, although I can’t find anything like that in SAS.

I need to advance my knowledge of SAS, but for now I’m learning SPSS since it runs on my Mac (unlike SAS), and because I live in a government town, an area where SPSS has good market penetration (at least in some disciplines). I’m excited to learn that Python and R can be used with SPSS (in part because the programming language bundled with SPSS does not appeal to me, but also because I like the Python and R programming languages). I’m not sure when I’ll have a chance to dig into this further, but it’s encouraging to know that this functionality is available.

I agree completely. Knowing any one of these well is a great marketable skill to have. Knowing two (or all three) is even better, because none of them covers every statistical method. Researchers who know only one statistics package can get stuck when they suddenly need to use a complicated statistical method that isn’t available in their stats package. Then they have to learn a new stats package and a difficult statistical method at the same time.

24 September 2008 at 9am