by ahi 5694 days ago
I predict a 10-year campaign of conquest followed by a 30-year death march. R is a complete mess that kicks ass in its niche. There are too many data types and the syntax seems kind of random, but two lines of R can get you publication-quality graphics.

R is really becoming huge in academia. As far as I can tell, health sciences is the last SAS holdout. I expect it to take over business as well. Biz types will love it because it's so powerful as a scripting environment, but the programmers building and maintaining stuff with it will come to loathe it. R will become the PHP of analysis: ubiquitous but hated, and no one will have the chutzpah to fix it.

Random aside, anyone notice that the Kiwis are all over R? The original creators and the guy who wrote ggplot2 among many others.

4 comments

Yeah, in my experience, most biostatisticians (especially those involved in public health and clinical research) are SAS folks. Some of that is inertia: a lot of these people learned SAS at the same time they were learning stats. However, I think that most of SAS's continuing prevalence is due to the fact that, for all of its (many, many, many) problems, SAS is a freakin' log chipper when it comes to statistics. It doesn't care how much data you throw at it, or what kinds of crazy and/or exotic statistics you ask it for; if you can decipher its syntax, you can get it to do it.

Even for stuff that a lot of other programs can do just fine, SAS often has an edge. For example, everybody and their brother can do a logistic regression model... but SAS can give you confidence intervals for all kinds of crazy parts of the model that SPSS won't even bother calculating and that R will only give you point estimates for.

The other great thing about SAS is that a lot of the good statistics books from the last twenty or thirty years include SAS sample code. For example, I'm currently having to do some off-the-beaten-path ANOVA stuff, and the reference I'm using (Edwards' "Analysis of Variance for the Behavioral Sciences") uses SAS as its language of choice.

That said, I personally find the SAS "language" to be alternately bewildering and nostalgia-inducing (the "cards" command, anybody?). SAS is the only language about which I can honestly say "it makes R's syntax look clean and predictable". Also, the Windows version of SAS is an absolute abomination from a UI standpoint. And their licensing schemes are draconian, and installing the damn thing can easily take an entire day, especially if (say, for example) the installer gets confused because you've already got a JDK installed on your computer. Not that I'm bitter, or anything...

Of course, as others have noted, in bioinformatics, R either is already the default or is almost there. I know that in my department's bioinformatics courses, they use R, Python, and Perl almost exclusively, and only break out the SAS when there's something specific they need it to do.

I really don't get where this "too many data types" canard comes from - all you need to know is vector, matrix, and array (1d, 2d, and nd homogeneous data types), plus list and data frame (1d and 2d heterogeneous data types). On the other hand, the OO systems are somewhat bewildering.
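
To make the two axes concrete (number of dimensions, and whether every element has to share a type), here is a rough sketch in Python. The helper name r_structure is made up for illustration; it is not part of R or of any Python library, and it only encodes the five structures listed above.

```python
# Hypothetical helper (not an R or Python API): maps the two axes named
# above to the core R data structure that covers that combination.
def r_structure(dims: int, homogeneous: bool) -> str:
    """Return the name of the core R structure for a given shape.

    dims        -- number of dimensions (1, 2, or more)
    homogeneous -- True if every element must share one type
    """
    if dims < 1:
        raise ValueError("dims must be at least 1")
    if homogeneous:
        # 1d -> vector, 2d -> matrix, anything higher -> array
        return {1: "vector", 2: "matrix"}.get(dims, "array")
    # The heterogeneous structures in this list stop at two dimensions
    if dims > 2:
        raise ValueError("no heterogeneous structure beyond 2d in this list")
    return {1: "list", 2: "data frame"}[dims]


if __name__ == "__main__":
    cases = [(1, True), (2, True), (3, True), (1, False), (2, False)]
    for dims, homogeneous in cases:
        kind = "homogeneous" if homogeneous else "heterogeneous"
        print(f"{dims}d {kind}: {r_structure(dims, homogeneous)}")
```

Five cells in a two-by-three grid, with one cell (nd heterogeneous) left empty.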

I disagree that no one will have the chutzpah to fix R - I know of at least three groups, including one driven by an extremely serious computer scientist, who are working on either rewrites of the internals or completely new implementations of the language. Even though R has been around longer than languages like Python and Ruby, it hasn't excited the interest of as many CS people, so it's at an early stage of its evolution - it's only now at a point where serious alternative implementations of the core engine are starting to come out.

Personally, I've been working on making much of the core library cleaner and more consistent. I'm completely biased, but I think if you use my packages (ggplot2 for graphics, plyr for apply functions, stringr for strings, lubridate for dates, ...) you'll have many fewer problems. And if you do find inconsistencies, I'm committed to fixing them.

Not sure about the rest of the field of health sciences, but everything in genetics is written in C or in R (and usually as C libs for R). The generation above me used SAS, but they're no longer writing code.

Yeah, Ross Ihaka just received the "Lifetime Achievement in Open Source Award" at the New Zealand Open Source Awards the other day.