| HN Mirror


by chimeracoder 5190 days ago

I have come to love R (for what I use it for), but reading this makes me realize how unusual my R workflow must be, because most of the 'advantages' of Julia over R don't really come up in my daily workflow anymore. That's likely because I've adapted to the shortcomings of R and have twisted other tools to my needs. I'll add Julia to my list of languages to check out in more detail, because perhaps Julia could replace my need for this rather esoteric workflow that I've developed out of sheer necessity.

I use Python (NumPy/SciPy) for most of the data preprocessing, and perhaps that's why. I used to do this in R, and I realized that it's just a lot easier to get done in Python (and it ends up being faster anyway). The problem is that Python/NumPy/SciPy still doesn't lend itself quite as well as R does to certain aspects of the statistician's use case. It's possible that things have changed since the last time I evaluated the two, but I still find it easier to prototype various models in R, even if I do all of the prerequisite data munging in a different environment.

I understand that R, like Perl, is 'blessed' (pun intended) with two different, incompatible type systems - in fact, this is the reason I avoid using R's type system, and whenever I'm advising newcomers, I always recommend the same. I don't write statistical packages, so this doesn't come up, but when I find myself needing to write a method in R, I ask myself if this would actually be done more easily another way instead. Generally, I find the answer is 'yes, yes it would'.

I really do think the problem is the type system. The kind of type system that lends itself well to data manipulation is not the same type system that lends itself well to model manipulation - when I think about it, I've unconsciously segregated my workflow into two parts, doing everything naturally done with Python's type system in Python, and likewise for R. Maybe that's just the way that I happen to approach data manipulation, but I think it's non-coincidental. R's relative homoiconicity (compared to Python) makes it really nice for some things, but there are other warts with its typing that are just too annoying to work around when a Python shell is just a few keystrokes away.

I guess the answer is (as always!) to use a purely homoiconic Lisp dialect, so you get the best of both worlds, but that's asking a lot of statisticians.

I really have come to love R for what it does do, though. Of all the statistical software packages I've seen (comparable: SAS, SPSS, Stata, MATLAB), it's far and away the best (and the GNU license makes it very, very attractive to broke students looking to avoid the still-absurdly-priced student licenses for the alternatives). That said, I still sigh every time I realize that I'm essentially gluing together two separate runtime environments for something that should really be easily integrated. I do what I do now because it ends up being faster than using either Python or R for everything, but it still strikes me as weird that a language so perfect for munging data (Python) can still be so awkward for analyzing it, and vice versa.

7 comments

necubi 5190 days ago

I find that I do the same thing, except with Ruby for data processing instead of Python. It may be that I just don't know R all that well, but there are so many tasks that are incredibly awkward in R, often requiring a third-party library like plyr, yet are easily expressed in a language with more "normal" semantics.

An example, from this week: I have a bunch of CSV data files from various trials of an experiment. I want to combine them into one data frame with a new column that includes an id for trial. This took me about a half hour to figure out in R, and five minutes to write in Ruby.
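For anyone curious what that task looks like outside R, here is a rough sketch in Python using only the standard library (Python rather than Ruby, to match the parent comment). It is not the script described above: the function names are made up, the trial id is assumed to be just each file's 1-based position in the list, and all files are assumed to share the same header.

```python
import csv


def combine_trials(paths):
    """Read each CSV and return every row as a dict, tagged with a trial id."""
    combined = []
    # Assumption: the trial id is the file's 1-based position in `paths`.
    for trial_id, path in enumerate(paths, start=1):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                row["trial"] = trial_id
                combined.append(row)
    return combined


def write_combined(rows, out_path):
    """Write the combined rows to one CSV (assumes a shared header)."""
    if not rows:
        return
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Calling `combine_trials` on a sorted list of file paths and passing the result to `write_combined` gives a single file with an extra `trial` column. Values stay as strings here, since the csv module does no type inference.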

I think the main problem with R is that there's a different way to do everything. It feels like a language that was not so much designed as gradually evolved. In a functional-ish language like Ruby or Python you have a few workhorse data manipulation tools: map, fold, etc. But in R everything is different depending on whether you're dealing with row vectors, column vectors, data frames, or arrays. It makes it hard to generalize over slightly different problems to find common solutions.
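To make the "workhorse tools" point concrete, here is a small Python sketch (the function name is made up for illustration): the same map and fold work unchanged on a list, a tuple, or a generator, so the code doesn't care what shape the container has.

```python
from functools import reduce


def total_of_squares(values):
    """Square each value with map, then fold the results into a sum."""
    squared = map(lambda v: v * v, values)
    # reduce is Python's fold; 0 is the starting accumulator.
    return reduce(lambda acc, v: acc + v, squared, 0)


as_list = [1, 2, 3]
as_tuple = (1, 2, 3)
as_stream = (n for n in range(1, 4))  # a lazy generator over 1, 2, 3

# The same function handles all three iterables without special-casing.
results = [total_of_squares(x) for x in (as_list, as_tuple, as_stream)]
```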

Julia looks really awesome, though, and I'm excited to see something that might be able to replace R and bring all of this comfortably into one language.