by YeGoblynQueenne 3778 days ago
>> But if someone prefers iterative solutions, or that's all they know, why can't R make them just as fast as the vectorised versions?

R is interpreted and dynamically typed, so whenever you assign to a variable, the interpreter has to do some bookkeeping at run time: work out the type of the value, allocate memory for it and so on.

If you write a loop by hand, the interpreter has to do this bookkeeping once for each iteration.

If you write your code in vectorised form, the interpreter can sort out the bookkeeping once and then hand over to the lower-level code (C or Fortran) that the vectorised functions are implemented in.

This can also be further optimised to take advantage of processor vector instructions, parallel processing etc.
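A minimal sketch of the difference, assuming a made-up task (summing the squares of a numeric vector). The names `sum_sq_loop` and `sum_sq_vec` are hypothetical, and timings depend on the machine and R version, so none are quoted:

```r
# Hypothetical illustration: the same computation written two ways.

# Hand-written loop: the interpreter evaluates the body once per element.
sum_sq_loop <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]^2
  }
  total
}

# Vectorised form: `^` and sum() each make a single call into compiled code.
sum_sq_vec <- function(x) {
  sum(x^2)
}

x <- as.numeric(1:1000)

# Both versions agree on the result.
stopifnot(isTRUE(all.equal(sum_sq_loop(x), sum_sq_vec(x))))

# Compare the two on your own machine; repeat the call to get a measurable time.
print(system.time(for (k in 1:200) sum_sq_loop(x)))
print(system.time(for (k in 1:200) sum_sq_vec(x)))
```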

So I'm afraid we can't have our cake and eat it. If we want an interpreted language with somewhat intuitive notation, then it has to have crappy slow loops. If we want a language with fast loops we have to rely on C or Fortran and forget about vectorised notation.

3 comments

> If we want an interpreted language with somewhat intuitive notation, then it has to have crappy slow loops.

Unless you're Julia, JavaScript or Lua with a fiendishly clever virtual machine. Look at the benchmark figure here: http://julialang.org/

Why can't a JIT solve this? It shouldn't need to do the bookkeeping for every iteration if it has JIT compiled it. A JIT should be able to take advantage of processor vector instructions etc.

There's some movement in that direction.

However, the R core committers are not only volunteers; they're all (afaik) academic statisticians. One of the people who made strides in this direction is primarily a computational statistician at Iowa (Luke Tierney / compiler package). Building a high-performance runtime/JIT is well outside their area of expertise.

In retrospect, and I think many of them would agree, building and maintaining their own runtime was a giant mistake. Yet here we are.

Serious compiler people (Jan Vitek, others) have made strides towards a faster implementation (his is in Java / FastR, IIRC), but it suffers from the same problem as CPython: there are millions of lines of C code in packages or internal functions that have the details of the R interpreter / C interface deeply embedded in them. In fact, there's probably far more "R" code written in C than in R. Undoing this mess is not easy, and probably not possible.

Oh, reading "Evaluating the Design of the R Language" [1] will shed some more light on why it's hard to make R run fast.

[1] http://r.cs.purdue.edu/pub/ecoop12.pdf

edited to correctly describe Luke as per gbrown

I think, and I'm pretty sure most of R core would agree, that building and maintaining their own runtime _was_ the right thing to do. Otherwise R would have been at the mercy of maintainers who were interested in problems other than creating an expressive language for data analysis.

I don't think calling Luke an "agricultural statistician" is at all reflective of his work. Not everything in Iowa is corn, and Luke has been working in computationally intensive statistical methodology and statistical software development for decades.

He created Lisp-Stat in the late '80s.

https://www.jstatsoft.org/article/view/v013i09

"While R and Lisp are internally very similar, in places where they differ the design choices of Lisp are in many cases superior. The difficulty of predicting performance and hence writing code that is guaranteed to be efficient in problems with larger data sets is an issue that R will need to come to grips with, and it is not likely that this can happen without some significant design changes."

Hmm, you're quite right; I'm not sure how I came to believe that.

R does actually ship with the ability to byte-compile functions these days, and as that functionality matures it may become the default behavior. It's still better to actually learn the language; it's far easier to optimize something like:

    apply(X, 1, function(x){
        # do stuff to the row of X
    })

than:

    for (i in 1:nrow(X)){
        # do stuff to X[i,], and store it somewhere
    }
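For anyone who wants to try the byte compiler, here is a rough sketch using `cmpfun()` from the `compiler` package. The function `row_sums_loop` and the matrix `X` are made up for illustration. On recent R versions the JIT may already compile closures automatically, so an explicit `cmpfun()` call may make little difference:

```r
library(compiler)  # ships with the standard R distribution

# Hypothetical loop-heavy function used only for illustration.
row_sums_loop <- function(X) {
  out <- numeric(nrow(X))
  for (i in seq_len(nrow(X))) {
    out[i] <- sum(X[i, ])
  }
  out
}

# cmpfun() returns a byte-compiled copy of the function.
row_sums_compiled <- cmpfun(row_sums_loop)

X <- matrix(as.numeric(1:12), nrow = 3)

# The compiled copy must return the same values as the original.
stopifnot(identical(row_sums_loop(X), row_sums_compiled(X)))

# Both also agree with the built-in vectorised rowSums().
stopifnot(isTRUE(all.equal(row_sums_compiled(X), rowSums(X))))
```

Wrapping each version in `system.time()` on a larger matrix is the simplest way to see whether compiling helps for a given workload.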
As far as I know byte-compiling won’t actually alleviate the repeated name lookup (or does it?). Unless the R byte compiler is fiendishly clever, every single name lookup in the loop body will still incur what essentially amounts to a `get(name, environment(), inherits = TRUE)` call.

Probably not, but I'll admit to not having dug into it too deeply. In my initial experiments, I found only modest speed gains when byte compiling. Then again, I'm already using C functions wherever possible.

> If we want a language with fast loops we have to rely on C or Fortran and forget about vectorised notation.

Fortran (Fortran 90 specifically) got vector notation 20 years ago.

I suspected this might be the case but I don't know Fortran. Maybe you're right about Julia and Lua also, I'll have to investigate.