by blahi 3556 days ago
That sounds like bad coders, not that R is bad.

Evidenced by:

>No threading to manage concurrency

R is used in production at EA, Activision, eBay, Trulia, Google, Microsoft and many, many more. Those are just the ones I've seen give talks about regularly scoring more than 1 TB with R.

Every time somebody says R can't be used for large data sets or is slow, I ask for more details, and almost universally the programmer's complete lack of initiative is the weak link.

4 comments

That definitely sounds like bad coders. However, I would say that someone coming from a more classic programming language background will write bad, slow R code by default, especially if they are pressured into delivering fast and don't have time to search for or learn the best solution. I was amazed how often you can solve something with one or two lines in R, and those two lines will have much better performance, readability, maintainability and reliability than something you would write without thinking. But you have to know those two lines, which libraries to use, etc. R is actually an extremely elegant language, and probably the most productive language if you know what you are doing; however, it's not very beginner friendly (especially coming from other languages).
R just does not have robust software engineering tools for anything that even begins to resemble scale, and anybody who says otherwise is denying reality. R can certainly be used in production, but the skeleton framework cannot be R. In my experience, R should be called over RPC only, with all the structure built in something else. R is intrinsically single user / batch, with maybe a shared database, but say goodbye to anything that even starts to approach real time or depends on multiple nodes. In my experience, the only people who insist that R is robust for production inevitably have a vested interest. Any objective programmer can see its greatness but also its glaring flaws.
Riiight. Everybody else is a bad engineer and you are the good one. With the single threaded R code...

edit: The comment above has been extended quite a bit. Initially it was a single (abrasive) sentence. I still stand by my answer however. Somebody who did not turn on multi-threading does not get to criticize R. It is the first thing you learn in any book about R. You have to be almost actively avoiding learning about it. It's in every 3rd blog post and SO question.

Perhaps you should not have started your own comment with the erroneous view that 'bad coders' are to blame when R proves to be deficient at tasks outside what it was designed for.

Oh I further note your R consulting vocation. There you go. Vested interest.

BTW, I love R. But my love is not blind.

Excel is used in production very widely, but I'm sure we all agree it has its limitations.
Do you have personal experience with this kind of hyper-performant R code?
I have experience scoring ~1 TB daily. And a lot of smaller data sets spanning a few hundred gigs.

It's not "hyper performant". Obviously doing things in Scala or C++ will be faster. However, rewriting the models would take months and an entirely different set of skills. That means separate people.

But if somebody says that they use Python instead of R for the speed... that's just bull. For example, pandas, one of the fundamental building blocks, is slower than its counterpart in R.

This is not software engineering or production. It is batch jobs / exploratory analysis. It requires little or no structure apart from the analysis itself.

Also, in anything that has not been coded in C directly underneath, Python is 20x faster and C is 500x faster. R is literally the slowest mainstream language today by a long shot. That's a key consideration for production.

Where did you get those numbers from? They are most definitely wrong unless you don't vectorize your code and run loops all around. A lot of R is actually written in C, so you can squeeze out really good performance if you know what you are doing. I would recommend reading Hadley's Advanced R and profiling your code; I think you might be pleasantly surprised.
I make extensive use of vectorization and use as many calls as I possibly can to the built-ins and/or C-based libraries. However, as you well know, part of the fun in R is applying your own functions, and unless you write these in C, you're back to native R, and that's tediously slow. ggplot is another culprit: an amazing library, but if you're chucking out large amounts of custom charts with it, it takes ages. Base graphics are an order of magnitude faster (if less pretty and less convenient for axis training).
I would also suggest The R Inferno.
Could you talk about some of what you learned from scoring 1 TB daily in R?

How do you even load the data into memory? Is it read from a database or from S3 files?

In that particular case, I used Vertica, which loads data into R really, really fast, and I straight up used a very big machine.

That's not how I approach it most of the time though. I mostly use out-of-memory algorithms, sometimes open source, sometimes Revolution's (now Microsoft). They process things in chunks. You can see biglm and speedglm for quick examples. h2o is also a very popular platform. You should probably check the High Performance Computing CRAN Task View.
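
To make "process things in chunks" concrete, here is a minimal sketch of the general idea, written in Python rather than R and not how biglm or speedglm are implemented internally. It fits a one-predictor regression while holding only a few running sums in memory. The function names and the synthetic data are made up for illustration.

```python
from itertools import islice


def chunks(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def fit_line_in_chunks(rows, chunk_size=1000):
    """Fit y = intercept + slope * x from (x, y) pairs, one chunk at a time.

    Only five running sums are kept, so memory use does not grow
    with the number of rows.
    """
    n = sx = sy = sxx = sxy = 0.0
    for block in chunks(rows, chunk_size):
        n += len(block)
        for x, y in block:
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    # Closed-form least-squares solution from the accumulated sums.
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data source: a generator, so the rows never sit in memory together.
rows = ((float(x), 2.0 * x + 1.0) for x in range(10_000))
intercept, slope = fit_line_in_chunks(rows, chunk_size=500)
```

In practice the generator would be replaced by something that reads blocks from a file or a database cursor. The point is only that each chunk updates a small summary and is then discarded.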

I have also used Netezza and HANA, and both worked well for the purpose. There's also Teradata Aster, but I don't have experience with it. There's also the open-source MonetDB, which has in-database R threads and also an R package similar to RSQLite.

There are also map/reduce packages for Hadoop.

I never would have tried MonetDB (OK, MonetDBLite) if not for this great little tutorial on how to load all of SEER into it:

http://www.asdfree.com/2013/07/analyze-surveillance-epidemio...

Yeah, the presentation and code aren't beautiful, but it does avoid the need to WRITE THE DAMNED THING YOURSELF, which some people apparently will never understand (although they will once they are unemployed). More importantly, it turns out you don't necessarily need Vertica for fast out-of-core loading and processing.

Granted, there are plenty of other ways to work out of core (HDF5, bigMatrix, any random database, blah blah), but this was one that was new to me. And I like it.