by blahi 3556 days ago
I have experience scoring ~ 1TB daily. And a lot of smaller data sets spanning a few hundred gigs.

It's not "hyper performant". Obviously, doing things in Scala or C++ will be faster. However, rewriting the models would take months and an entirely different set of skills. That means separate people.

But if somebody says that they use Python instead of R for the speed... that's just bull. For example, one of the fundamental building blocks, pandas, is slower than its counterpart in R.


This is not software engineering or production. It is batch jobs / exploratory analysis. It requires little or no structure apart from the analysis itself.

Also, in anything that has not been coded directly in C underneath, Python is 20x faster than R and C is 500x faster. R is literally the slowest mainstream language today by a long shot. That's a key consideration for production.

Where did you get those numbers from? They are most definitely wrong unless you don't vectorize your code and run loops all around. A lot of R is actually written in C, so you can squeeze out really good performance if you know what you are doing. I would recommend reading Hadley's Advanced R and profiling your code. I think you might be pleasantly surprised.
I make extensive use of vectorization and use as many calls as I possibly can to the built-ins and/or C-based libraries. However, as you well know, part of the fun in R is applying your own functions, and unless you write these in C, you're back to native R and that's tediously slow. Ggplot is another culprit: amazing library, but if you're chucking out large amounts of custom charts with it, it takes ages. Base graphics are an order of magnitude faster (if less pretty and less convenient for axis training).
I would also suggest The R Inferno.
Could you talk about some of the lessons you learned from scoring 1 TB daily in R?

How do you even load the data into memory? Is it read from a database or S3 files?

In that particular case, I used Vertica, which loads data into R really, really fast, and straight up used a very big machine.

That's not how I approach it most of the time though. I mostly use out-of-memory algorithms, sometimes open source, sometimes Revolution's (now Microsoft). They process things in chunks. You can see biglm and speedglm for quick examples. h2o is also a very popular platform. You should probably check the High Performance Computing CRAN Task View.
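To make the "process things in chunks" idea concrete, here is a minimal sketch in Python (not R, and not how biglm or speedglm are actually implemented; real packages use more careful numerics). It fits a one-predictor least-squares line by keeping only a handful of running sums, so no more than one chunk of rows is ever held in memory. The names `chunks`, `fit_line_in_chunks`, and `chunk_size` are made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[float, float]  # one observation: (x, y)


def chunks(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows from any iterable."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, chunk
        yield batch


def fit_line_in_chunks(rows: Iterable[Row], chunk_size: int = 1000) -> Tuple[float, float]:
    """Fit y = intercept + slope * x, reading the data one chunk at a time.

    Only five running totals survive between chunks, so memory use is
    bounded by `chunk_size`, not by the size of the data set.
    """
    n = 0
    sx = sy = sxx = sxy = 0.0
    for batch in chunks(rows, chunk_size):
        n += len(batch)
        sx += sum(x for x, _ in batch)
        sy += sum(y for _, y in batch)
        sxx += sum(x * x for x, _ in batch)
        sxy += sum(x * y for x, y in batch)

    denom = n * sxx - sx * sx
    if n == 0 or denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

In practice `rows` would be a generator reading from a file or a database cursor rather than a list. The same pattern extends to several predictors by accumulating the cross-product matrices chunk by chunk instead of scalar sums.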

I have also used Netezza and Hana, and both worked well for the purpose. There's also Teradata Aster, but I don't have experience with it. There's also the open-source MonetDB, which has in-database R threads and also an R package similar to RSQLite.

There are also map/reduce packages for Hadoop.

I never would have tried MonetDB (OK, MonetDBLite) if not for this great little tutorial on how to load all of SEER into it:

http://www.asdfree.com/2013/07/analyze-surveillance-epidemio...

Yeah, the presentation and code aren't beautiful, but it does avoid the need to WRITE THE DAMNED THING YOURSELF, which some people apparently will never understand (although they will once they are unemployed). More importantly, it turns out you don't necessarily need Vertica for fast out-of-core loading and processing.

Granted, there are plenty of other ways to work out of core (hdf5, bigMatrix, any random database, blah blah) but this was one that was new to me. And I like it.
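
For the "any random database" route, a rough sketch of the pattern in Python, using the standard-library sqlite3 module as a stand-in (this is not MonetDB or MonetDBLite, and the `scores` table with its `segment` and `score` columns is hypothetical). The rows are streamed into the database, the aggregation runs inside it, and only the small summary comes back into the session.

```python
import sqlite3
from typing import Iterable, List, Tuple


def load_rows(conn: sqlite3.Connection, rows: Iterable[Tuple[str, float]]) -> None:
    """Stream (segment, score) rows into a table without building a list first."""
    conn.execute("CREATE TABLE IF NOT EXISTS scores (segment TEXT, score REAL)")
    # executemany consumes the iterable lazily, so `rows` can be a generator.
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    conn.commit()


def mean_score_by_segment(conn: sqlite3.Connection) -> List[Tuple[str, int, float]]:
    """Let the database do the aggregation and return only the summary rows."""
    cur = conn.execute(
        "SELECT segment, COUNT(*), AVG(score) "
        "FROM scores GROUP BY segment ORDER BY segment"
    )
    return cur.fetchall()
```

A connection from `sqlite3.connect("some_file.db")` keeps the data on disk; `sqlite3.connect(":memory:")` is only useful for trying the functions out.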