by sandGorgon 3556 days ago
Could you talk about some of the lessons you learned from scoring 1 TB daily in R?

How do you even load the data into memory? Is it read from a database or from S3 files?


In that particular case, I used Vertica, which loads data into R really, really fast, and just used a very big machine.

That's not how I approach it most of the time, though. I mostly use out-of-memory algorithms, sometimes open source, sometimes Revolution's (now Microsoft's). They process the data in chunks. See biglm and speedglm for quick examples. h2o is also a very popular platform. You should probably check the High-Performance Computing CRAN Task View.
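To make "process things in chunks" concrete, here is a minimal sketch of a linear regression fitted one chunk at a time. It is written in Python purely for illustration and is not the code of biglm or speedglm: it accumulates the normal equations (X'X and X'y) across chunks, which is a simpler technique than the incremental QR update that biglm is documented to use. The names `chunks`, `fit_in_chunks`, and `synthetic_rows`, and the synthetic data, are all hypothetical.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (predictors, response)


def chunks(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_in_chunks(rows: Iterable[Row], n_predictors: int,
                  chunk_size: int = 1000) -> List[float]:
    """Least squares with an intercept; only one chunk is held in memory."""
    p = n_predictors + 1
    xtx = [[0.0] * p for _ in range(p)]  # running X'X
    xty = [0.0] * p                      # running X'y
    for batch in chunks(rows, chunk_size):
        for predictors, y in batch:
            x = [1.0, *predictors]       # prepend the intercept column
            for i in range(p):
                xty[i] += x[i] * y
                for j in range(p):
                    xtx[i][j] += x[i] * x[j]
    return solve(xtx, xty)


def synthetic_rows(n: int) -> Iterator[Row]:
    """Hypothetical stand-in for a database cursor or file reader."""
    for i in range(n):
        x1, x2 = float(i % 10), float(i % 7)
        yield (x1, x2), 3.0 + 2.0 * x1 - 1.0 * x2


coefficients = fit_in_chunks(synthetic_rows(5000), n_predictors=2,
                             chunk_size=500)
```

The memory footprint depends on the chunk size and the number of predictors, not on the number of rows, which is the property that matters for data that does not fit in RAM.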

I have also used Netezza and HANA, and both worked well for the purpose. There's also Teradata Aster, but I don't have experience with it. And there's the open-source MonetDB, which has in-database R threads and also an R package similar to RSQLite.
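The general pattern with an embedded database looks roughly like the sketch below: load the rows in batches, let the database do the aggregation, and pull only the small summary into the analysis session. This uses Python's built-in `sqlite3` module as a stand-in, not MonetDB or any of the products above. The table `obs`, the `batches` generator, and the data are hypothetical.

```python
import sqlite3
from typing import Iterator, List, Tuple


def batches(n_rows: int, size: int) -> Iterator[List[Tuple[str, float]]]:
    """Hypothetical data source yielding lists of (group, value) rows."""
    batch: List[Tuple[str, float]] = []
    for i in range(n_rows):
        batch.append(("a" if i % 2 == 0 else "b", float(i)))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# ":memory:" keeps the demo self-contained; a file path would keep the
# table on disk, which is the point when the data exceeds RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (grp TEXT, value REAL)")

# Load in batches so the full dataset never sits in a Python list.
for batch in batches(10, 4):
    conn.executemany("INSERT INTO obs VALUES (?, ?)", batch)
conn.commit()

# Aggregate inside the database and fetch only the per-group summary.
summary = conn.execute(
    "SELECT grp, COUNT(*), AVG(value) FROM obs GROUP BY grp ORDER BY grp"
).fetchall()
conn.close()
```

Only the two summary rows cross into the client process, however many rows the table holds.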

There are also map/reduce packages for Hadoop.

I never would have tried MonetDB (OK, MonetDBLite) if not for this great little tutorial on how to load all of SEER into it:

http://www.asdfree.com/2013/07/analyze-surveillance-epidemio...

Yeah, the presentation and code aren't beautiful, but it does avoid the need to WRITE THE DAMNED THING YOURSELF, which some people apparently will never understand (although they will once they are unemployed). More importantly, it turns out you don't necessarily need Vertica for fast out-of-core loading and processing.

Granted, there are plenty of other ways to work out of core (HDF5, bigMatrix, any random database, blah blah), but this was one that was new to me. And I like it.