by jeremystan 3778 days ago
We use R in production in two ways:

1. For batch processes that run daily, hourly or minutely, where the models are rebuilt on every run and outputs (often predictions) are written to a database.
2. For computation of coefficients in large sparse regularized models, where the coefficients are written to a database and scoring is done in another language in real time.

For situations where we want real-time predictions, recommendations or optimizations, we tend to set up Python services instead. For batch processes, you can definitely store models in S3 to re-use them, and I've done that at other companies. But in general I've found it better to rebuild models frequently, and to cache them (for short periods of time only) if they are cost-prohibitive to rebuild.
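For what it's worth, the "cache for short periods only" idea can be sketched in a few lines of Python. This is only an illustration, not the setup described above: `ShortLivedModelCache`, `build_model` and the 60-second TTL are all made-up names and values, and a real job would fit an actual model instead of returning a dict.

```python
import time
from typing import Any, Callable, Optional


class ShortLivedModelCache:
    """Hold one model in memory and rebuild it once it is older than ttl_seconds.

    Hypothetical helper for illustration only.
    """

    def __init__(
        self,
        build_model: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._build_model = build_model
        self._ttl_seconds = ttl_seconds
        self._clock = clock
        self._model: Any = None
        self._built_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        expired = self._built_at is None or (now - self._built_at) >= self._ttl_seconds
        if expired:
            # Rebuild from scratch instead of trusting a stale artifact.
            self._model = self._build_model()
            self._built_at = now
        return self._model


# Demo with a fake clock so no real waiting is needed.
fake_now = [0.0]
build_count = [0]


def build_model() -> dict:
    # Stand-in for an expensive model fit (assumption for this sketch).
    build_count[0] += 1
    return {"version": build_count[0]}


cache = ShortLivedModelCache(build_model, ttl_seconds=60.0, clock=lambda: fake_now[0])
first = cache.get()    # first call builds the model
fake_now[0] = 30.0
second = cache.get()   # still inside the TTL, served from the cache
fake_now[0] = 61.0
third = cache.get()    # TTL exceeded, model is rebuilt
```

The point of the short TTL is that a stale model never lives long; if the rebuild is cheap, the cache can be dropped entirely and the model rebuilt on every run.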


@jeremy - that's helpful! Could you hint at the way you persist sparse models to the DB in R, especially if you are changing your variables pretty frequently? Do you use something like Postgres JSONB (which is funky in R)?

Also, about scoring in another language: is this really worthwhile for you? I have often debated just throwing 128GB of RAM on an R machine and calling it a day. As I figure, your "real time" requirements are probably seconds or even minutes (similar to mine).

On persisting sparse models to the DB: if you use L1 regularization (like Lasso), then many coefficients will be exactly 0 and don't need to be stored or processed. As long as you store coefficients and features in a "tall" format (e.g., user, feature_key, feature_value), space is conserved. Scoring can be done in the DB with joins and group by, or in another language with similar operations.
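A minimal sketch of the tall-format idea, using Python's built-in `sqlite3` purely as a stand-in database. The table names, column names, coefficient values and the trick of treating the intercept as a feature with value 1.0 are all assumptions made for the example, and the score here is just the linear predictor (no link function applied).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coefficients (
        feature_key TEXT PRIMARY KEY,
        coefficient REAL NOT NULL
    );
    CREATE TABLE features (
        user_id       TEXT NOT NULL,
        feature_key   TEXT NOT NULL,
        feature_value REAL NOT NULL
    );
""")

# Hypothetical fitted coefficients; L1 regularization has zeroed out two of them.
fitted = {"intercept": -1.0, "visits": 0.5, "age": 0.0, "is_member": 1.5, "region_eu": 0.0}

# Only non-zero coefficients are persisted.
nonzero = [(key, value) for key, value in fitted.items() if value != 0.0]
conn.executemany("INSERT INTO coefficients VALUES (?, ?)", nonzero)

# Features in tall format: one row per (user, feature_key, feature_value).
conn.executemany("INSERT INTO features VALUES (?, ?, ?)", [
    ("alice", "intercept", 1.0),
    ("alice", "visits", 4.0),
    ("alice", "age", 31.0),       # no stored coefficient, so the join drops it
    ("bob", "intercept", 1.0),
    ("bob", "is_member", 1.0),
    ("bob", "region_eu", 1.0),    # no stored coefficient, so the join drops it
])

# Scoring is a join on feature_key followed by a group by on the user.
rows = conn.execute("""
    SELECT f.user_id, SUM(f.feature_value * c.coefficient) AS score
    FROM features AS f
    JOIN coefficients AS c ON c.feature_key = f.feature_key
    GROUP BY f.user_id
    ORDER BY f.user_id
""").fetchall()
scores = dict(rows)
```

Because zeroed-out coefficients are never stored, the inner join silently skips the matching feature rows, which is where the space and compute savings come from.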

Frequent variable changes can be handled by versioning both the feature table and the model coefficients table, but it takes care.

I haven't used Postgres JSONB, but if you have problems with JSON in R, check out the tidyjson package (which I wrote while dealing with Mongo data previously).

Scoring in another language is best avoided if you can. But supporting R "real time" services will also come with many complications. Hence, we use Python when we really want that.

SparkR was completely unreliable when I first tested it over a year ago, but it may have improved. The Spark Python API still has some limitations compared to Scala, so I would guess the latest SparkR is even further behind, but we haven't tested it. Long term I'd love for that to be the answer to these questions.