Hacker News
by apohn 3556 days ago
I used to work in the consulting arm of a software firm and we wrote and deployed R code in production at many Fortune 500 companies. We worked in almost every industry.

I spent quite a bit of time refactoring bad R code so it could run reliably in a production environment. There is a ton of bad R code out there that barely works for exploratory analysis, let alone a production environment.

So yes, R is used in production environments in a lot of places.

3 comments

Did you guys separate out the R process (or multiple processes?) from the rest of the transaction-processing / other server infrastructure or embed the REngine (which sounds like a bad idea to me; incorrect data serialization can easily crash the whole process)?

What is a stable way to connect (and reconnect!) to R, assuming it was a separate process? I would think that an indirect communication path, such as Server <--> Database <--> R would work best, but I'd love to hear your battle hardened take on it.

We used separate workflows depending on whether the data was streaming or batch oriented (e.g. on-demand or triggered by a user). First I'll talk about batch oriented jobs.

The company I worked for had a Tomcat-based product that exposed R via a RESTful API. It was similar to what you get from AzureML now, except it was on-premises. So basically we would call out to this and configure it to restart R sessions if they crashed or timed out.

In an ideal situation we would isolate this server from the rest of the processing as much as possible. To be honest our server was pretty basic - it served to queue jobs (if needed) and manage R sessions if the server was configured to run multiple sessions. For serious failover we had a second server.

We did try to do as much as possible outside of R, such as data pipelining and ETL. That was done for the obvious reasons, but also because many customers had SQL and data people, but not R people. So if one of their data people understood the data ETL, they could fix it without calling us.

Many customers would never let R connect to a database directly. So they'd have a separate process pull data and write it to disk. Then an R script would be triggered and would pick this data up.
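
As a rough illustration of that file hand-off (not the actual setup described here), the R side can be a small function that picks up whatever the extract process dropped into a directory. The directory layout, the `process_inbox` name, and the expected columns `id` and `value` are all assumptions made up for this sketch.

```r
# Hypothetical sketch: pick up CSV extracts that another process wrote to disk.
# Directory names and the required columns are assumptions, not from the thread.
process_inbox <- function(inbox, done, required_cols = c("id", "value")) {
  dir.create(done, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(inbox, pattern = "\\.csv$", full.names = TRUE)
  results <- list()
  for (f in files) {
    dat <- tryCatch(read.csv(f, stringsAsFactors = FALSE),
                    error = function(e) NULL)
    # Skip unreadable or malformed extracts instead of failing the whole batch.
    if (is.null(dat) || !all(required_cols %in% names(dat))) {
      warning("skipping malformed extract: ", basename(f))
      next
    }
    results[[basename(f)]] <- sum(dat$value, na.rm = TRUE)
    # Move the file so the next run does not process it twice.
    file.rename(f, file.path(done, basename(f)))
  }
  results
}
```

Malformed files stay in the inbox so that someone on the data side can inspect them, which fits the point about letting the customer's own people fix ETL problems.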

I never saw major crashing issues with R in production with batch oriented jobs unless there was something unexpected with the size or type of data. Typically as long as there was time between jobs, R's garbage collector would sort things out and be ready for the next job. Also by the time something made it into production we'd hardened the script, frozen the CRAN package versions, etc. So some small issue wouldn't cause a major issue.
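
One lightweight way to notice when a frozen package set has drifted is to compare installed versions against a pinned list when the script starts. This is only a sketch of the idea: the `check_pins` helper and the pin list are hypothetical, and a dedicated dependency-management tool or a private CRAN mirror would be the more robust route.

```r
# Hypothetical helper: compare installed package versions with a pinned list.
# `pins` is a named list or named character vector, e.g. list(somepkg = "1.2.3").
check_pins <- function(pins) {
  pkgs <- names(pins)
  pinned <- unname(unlist(pins))
  installed <- unname(vapply(pkgs, function(pkg) {
    # packageVersion() errors for packages that are not installed.
    tryCatch(as.character(utils::packageVersion(pkg)),
             error = function(e) NA_character_)
  }, character(1)))
  mismatch <- is.na(installed) | installed != pinned
  data.frame(package = pkgs,
             pinned = pinned,
             installed = installed,
             ok = !mismatch,
             stringsAsFactors = FALSE)
}
```

Putting `stopifnot(all(check_pins(pins)$ok))` at the top of a production script would make it fail fast instead of running against an unexpected package version.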

Streaming data presented its own adventure. To get data into/out of R as quickly as possible, you need to embed the REngine and talk to it via rJava. If we streamed data through R very quickly it would do fine for a while - then you'd see the memory usage go up and the time for each transaction started to vary greatly. Then it would crash.

The solution to this was multiple R sessions and a lot of telemetry. We would track how long each transaction took through R. As soon as we started seeing a lot of variance in the time we'd restart the engine. By running the multiple R sessions in round-robin we'd delay the onset of this instability, and it didn't matter when R sessions needed to be restarted.
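
The telemetry rule described above can be sketched as a rolling window over per-transaction timings, with a restart flagged once the spread gets large relative to the mean. Everything specific here is an assumption for illustration: the window size, the coefficient-of-variation threshold, and the function names are made up, and the actual restart would be handled by whatever manages the sessions.

```r
# Hypothetical sketch: flag an R session for restart when recent transaction
# times become erratic. Window size and threshold are made-up defaults.
new_monitor <- function(window = 50, max_cv = 0.5) {
  times <- numeric(0)
  list(
    record = function(seconds) {
      times <<- tail(c(times, seconds), window)   # keep only the last `window`
      invisible(NULL)
    },
    needs_restart = function() {
      if (length(times) < window) return(FALSE)   # not enough data yet
      cv <- sd(times) / mean(times)               # coefficient of variation
      isTRUE(cv > max_cv)
    },
    reset = function() {
      times <<- numeric(0)
      invisible(NULL)
    }
  )
}

# Round-robin choice of session index for the i-th transaction (1-based).
next_session <- function(i, n_sessions) ((i - 1) %% n_sessions) + 1
```

With one monitor per session, the dispatcher would call `record()` after each transaction and take a session out of rotation for a restart whenever `needs_restart()` returns TRUE, while the remaining sessions keep serving traffic.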

Another trick we used was to cache data in an in-memory database so if something crashed the whole service would restart and pull from the in-memory database instead of trying to fetch old data from the server.

Thanks, this is all quite useful! I faced crashes with REngine + rJava, too, and thought of a DB as an intermediary, but your in-memory DB idea adds an interesting twist that adds performance, too.
I have the same question - how do you use R in production?
R and its libraries are GPL licensed. Is there some corporate license available to prevent companies from being required to publish proprietary code that interacts with R? Or was the usage limited to internal systems?
Thanks to certain popular technologies like Hadoop, a lot of big companies have their legal teams looking at open source licenses as an alternative to the big vendors like IBM. Using R and CRAN is getting easier because of this.

A lot of customers we worked with only provided outputs to external parties via reports, extracts, dashboards, etc. I don't recall a situation where an external person could run an R script (e.g. some of the companies I worked for provided their customers with BI reports). Don't ask me about the legality of that - even if I had an answer I wouldn't say it.

We used to run into all sorts of annoying issues with regard to licensing. For example, I worked at a customer where their scientists were blocked from downloading stuff from CRAN in an ad-hoc way (e.g. install.packages()). And nobody from our team was allowed to send them packages due to fear that they'd blame us for any issues with packages or package licensing.

The end result was a convoluted process for installing R, upgrading R, or anything to do with packages. During one project I was involved in a ridiculously long-winded email chain discussing licensing on a particular library, with the lawyers acting like I had some sort of insight into the mind of the library author. That's the kind of resistance some organizations face when thinking about open-source tools.

What are the hallmarks of bad R code to watch out for and avoid?
What makes R code good for production is basically the same as what makes code in any language good for production. Use functions, local variables, and tests; check for nulls, type issues, etc.
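
To make that checklist concrete, here is a small sketch of a function that validates its inputs before doing any work. The function and its argument names are invented for the example.

```r
# Hypothetical example: fail early with clear messages instead of letting bad
# input propagate through a production script.
mean_by_group <- function(df, value_col, group_col) {
  if (is.null(df) || !is.data.frame(df)) stop("df must be a data frame")
  missing_cols <- setdiff(c(value_col, group_col), names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!is.numeric(df[[value_col]])) stop(value_col, " must be numeric")
  if (nrow(df) == 0) return(numeric(0))
  # tapply returns a named 1-d array; convert it to a plain named vector.
  out <- tapply(df[[value_col]], df[[group_col]], mean, na.rm = TRUE)
  setNames(as.numeric(out), names(out))
}
```

Everything the function touches is passed in or local, so it can be tested in isolation with a handful of small data frames.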

I wouldn't say having loops is always a bad thing. Sometimes writing loops is the only way to solve a particular problem, and loops can be easier to read and debug. Sometimes people say use the apply family of functions instead of loops, but my experience is that in many cases apply will not give you any significant speedup over a loop. I use apply because it's easier to write cleaner code with better flow than loops, not because I expect an automatic speedup.
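
For instance, the two versions below compute the same per-column ranges. The `vapply` one is mostly a readability choice, since it still works through one column at a time. The data is made up.

```r
# Made-up data for illustration.
df <- data.frame(a = c(1, 5, 3), b = c(10, 2, 8), c = c(7, 7, 9))

# Explicit loop: preallocate the result, then fill it in.
ranges_loop <- numeric(ncol(df))
names(ranges_loop) <- names(df)
for (col in names(df)) {
  ranges_loop[col] <- max(df[[col]]) - min(df[[col]])
}

# Same computation with vapply: shorter, and the return type is declared.
ranges_apply <- vapply(df, function(x) max(x) - min(x), numeric(1))

stopifnot(isTRUE(all.equal(ranges_loop, ranges_apply)))
```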

However, if there are loops to do everything, that's a sign of bad R code. For example, if you are using a loop to add numbers in a vector together, that's bad code. That needs to be fixed.
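
The vector-sum case looks like this. Both versions give the same answer, but the second is a single call to a built-in:

```r
x <- c(2, 4, 6, 8)

# Loop version: the pattern to replace.
total_loop <- 0
for (i in seq_along(x)) {
  total_loop <- total_loop + x[i]
}

# Vectorized version: one call to the built-in.
total_vec <- sum(x)

stopifnot(total_loop == total_vec)
```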

A lot of R is also written for exploratory analysis. So it's written without much thought to structure, scope, flow, or much of anything. It's basically like a first draft of a paper. Making this code production ready should not just be putting that code in a function - you need to step back and architect it properly.

There's also a practical matter of how fast it needs to be. I've been involved in projects where a loop based R script was run in batch once a day at 1AM. And the run time for the script was 20 minutes. If we vectorized it, maybe it would run in <1 minute. But why bother if it's run once a day?

For the majority of use cases R code should be vectorized, so I would say if you see a loop in the code that's a red flag and you should check it out.