We used separate workflows depending on whether the data was streaming or batch-oriented (e.g. on-demand or triggered by a user). First I'll talk about batch-oriented jobs.

The company I worked for had a Tomcat-based product that exposed R via a RESTful API. It was similar to what you get from AzureML now, except it was on-premises. We would call out to it and configure it to restart R sessions if they crashed or timed out.
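The restart-on-crash-or-timeout policy can be sketched roughly as follows. This is not how the Tomcat product worked internally (I'm not describing its API here); it is a minimal illustration that assumes each job runs as a child process, for example via `Rscript`. The function name `run_with_restart` and its parameters are made up for the example.

```python
import subprocess

def run_with_restart(cmd, timeout_s, max_attempts=3):
    """Run cmd as a child process; on crash or timeout, start a fresh one and retry.

    Returns (attempt_number, stdout) for the first successful attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising, so the next
            # attempt starts from a clean process.
            continue
        if result.returncode == 0:
            return attempt, result.stdout
        # Non-zero exit code: treat as a crash and try again.
    raise RuntimeError(f"job failed after {max_attempts} attempts")
```

A call might look like `run_with_restart(["Rscript", "score.R"], timeout_s=300)`, where `score.R` is a hypothetical script name.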

Ideally we would isolate this server from the rest of the processing as much as possible. To be honest our server was pretty basic: it queued jobs (if needed) and managed R sessions if the server was configured to run multiple sessions. For serious failover we had a second server.

We did try to do as much as possible outside of R, such as data pipelining and ETL. That was done for the obvious reasons, but also because many customers had SQL and data people, but not R people. So if one of their data people understood the ETL, they could fix it without calling us.

Many customers would never let R connect to a database directly, so they'd have a separate process pull data and write it to disk. Then an R script would be triggered and would pick this data up.
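A minimal sketch of that handoff, assuming the extract is a CSV file and the R side is started as a command that takes the file path as its last argument. The names `hand_off`, `drop_dir`, `trigger_cmd` and `input.csv` are all hypothetical. Writing to a temporary file and renaming it is my own addition, so the R script can never pick up a half-written file.

```python
import csv
import os
import subprocess
import tempfile

def hand_off(rows, header, drop_dir, trigger_cmd):
    """Write rows to drop_dir/input.csv, then run trigger_cmd with that path appended."""
    os.makedirs(drop_dir, exist_ok=True)

    # Write to a temp file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=drop_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # Rename into place; os.replace is atomic within one filesystem.
    final_path = os.path.join(drop_dir, "input.csv")
    os.replace(tmp_path, final_path)

    # Trigger the consumer, e.g. ["Rscript", "score.R"] (hypothetical script).
    subprocess.run(list(trigger_cmd) + [final_path], check=True)
    return final_path
```

The process that pulls the data never needs to know anything about R beyond the command to run, which fits the "data people can fix the ETL themselves" setup.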

I never saw major crashing issues with R in production with batch-oriented jobs unless there was something unexpected with the size or type of data. Typically, as long as there was time between jobs, R's garbage collector would sort things out and be ready for the next job. Also, by the time something made it into production we'd hardened the script, frozen the CRAN package versions, etc., so a small problem wouldn't turn into a major one.

Streaming data presented its own adventure. To get data into/out of R as quickly as possible, you need to embed the REngine and talk to it via rJava. If we streamed data through R very quickly it would do fine for a while, then you'd see the memory usage go up and the time for each transaction start to vary greatly. Then it would crash.

The solution to this was multiple R sessions and a lot of telemetry. We would track how long each transaction took through R. As soon as we started seeing a lot of variance in the time, we'd restart the engine. By running multiple R sessions in round-robin we'd delay the onset of this instability, and it didn't matter when an individual R session needed to be restarted.
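Here is a rough sketch of that idea, not the actual implementation (which was Java talking to embedded R). It assumes an "engine" is any callable that processes one transaction, and that "a lot of variance" means the standard deviation over a rolling window exceeding a threshold. The class names, window size and threshold are illustrative assumptions.

```python
import itertools
import statistics
import time
from collections import deque

class MonitoredWorker:
    """Wraps one engine and restarts it when recent latencies spread too much."""

    def __init__(self, start_engine, window=50, max_stdev_s=0.05):
        self._start_engine = start_engine  # factory returning a fresh engine
        self._window = window
        self._max_stdev_s = max_stdev_s
        self.restarts = 0
        self._reset()

    def _reset(self):
        self.engine = self._start_engine()
        self.latencies = deque(maxlen=self._window)

    def record(self, seconds):
        """Record one latency; return True if it triggered a restart."""
        self.latencies.append(seconds)
        window_full = len(self.latencies) == self._window
        if window_full and statistics.stdev(self.latencies) > self._max_stdev_s:
            self._reset()
            self.restarts += 1
            return True
        return False

class RoundRobinPool:
    """Sends each transaction to the next worker in turn and times it."""

    def __init__(self, workers, clock=time.perf_counter):
        self._cycle = itertools.cycle(workers)
        self._clock = clock

    def call(self, payload):
        worker = next(self._cycle)
        start = self._clock()
        result = worker.engine(payload)
        worker.record(self._clock() - start)
        return result
```

In this sketch the restart happens inline; in a real service you'd presumably restart in the background while the other sessions keep taking traffic, which is why an individual restart doesn't hurt.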

Another trick we used was to cache data in an in-memory database, so if something crashed, the whole service would restart and pull from the in-memory database instead of trying to fetch old data from the server.