I used to work in the consulting arm of a software firm and we wrote and deployed R code in production at many Fortune 500 companies. We worked in almost every industry.
I spent quite a bit of time refactoring bad R code so it could run reliably in a production environment. There is a ton of bad R code out there that barely works for exploratory analysis, let alone a production environment.
So yes, R is used in production environments in a lot of places.
Did you guys separate out the R process (or multiple processes?) from the rest of the transaction-processing / other server infrastructure or embed the REngine (which sounds like a bad idea to me; incorrect data serialization can easily crash the whole process)?
What is a stable way to connect (and reconnect!) to R, assuming it was a separate process? I would think that an indirect communication path, such as Server <--> Database <--> R would work best, but I'd love to hear your battle hardened take on it.
We used separate workflows depending on if the data was streaming or batch oriented (e.g. on-demand or triggered by a user). First I'll talk about batch oriented jobs.
The company I worked for had a Tomcat-based product that exposed R via a RESTful API. It was similar to what you get from AzureML now, except it was on-premise. So basically we would call out to this and configure it to restart R sessions if they crashed or timed out.
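To make the "restart on crash or timeout" idea concrete, here is a minimal sketch. It is not the product described above: it is written in Python for illustration, and `run_with_restart`, its parameters and the retry policy are all hypothetical. It assumes each job is one command-line invocation (for example `Rscript` with a script of your own).

```python
import subprocess


def run_with_restart(cmd, timeout_s, max_attempts=3):
    """Run cmd in a fresh process; retry if it crashes or exceeds timeout_s.

    Hypothetical helper: the retry count and error messages are assumptions.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # check=True turns a non-zero exit code into CalledProcessError;
            # timeout kills the child and raises TimeoutExpired.
            done = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=timeout_s, check=True)
            return done.stdout
        except subprocess.TimeoutExpired as exc:
            last_error = f"attempt {attempt}: timed out after {exc.timeout}s"
        except subprocess.CalledProcessError as exc:
            last_error = f"attempt {attempt}: exited with code {exc.returncode}"
    raise RuntimeError(last_error)
```

A call might look like `run_with_restart(["Rscript", "score.R"], timeout_s=600)`, where `score.R` is a placeholder name. Every attempt starts a new process, so a wedged session never carries state into the next try.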
In an ideal situation we would isolate this server from the rest of the processing as much as possible. To be honest our server was pretty basic - it basically served to queue jobs (if needed) and manage RSessions if the server was configured to run multiple sessions. For serious failover we had a second server.
We did try to do as much as possible outside of R, such as data pipelining and ETL. That was done for the obvious reasons, but also because many customers had SQL and data people, but not R people. So if one of their data people understood the data ETL, they could fix it without calling us.
Many customers would never let R connect to a database directly. So they'd have a separate process pull data and write it to disk. Then an R script would be triggered and would pick this data up.
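A rough sketch of that disk hand-off, in Python for illustration only. The function names, the CSV format and the `.tmp` convention are assumptions, not what any customer actually ran. The point it shows is that the producer writes to a temporary name and renames at the end, so the consumer never picks up a half-written file.

```python
import csv
import os
import tempfile
from pathlib import Path


def write_extract(rows, header, drop_dir, name):
    """Producer side: write to a .tmp file, then rename it into place."""
    drop_dir = Path(drop_dir)
    drop_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=drop_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    final_path = drop_dir / name
    # os.replace is atomic when source and target share a filesystem.
    os.replace(tmp_path, final_path)
    return final_path


def pick_up_extracts(drop_dir):
    """Consumer side: read every finished .csv; in-progress .tmp files are skipped."""
    results = {}
    for path in sorted(Path(drop_dir).glob("*.csv")):
        with open(path, newline="") as fh:
            results[path.name] = list(csv.DictReader(fh))
    return results
```

The consumer here stands in for the triggered script. In the setup described above that role would be played by R reading the same directory.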
I never saw major crashing issues with R in production with batch oriented jobs unless there was something unexpected with the size or type of data. Typically as long as there was time between jobs, R's garbage collector would sort things out and be ready for the next job. Also by the time something made it into production we'd hardened the script, frozen the CRAN package versions, etc. So some small issue wouldn't cause a major issue.
Streaming data presented its own adventure. To get data into/out of R as quickly as possible, you need to embed the REngine and talk to it via rJava. If we streamed data through R very quickly it would do fine for a while - then you'd see the memory usage go up and the time for each transaction started to vary greatly. Then it would crash.
The solution to this was multiple Rsessions and a lot of telemetry. We would track how long each transaction took through R. As soon as we started seeing a lot of variance in the time we'd restart the engine. By running the multiple Rsessions in round-robin we'd delay the onset of this instability, and it didn't matter when R sessions needed to be restarted.
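The round-robin plus telemetry idea could be sketched like this. It is a Python illustration under stated assumptions, not the original rJava setup: `MonitoredSession`, `RoundRobinPool`, the window size and the standard-deviation threshold are all hypothetical, and a "session" is just any callable built by a factory.

```python
import itertools
import statistics
import time
from collections import deque


class MonitoredSession:
    """Wraps one worker session and tracks its recent per-call latencies."""

    def __init__(self, factory, window=50, max_stdev=0.05):
        self._factory = factory      # hypothetical: returns a fresh session callable
        self._window = window        # number of recent calls to look at (assumed)
        self._max_stdev = max_stdev  # restart threshold in seconds (assumed)
        self.restarts = 0
        self._start()

    def _start(self):
        self.session = self._factory()
        self.latencies = deque(maxlen=self._window)

    def record(self, seconds):
        """Store one latency; restart once a full window is too spread out."""
        self.latencies.append(seconds)
        if (len(self.latencies) == self._window
                and statistics.stdev(self.latencies) > self._max_stdev):
            self.restarts += 1
            self._start()


class RoundRobinPool:
    """Hands each call to the next session in turn and times it."""

    def __init__(self, factory, size=3, clock=time.perf_counter, **kwargs):
        self.sessions = [MonitoredSession(factory, **kwargs) for _ in range(size)]
        self._cycle = itertools.cycle(self.sessions)
        self._clock = clock

    def call(self, payload):
        monitored = next(self._cycle)
        started = self._clock()
        result = monitored.session(payload)
        monitored.record(self._clock() - started)
        return result
```

Because calls rotate across sessions, restarting one of them only removes a fraction of capacity for a moment. The restart trigger here is the spread of recent latencies, not their mean, which matches the "variance in the time" signal described above.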
Another trick we used was to cache data in an in-memory database so if something crashed the whole service would restart and pull from the in-memory database instead of trying to fetch old data from the server.
Thanks, this is all quite useful! I faced crashes with REngine + rJava, too, and thought of a DB as an intermediary, but your in-memory DB idea adds an interesting twist that adds performance, too.
R and its libraries are GPL licensed. Is there some corporate license available to prevent companies from being required to publish proprietary code that interacts with R? Or was the usage limited to internal systems?
Thanks to certain popular technologies like Hadoop, a lot of big companies have their legal teams looking at open source licenses as alternatives to the big vendors like IBM. Using R and CRAN is getting easier because of this.
A lot of customers we worked with only provided outputs to external parties via reports, extracts, dashboards, etc. I don't recall a situation where an external person could run an R script (e.g. some of the companies I worked for provided their customers with BI reports). Don't ask me about the legality of that - even if I had an answer I wouldn't say it.
We used to run into all sorts of annoying issues with regard to licensing. For example, I worked at a customer where their scientists were blocked from downloading stuff from CRAN in an ad-hoc way (e.g. install.packages()). And nobody from our team was allowed to send them packages due to fear that they'd blame us for any issues with packages or package licensing.
The end result was a convoluted process for installing R, upgrading R, or anything to do with packages. During one project I was involved in a ridiculously long winded email chain discussing licensing on a particular library, with the lawyers acting like I had some sort of insight into the mind of the library author. That's the kind of resistance some organizations face when thinking about open-source tools.
What makes R code good for production is basically the same as what makes code in any language good for production. Use functions, local variables, tests, check for nulls, type issues, etc.
I wouldn't say having loops is always a bad thing. Sometimes writing loops is the only way to solve a particular problem, and loops can be easier to read and debug. Sometimes people say use the apply family of functions instead of loops, but my experience is that in many cases apply will not give you any significant speedup over a loop. I use apply because it's easier to write cleaner code with better flow than loops, not because I expect an automatic speedup.
However, if there are loops to do everything, that's a sign of bad R code. For example, if you are using a loop to add numbers in a vector together, that's bad code. That needs to be fixed.
A lot of R is also written for exploratory analysis. So it's written without much thought to structure, scope, flow, or much of anything. It's basically like a first draft of a paper. Making this code production ready should not just be putting that code in a function - you need to step back and architect it properly.
There's also a practical matter of how fast it needs to be. I've been involved in projects where a loop based R script was run in batch once a day at 1AM. And the run time for the script was 20 minutes. If we vectorized it, maybe it would run in <1 minute. But why bother if it's run once a day?
I have 20k lines of (my own) R code running in production (used intensively by a salesforce of up to 20 people who price bonds with it) and it's an unmitigated nightmare to manage. Slow as crazy. No threading to manage concurrency so constant batch jobs everywhere. Memory hog. On Windows (this is finance), unfortunately, fairly frequent crashes. No real time feeds due to the horrible architecture of the interpreter. That said, beautiful charts!
Just Say No. It'll sap your mojo. Am moving the whole thing to a blend of C, Python, and a distributed computing framework (thinking of Flink or Concord.io).
R is used in production at EA, Activision, Ebay, Trulia, Google, Microsoft and many, many more. Those are just the ones I've seen give talks about scoring >1TBs regularly with R.
Every time somebody says R can't be used for large data sets or is slow, I ask for more details and almost universally the programmer's complete lack of initiative is the weak link.
That definitely sounds like bad coders. However, I would say that someone who comes from a more classic programming language background will write bad and slow R code by default, especially if they are pressured into delivering fast and don't have time to search for or learn the best solution. I was amazed how often you can solve something with one or two lines in R, and those 2 lines will have so much better performance, readability, maintainability and reliability than something you would do without thinking. But you have to know those 2 lines and which libraries to use etc. R actually is an extremely elegant language and probably the most productive language if you know what you are doing; however, it's not very beginner friendly (especially coming from other languages).
R just does not have robust software engineering tools for anything that even begins to resemble scale and anybody who says otherwise is denying reality. R can certainly be used in production but the skeleton framework cannot be R. In my experience it has to be RPC only, with all the structure built in something else. R is intrinsically single user / batch with maybe a shared database, but say goodbye to anything that even starts to approach real time or is multi-node dependent.
In my experience the only people who insist that R is robust for production, inevitably have a vested interest. Any objective programmer can see its greatness but also its glaring flaws.
Riiight. Everybody else is a bad engineer and you are the good one. With the single threaded R code...
edit: The comment above has been extended quite a bit. Initially it was a single (abrasive) sentence. I still stand by my answer however. Somebody who did not turn on multi-threading does not get to criticize R. It is the first thing you learn in any book about R. You have to be almost actively avoiding learning about it. It's in every 3rd blog post and SO question.
perhaps you might not have started your own comment with the erroneous view that 'bad coders' are to blame when R proves to be deficient at extra-design tasks.
Oh I further note your R consulting vocation. There you go. Vested interest.
I have experience scoring ~ 1TB daily. And a lot of smaller data sets spanning a few hundred gigs.
It's not "hyper performant". Obviously doing things in Scala or C++ will be faster. However, rewriting the models would take months and an entirely different set of skills. That means separate people.
But if somebody says that they use Python instead of R for the speed... that's just bull. For example one of the fundamental building blocks, pandas is slower than the counterpart in R.
this is not software engineering or production. It is batch jobs / exploratory analysis. It requires little or no structure apart from the analysis itself.
also in anything that has not been coded in C directly underneath, Python is 20x faster and C is 500× faster. R is literally the slowest mainstream language today by a long shot. That's a key consideration for production.
Where did you get those numbers from? They are most definitely wrong unless you don't vectorize your code and run loops all around. A lot of R is actually written in C so you can squeeze really good performance if you know what you are doing. I would recommend reading Hadley's Advanced R and profiling your code, I think you might be pleasantly surprised.
In that particular case, I used Vertica, which loads data into R really, really fast, and I straight up used a very big machine.
That's not how I approach it most of the time though. I mostly use out-of-memory algorithms, sometimes open source, sometimes Revolution's (now Microsoft). They process things in chunks. You can see biglm and speedglm for quick examples. h2o is also a very popular platform. You should probably check the High Performance Computing CRAN Task View.
I have also used Netezza and Hana and both worked well for the purpose. There's also Teradata Aster but I don't have experience with it. There's also the open-source MonetDB, which has in-database R threads and also an R package similar to RSQLite.
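To illustrate what "process things in chunks" means, here is a toy sketch in Python. It is not how biglm, speedglm or the Revolution tools are implemented (they use more numerically careful updates); `fit_line_in_chunks` is a hypothetical name. It fits a one-predictor line by keeping only running sums, so the full data set never has to sit in memory.

```python
def fit_line_in_chunks(chunks):
    """Fit y = intercept + slope * x from an iterable of chunks of (x, y) pairs.

    Only five running totals are kept, regardless of how many rows stream by.
    """
    n = sx = sy = sxx = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    # Closed-form ordinary least squares for a single predictor.
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return intercept, slope
```

`chunks` can be a generator that reads a file or a database cursor a few thousand rows at a time. The same accumulate-then-solve shape is what lets chunked model fitting scale past RAM.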
I have, thanks for asking. I must admit that I have a very real priority on (soft) real time. Flink appears attractive but I also have a slight bias to non JVM which is where Concord appears interesting. I also just love Concord's "hot" DAG capability. Agreed (I think) though that I must include Spark micro-batching as a potential candidate. Any experience you have on this...I welcome links/tips. As you can probably tell I am at the very initial exploratory stage on stack choice.
You're not wrong and absolutely totally wrong at the same time. R is the furthest thing from C you could find in paradigm, syntax and performance, but yes much of the underlying code is C or Fortran.
But really you're missing the point. R's purpose is interactive, exploratory and scientific computing and that's what it is incredibly good at. It wasn't intended for high performance computing, but there are ways of getting it there. Look out for Rho in the future.
> It seems that once you figure out a good model in R, it's almost always rewritten into either Scala or Java for real production work.
I wouldn't say even 1% of programs written in R need that speed. I personally use it for small projects (besides a few Spark side projects) and I am outputting reports.
I really would like someone to show an actual example of this happening in 2016.
I do it at my company. I prototype in R, and then end up having to rewrite chunks of it in Python so it can be worked into our application, which right now is exclusively Python.
It's not a matter of performance, it's just because it would be an enormous amount of engineering overhead to start calling R from inside the Python app
It has nothing to do with interoperability on my machine. I use notebooks (and Pandas) all the time, and I consider myself fluent in both R and Python.
It's because R is a substantial engineering dependency. As I said, our entire stack is Python and Node. Yes, you can call R from Python using Rpy2, but that's a pro-bono project maintained largely by one person. It's great for casual use, but there is far too much risk to start talking about building critical business code around it.
That is true. I actually started my journey with Pandas and then switched to R for the ecosystem, and because zero-based indexing for data science drove me nuts.
But I do feel that the goal is a clone.
"Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R." http://pandas.pydata.org/