

	by vasaulys 3556 days ago
	Does anybody use R in production services or just for exploratory work? It seems that once you figure out a good model in R, it's almost always rewritten into either Scala or Java for real production work.

5 comments

apohn 3556 days ago

I used to work in the consulting arm of a software firm and we wrote and deployed R code in production at many Fortune 500 companies. We worked in almost every industry.

I spent quite a bit of time refactoring bad R code so it could run reliably in a production environment. There is a ton of bad R code out there that barely works for exploratory analysis, let alone a production environment.

So yes, R is used in production environments in a lot of places.

vijucat 3555 days ago

Did you guys separate out the R process (or multiple processes?) from the rest of the transaction-processing / other server infrastructure or embed the REngine (which sounds like a bad idea to me; incorrect data serialization can easily crash the whole process)?

What is a stable way to connect (and reconnect!) to R, assuming it was a separate process? I would think that an indirect communication path, such as Server <--> Database <--> R would work best, but I'd love to hear your battle hardened take on it.

apohn 3555 days ago

We used separate workflows depending on whether the data was streaming or batch oriented (e.g. on-demand or triggered by a user). First I'll talk about batch oriented jobs.

The company I worked for had a Tomcat-based product that exposed R via a RESTful API. It was similar to what you get from AzureML now, except it was on-premise. So basically we would call out to this and configure it to restart R sessions if they crashed or timed out.

In an ideal situation we would isolate this server from the rest of the processing as much as possible. To be honest our server was pretty basic - it basically served to queue jobs (if needed) and manage RSessions if the server was configured to run multiple sessions. For serious failover we had a second server.

We did try to do as much as possible outside of R, such as data pipelining and ETL. That was done for the obvious reasons, but also because many customers had SQL and data people, but not R people. So if one of their data people understood the data ETL, they could fix it without calling us.

Many customers would never let R connect to a database directly. So they'd have a separate process pull data and write it to disk. Then an R script would be triggered and would pick this data up.
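A minimal sketch of that file handoff, written in Python for the extracting side. The function names, the CSV format, and the write-then-rename step are assumptions for illustration, not something described in the comment above; the only R-side detail used is that `Rscript` runs a script file with trailing arguments.

```python
import csv
import os
import tempfile


def write_extract(rows, header, dest_path):
    """Write rows to dest_path so a reader never sees a half-written file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    with os.fdopen(fd, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    # The rename is atomic on POSIX when both paths share a filesystem.
    os.replace(tmp_path, dest_path)
    return dest_path


def rscript_command(script_path, data_path):
    """Build the command that triggers the (hypothetical) R script on the extract."""
    return ["Rscript", script_path, data_path]
```

The command list could then be handed to `subprocess.run(cmd, check=True, timeout=...)`, so that a crashed or hung R job surfaces as an error in the calling process instead of failing silently.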

I never saw major crashing issues with R in production with batch oriented jobs unless there was something unexpected with the size or type of data. Typically as long as there was time between jobs, R's garbage collector would sort things out and be ready for the next job. Also by the time something made it into production we'd hardened the script, frozen the CRAN package versions, etc. So some small issue wouldn't cause a major issue.

Streaming data presented its own adventure. To get data into/out of R as quickly as possible, you need to embed the REngine and talk to it via rJava. If we streamed data through R very quickly it would do fine for a while - then you'd see the memory usage go up and the time for each transaction started to vary greatly. Then it would crash.

The solution to this was multiple Rsessions and a lot of telemetry. We would track how long each transaction took through R. As soon as we started seeing a lot of variance in the time we'd restart the engine. By running the multiple Rsessions in round-robin we'd delay the onset of this instability, and it didn't matter when R sessions needed to be restarted.
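A rough sketch of that pattern in Python, assuming a hypothetical `Session` class standing in for one R session. The window size, the minimum sample count, and the use of standard deviation as the "variance" signal are all illustrative choices, not details from the setup described above.

```python
import statistics
from collections import deque
from itertools import cycle


class Session:
    """Stand-in for one R session (hypothetical; no real R process is started)."""

    def __init__(self, name):
        self.name = name
        self.restarts = 0
        self.latencies = deque(maxlen=50)  # most recent per-transaction timings

    def restart(self):
        # A real system would kill and relaunch the R engine here.
        self.restarts += 1
        self.latencies.clear()


class RoundRobinPool:
    """Hands out sessions in turn and restarts any whose timings become erratic."""

    def __init__(self, sessions, max_stdev, min_samples=5):
        self.sessions = list(sessions)
        self._turns = cycle(self.sessions)
        self.max_stdev = max_stdev
        self.min_samples = min_samples

    def pick(self):
        """Return the next session in round-robin order."""
        return next(self._turns)

    def record(self, session, seconds):
        """Store one transaction time and restart the session if timings spread out."""
        session.latencies.append(seconds)
        if len(session.latencies) < self.min_samples:
            return
        if statistics.stdev(session.latencies) > self.max_stdev:
            session.restart()
```

Because the other sessions keep taking turns, a restart of one session only removes a slice of capacity for a moment, which matches the point that it "didn't matter" when a session needed restarting.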

Another trick we used was to cache data in an in-memory database so if something crashed the whole service would restart and pull from the in-memory database instead of trying to fetch old data from the server.

vijucat 3551 days ago

Thanks, this is all quite useful! I faced crashes with REngine + rJava, too, and thought of a DB as an intermediary, but your in-memory DB idea adds an interesting twist that adds performance, too.

sandGorgon 3555 days ago

I have the same question - how do you use R in production?

0x001E84EE 3552 days ago

R and its libraries are GPL licensed. Is there some corporate license available to prevent companies from being required to publish proprietary code that interacts with R? Or was the usage limited to internal systems?

apohn 3550 days ago

Thanks to certain popular technologies like Hadoop, a lot of big companies have their legal teams looking at open source licenses as an alternative to the big vendors like IBM. Using R and CRAN is getting easier because of this.

A lot of customers we worked with only provided outputs to external parties via reports, extracts, dashboards, etc. I don't recall a situation where an external person could run an R script (e.g. some of the companies I worked for provided their customers with BI reports). Don't ask me about the legality of that - even if I had an answer I wouldn't say it.

We used to run into all sorts of annoying issues with regards to licensing. For example, I worked at a customer where their scientists were blocked from downloading stuff from CRAN in an ad-hoc way (e.g. install.packages()). And nobody from our team was allowed to send them packages due to fear that they'd blame us for any issues with packages or package licensing.

The end result was a convoluted process for installing R, upgrading R, or anything to do with packages. During one project I was involved in a ridiculously long-winded email chain discussing licensing on a particular library, with the lawyers acting like I had some sort of insight into the mind of the library author. That's the kind of resistance some organizations face when thinking about open-source tools.

ginger_beer_m 3555 days ago

What are the hallmarks of bad R code to watch out for and avoid?

apohn 3550 days ago

What makes R code good for production is basically the same as what makes code in any language good for production. Use functions, local variables, tests, check for nulls, type issues, etc.

I wouldn't say having loops is always a bad thing. Sometimes writing loops is the only way to solve a particular problem, and loops can be easier to read and debug. Sometimes people say use the apply family of functions instead of loops, but my experience is that in many cases apply will not give you any significant speedup over a loop. I use apply because it's easier to write cleaner code with better flow than loops, not because I expect an automatic speedup.

However, if there are loops to do everything, that's a sign of bad R code. For example, if you are using a loop to add numbers in a vector together, that's bad code. That needs to be fixed.
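To make that example concrete, here is a small sketch in Python rather than R (in R the second function would simply be `sum(x)`). The function names are made up for illustration.

```python
def total_with_loop(values):
    """The pattern to avoid: hand-rolled accumulation of a vector."""
    total = 0
    for value in values:
        total += value
    return total


def total_builtin(values):
    """The idiomatic version: let the built-in do the whole reduction."""
    return sum(values)
```

Both return the same result; the second is shorter, harder to get wrong, and leaves the iteration to code that is already optimized.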

A lot of R is also written for exploratory analysis. So it's written without much thought to structure, scope, flow, or much of anything. It's basically like a first draft of a paper. Making this code production ready should not just be putting that code in a function - you need to step back and architect it properly.

There's also a practical matter of how fast it needs to be. I've been involved in projects where a loop based R script was run in batch once a day at 1AM. And the run time for the script was 20 minutes. If we vectorized it, maybe it would run in <1 minute. But why bother if it's run once a day?

ignasl 3555 days ago

For the majority of use cases R code should be vectorized, so I would say if you see a loop in the code that's a red flag and you should check it out.

vegabook 3556 days ago

I have 20k lines of (my own) R code running in production (used intensively by a salesforce of up to 20 people who price bonds with it) and it's an unmitigated nightmare to manage. Slow as crazy. No threading to manage concurrency so constant batch jobs everywhere. Memory hog. On Windows (this is finance), unfortunately fairly frequent crashes. No real time feeds due to the horrible architecture of the interpreter. That said, beautiful charts!

Just Say No. It'll sap your mojo. Am moving the whole thing to a blend of C, Python, and a distributed computing framework (thinking of Flink or Concord.io).

blahi 3556 days ago

That sounds like bad coders, not that R is bad.

Evidenced by:

>No threading to manage concurrency

R is used in production at EA, Activision, eBay, Trulia, Google, Microsoft and many, many more. Those are just the ones I've seen give talks about scoring >1 TB regularly with R.

Every time somebody says R can't be used for large data sets or is slow, I ask for more details and almost universally the programmer's complete lack of initiative is the weak link.

ignasl 3555 days ago

That definitely sounds like bad coders. However, I would say that if someone comes from another, more classic programming language background, he will write bad, slow R code by default. Especially if he is pressured into delivering fast and doesn't have time to search for or learn the best solution. I was amazed how often you can solve something with one or two lines in R, and those two lines will have so much better performance, readability, maintainability and reliability than something you would do without thinking. But you have to know those two lines, which libraries to use, etc. R is actually an extremely elegant language, and probably the most productive language if you know what you are doing; however, it's not very beginner friendly (especially coming from other languages).

vegabook 3556 days ago

R just does not have robust software engineering tools for anything that even begins to resemble scale, and anybody who says otherwise is denying reality. R can certainly be used in production but the skeleton framework cannot be R. In my experience it should be reached via RPC only, with all the structure built in something else. R is intrinsically single user / batch with maybe a shared database, but say goodbye to anything that even starts to approach real time, or is multi-node dependent. In my experience the only people who insist that R is robust for production inevitably have a vested interest. Any objective programmer can see its greatness but also its glaring flaws.

blahi 3556 days ago

Riiight. Everybody else is a bad engineer and you are the good one. With the single threaded R code...

edit: The comment above has been extended quite a bit. Initially it was a single (abrasive) sentence. I still stand by my answer however. Somebody who did not turn on multi-threading does not get to criticize R. It is the first thing you learn in any book about R. You have to be almost actively avoiding learning about it. It's in every 3rd blog post and SO question.

vegabook 3555 days ago

perhaps you might not have started your own comment with the erroneous view that 'bad coders' are to blame when R proves to be deficient at extra-design tasks.

Oh I further note your R consulting vocation. There you go. Vested interest.

BTW, I love R. But my love is not blind.

kgwgk 3556 days ago

Excel is used in production very widely, but I'm sure we all agree it has its limitations.

nerdponx 3556 days ago

Do you have personal experience with this kind of hyper-performant R code?

blahi 3556 days ago

I have experience scoring ~ 1TB daily. And a lot of smaller data sets spanning a few hundred gigs.

It's not "hyper performant". Obviously doing things in scala or C++ will be faster. However rewriting the models would take months and an entirely different set of skills. That means separate people.

But if somebody says that they use Python instead of R for the speed... that's just bull. For example, one of the fundamental building blocks, pandas, is slower than its counterpart in R.

vegabook 3555 days ago

This is not software engineering or production. It is batch jobs / exploratory analysis. It requires little or no structure apart from the analysis itself.

Also, in anything that has not been coded in C directly underneath, Python is 20x faster and C is 500x faster. R is literally the slowest mainstream language today by a long shot. That's a key consideration for production.

ignasl 3555 days ago

Where did you get those numbers from? They are most definitely wrong unless you don't vectorize your code and run loops all around. A lot of R is actually written in C so you can squeeze really good performance if you know what you are doing. I would recommend reading Hadley's Advanced R and profile your code, I think you might be pleasantly surprised.

sandGorgon 3555 days ago

Could you talk about some of the learnings you had around scoring 1 TB daily in R?

How do you even load the data into memory? Is it read from a database or S3 files?

blahi 3555 days ago

In that particular case, I used Vertica, which loads data into R really, really fast, and I simply used a very big machine.

That's not how I approach it most of the time though. I mostly use out-of-memory algorithms, sometimes open source, sometimes Revolution's (now Microsoft). They process things in chunks. You can see biglm and speedglm for quick examples. h2o is also a very popular platform. You should probably check the High Performance Computing CRAN Task View.

I have also used Netezza and HANA and both worked well for the purpose. There's also Teradata Aster but I don't have experience with it. There's also the open-source MonetDB, which has in-database R threads and also an R package similar to RSQLite.
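To illustrate the "process things in chunks" idea, here is a simplified Python sketch that fits a one-predictor least-squares line while only ever holding one chunk in memory. It keeps running sums and solves the closed-form equations at the end; this is a stand-in for the general approach, not how any of the packages named above are actually implemented, and the function name is made up.

```python
def fit_line_in_chunks(chunks):
    """Fit y = intercept + slope * x from an iterable of chunks of (x, y) pairs.

    Only running totals are kept, so memory use does not grow with the data.
    """
    n = 0
    sum_x = sum_y = sum_xx = sum_xy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sum_x += x
            sum_y += y
            sum_xx += x * x
            sum_xy += x * y
    denominator = n * sum_xx - sum_x * sum_x
    if denominator == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sum_xy - sum_x * sum_y) / denominator
    intercept = (sum_y - slope * sum_x) / n
    return intercept, slope
```

In practice the chunks would come from a generator that reads a file or a database cursor piece by piece, so the full data set never has to fit in RAM.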

There are also map/reduce packages for Hadoop.

sandGorgon 3555 days ago

Have you considered Spark instead of Flink/Concord?

vegabook 3555 days ago

I have, thanks for asking. I must admit that I have a very real priority on (soft) real time. Flink appears attractive but I also have a slight bias to non JVM which is where Concord appears interesting. I also just love Concord's "hot" DAG capability. Agreed (I think) though that I must include Spark micro-batching as a potential candidate. Any experience you have on this...I welcome links/tips. As you can probably tell I am at the very initial exploratory stage on stack choice.

md2be 3555 days ago

R is just S which is just C

dandermotj 3555 days ago

You're not wrong and absolutely totally wrong at the same time. R is the furthest thing from C you could find in paradigm, syntax and performance, but yes much of the underlying code is C or Fortran.

But really you're missing the point. R's purpose is interactive, exploratory and scientific computing and that's what it is incredibly good at. It wasn't intended for high performance computing, but there are ways of getting it there. Look out for Rho in the future.

vegabook 3555 days ago

So well put. But what is Rho? Intrigued...

dandermotj 3553 days ago

https://github.com/rho-devel/rho

baldfat 3556 days ago

> It seems that once you figure out a good model in R, it's almost always rewritten into either Scala or Java for real production work.

I wouldn't say even 1% of programs written in R need that speed. I personally use it for small projects (besides a few Spark side projects) and I am outputting reports.

I really would like someone to show an actual example of this happening in 2016.

nerdponx 3556 days ago

I do it at my company. I prototype in R, and then end up having to rewrite chunks of it in Python so it can be worked into our application, which right now is exclusively Python.

It's not a matter of performance; it's just that it would be an enormous amount of engineering overhead to start calling R from inside the Python app.

RA_Fisher 3556 days ago

Check out opencpu.org, it's an R web API. Really cool stuff.

baldfat 3556 days ago

That seems like you could simply use http://jupyter.org/ and just run the script with R code inline.

http://blog.revolutionanalytics.com/2016/01/pipelining-r-pyt...

Also, why not just switch to Pandas? It really is a pretty close R clone.

nerdponx 3556 days ago

It has nothing to do with interoperability on my machine. I use notebooks (and Pandas) all the time, and I consider myself fluent in both R and Python.

It's because R is a substantial engineering dependency. As I said, our entire stack is Python and Node. Yes, you can call R from Python using rpy2, but that's a pro-bono project maintained largely by one person. It's great for casual use, but there is far too much risk to start talking about building critical business code around it.

baldfat 3556 days ago

So why not Pandas?

nerdponx 3556 days ago

Personal preference. I switch back-and-forth based on the project.

R data frames are native and feel native. Pandas data frames are non-native and can be a pain in the ass to work with.

That, and there is a lot more to the decision than just which data frame implementation I like better.

kgwgk 3556 days ago

"Pretty close" as long as you stay within the region of common functionality. I wouldn't say it's a clone.

baldfat 3555 days ago

That is true. I actually started my journey with Pandas and then switched to R for the ecosystem, and because zero-based indexing for data science drove me nuts.

But I do feel that the goal is a clone.

"Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like R." http://pandas.pydata.org/

blahi 3556 days ago

How much experience do you have in statistical computing, out of curiosity?

0x001E84EE 3556 days ago

Part of that may stem from R and most (all?) of its libraries being licensed under GPL.

nerdponx 3556 days ago

Afaik Bloomberg uses it extensively for internal data visualization tools.

madenine 3556 days ago

Doesn't Bloomberg have a custom, in-house R IDE?