by hadley 840 days ago
If you tell me what makes R hard to integrate into data pipelines I will do my best to fix it :)
7 comments

Wow, I really appreciate the reply. As I said in another comment here, I wish tidyverse was big when I was using R.

I was an R user from about 2003-2010.

We didn't have dplyr at the time, though ggplot2 was coming around about then, I think. That helped a lot for easy-to-develop visualizations.

But in our specific cases, the distributed libraries we used were written in python and integrated well with native python code. Pandas was just coming out around 2010, I think, and I think multi threading was also an issue then, but I can't really remember.

So our issue was partly that our infrastructure tooling was moving to python, but we also had a far easier time hiring people who were proficient in python than finding the same for R.

And once you start writing more code in python it starts to become harder to justify two separate code bases that can do the same thing so the R code got phased out and rewritten in python so we could have a single code base and not have to duplicate functionality in two languages.

Also a slight push for python came from the programmers who thought python represented a better language to know for their careers. Looking back, it does seem like python is used more often these days in general.

So I guess there isn't much you could have done in this case.

And as a side note, thanks for all the work you've done with R!!

A few of the main issues I see, as an R user who built his company on python:

- when we wanted to build a web app that processes data, it was a lot more straightforward to build both in python, so we can process data within the web servers instead of having to manage multiple stages of infrastructure and different languages. There's no Django for R.

- R will often do something instead of explicitly failing. This is the wrong tradeoff when running a production system, because if you're returning the wrong results to users you may not realize it unless there's an error

- R reproducible builds are worse than python. That's saying something because python is a pretty low bar. But running production systems you can't have builds suddenly fail week over week because one of a hundred packages was updated

> one of a hundred packages was updated

There's renv that addresses that point already: https://rstudio.github.io/renv/articles/renv.html

> There's no Django for R.

Nowadays you can integrate R with WebR (WASM) in a web app: https://docs.r-wasm.org/webr/latest/

A lighter-weight alternative to renv is to use Posit Public Package Manager (https://packagemanager.posit.co/) with a pinned date. That doesn't help if you're installing packages from a mix of places, but if you're only using CRAN packages it lets you get everything as of a fixed date.

And of course on the web side you have shiny (https://shiny.posit.co), which now also comes in a python flavour.

shiny is nice for one-off data dashboards and single-purpose mini-apps. I see the python equivalents are like dash/plotly. Shiny is not a full fledged web framework, and isn't a viable replacement for e.g. Django.

Aside -- we tried using dash in our production app and then had to remove it after a month, because these types of frameworks that spit out front-end code are almost never flexible enough to do what you actually need to do in a full app context, and you end up doing more work to fight the framework versus the time-savings from the initial prototype.

I'd highly encourage you to look into shiny more. No, it's not django, but it's a much richer framework than dash, and you can always bring your own HTML if what it generates for you isn't sufficient.

I'm not arguing that dash is better than shiny -- I think shiny is probably better!

But the fact that there's no Django for R means shiny's a dead-end for a production web app.

> R will often do something instead of explicitly failing.

I mentioned exception handling above, but this is more specifically the problem.

I think it's a hard problem to solve, because the behaviour of older libraries is so varied.

I have sometimes thought that something like a try catch wrapper which pattern matched or tested the value returned would be useful.
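To make that idea concrete, here is a minimal sketch of such a wrapper, written in Python only because it is short to express there. It is not an existing R or Python API: `checked`, `SuspiciousResultError`, and `legacy_mean` are all hypothetical names made up for illustration. The wrapper calls the function, tests the returned value, and raises if the value looks wrong.

```python
from functools import wraps


class SuspiciousResultError(ValueError):
    """Raised when a call succeeds but its return value fails the check."""


def checked(is_valid, describe="result failed validation"):
    """Wrap a function so a bad return value raises instead of passing silently."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not is_valid(result):
                raise SuspiciousResultError(
                    f"{func.__name__}: {describe} (got {result!r})"
                )
            return result
        return wrapper
    return decorator


# Hypothetical legacy function: returns None on empty input instead of raising.
def legacy_mean(values):
    return sum(values) / len(values) if values else None


# Same function, but a None result now becomes an explicit error.
safe_mean = checked(lambda r: r is not None, "returned None")(legacy_mean)
```

The check is just a predicate, so the same wrapper could test length, type, or range of the result, depending on how the older function misbehaves.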

I have noodled on this problem a bit in https://github.com/hadley/strict, which I'm contemplating bringing back to life over the coming year. It's certainly very difficult to cover 100% of all possible problems, but I suspect we can get good coverage of the most common failure points (specifically around recycling and coercion) with a decent amount of work.

OK, since you're here!

(this all prefaced with a massive thank you for tidyverse, without which R is very crusty).

I love R for interactive work and quick analyses, but I'm currently trying to integrate various bits of R code into a large document-building pipeline and wishing I could use Python for it:

- Exception handling and error processing seem a pain in R. Maybe I'm doing it wrong, but it feels like a mess and not nearly as ergonomic as python. tryCatch seems to have gotchas related to scope because the error handling is in a function. The distinction between warning, stop etc seems odd. The option to stop on warnings isn't useful because older packages seem to abuse warnings as messages. I have just discovered `safely` which is helpful, but then you have to unwrap lists in pipelines which feels clunky.

- Related, I _really_ wish we could just drop model objects or other tibbles as single objects directly into a tibble cell rather than as list(df). Unpacking lists and checking that the objects inside them exist is much more of a pain (e.g. can't just do `filter(!is.na(df_col))`)

- I really miss defaultdict from python, and dictionaries generally.

- Passing variable names as strings to dynamically generate things seems clunky compared with python. Again, it may be because I'm doing it wrong, but I end up having to wrap things in !!sym the whole time and the NSE semantics seem hard to remember (I only use R about 20% of the time). I liked cur_data() for passing a df row to a function but this now seems deprecated.

- String formatting -- f-strings are just great. Glue is OK, but escaping special characters seems more tricksy. Jinjar is OK, not quite jinja.

- purrr is nice, but furrr just isn't a drop-in replacement. Making http requests in parallel seems non-trivial compared to doing it with python. Is there an easy way to do it without creating multiple processes? Why can't I just do something like `. %>% mutate_parallel(response=GET(url), workers=10) %>% ...`?
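For reference, the Python side of the defaultdict, f-string, and parallel-request bullets can be sketched in a few lines using only the standard library. This is an illustration under assumptions, not anyone's production code: `fetch` is a hypothetical stand-in that fakes a response so the example runs offline, where real code would make an HTTP call. The requests run on threads inside one process.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a real HTTP request; returns a fake response."""
    return {"url": url, "status": 404 if url.endswith("/missing") else 200}


def fetch_all(urls, workers=10):
    """Run fetch() over urls on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/missing",
]
responses = fetch_all(urls, workers=3)

# defaultdict: a missing key starts as an empty list, so no existence check.
by_status = defaultdict(list)
for response in responses:
    by_status[response["status"]].append(response["url"])

# f-string formatting of the grouped counts.
summary = ", ".join(
    f"{status}: {len(found)}" for status, found in sorted(by_status.items())
)
```

Threads suit this case because the work is waiting on the network rather than computing, which is why no extra processes are needed.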

Amen to that. Can I add the following:

- 5 different ways to do wide to long and long to wide over the years even in the tidyverse.

- A lot of dependencies to connect to DBs and difficult programs. Rstudio/Posit does have some premium libraries but they should be made free and bundled with the tidyverse to really promote the ecosystem.

- Shiny support to save interactive charts and tables. This is a massive problem for me. If I have a heavily stylized HTML table with a bunch of css, I need to rely on webshot, webshot2 which are both alpha or beta versions and they are poorly documented. How can I evangelize R if my deployments cannot be used properly by my community?

What are the premium packages you're talking about? As far as I know all of our R packages are 100% open source.

I'd love to hear more about why you're using webshot etc to take screenshots of your shiny app. A more typical workflow would be to generate a separate HTML/PDF with quarto/RMarkdown.

Thanks for responding and your amazing work with the tidyverse. I am the "R-guy" in my finservices company and we have a paid rconnect dev/qa/prod and rserver pro licences for a few hundred users.

The packages I think are the dependencies of some DB connectivity libraries. https://www.rstudio.com/tags/databases/ - these are the ones I was referring to.

Re webshot my use case is: I have a heavily modified DT table in a shiny app. Users log in, play around with the DT table, update ggplots etc and then download the snapshot and send it to a WORD file. I can't move away from word and use html or pdf because we need the word file formatted by editors for publication and they need to follow the corpo guidelines. So, I am having to use webshot to grab a screenshot of the tagged html instead of natively handling it. I tried using officedown and a few other methods and it just didn't work.

ps: I hope the rebrand goes great and I am rooting for you.

Oh, you mean the pro drivers? Unfortunately we can't give those away because we have to pay several $100k a year just to get access for our customers. Most of the pro drivers do have equivalent open source versions that you should be able to use instead.

Hmmm, I'd still try generating the table with quarto (since you can output word documents), or try gt (https://gt.rstudio.com), which I know has much greater control over output, and supports RTF output (https://gt.rstudio.com/reference/as_rtf.html) which should import cleanly into word.

PDF in knitr is tied to TeX. Webshot and other capture is better because CSS styles work without translation to TeX.

> The distinction between warning, stop etc seems odd. The option to stop on warnings isn't useful because older packages seem to abuse warnings as messages.

Use suppressWarnings() to silence misbehaving functions or withCallingHandlers() to stop or handle specific conditions.

> Passing variable names as strings to dynamically generate things seems clunky compared with python.

Can you give me an elegant example in Python? Because I don't understand what you want to generate dynamically.
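Not the original commenter's code, but a minimal sketch of what is probably meant: in Python a column name is just a string, so it can come from a variable, a loop, or a config file with no quoting step. The helper `mean_by` and the sample rows are hypothetical, and plain dicts are used instead of pandas to keep the example self-contained.

```python
from collections import defaultdict


def mean_by(rows, group_col, value_col):
    """Average value_col within each level of group_col; both are plain strings."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        acc = totals[row[group_col]]
        acc[0] += row[value_col]
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


rows = [
    {"species": "a", "mass": 2.0, "height": 10.0},
    {"species": "a", "mass": 4.0, "height": 14.0},
    {"species": "b", "mass": 5.0, "height": 20.0},
]

# The columns to summarise are chosen at run time, e.g. read from a config file.
report = {col: mean_by(rows, "species", col) for col in ["mass", "height"]}
```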

That said, I dislike the tidyverse solution as well. Too much abstraction for not enough benefit over a base solution with substitute()

For the most common cases, the tidyverse now only requires {{ }}. This allows you to tell tidyeval functions that you have the name of a df-var stored in an env-var. Do you have specific cases that you find frustrating?

(1) The big problem I have is transitioning from RStudio to a pipeline (so I end up not using RStudio). A traditional pipeline is going to be a script with some set of arguments -- parameter values, fitting functions, and data file names, that I put into a shell script and say:

my_plot_script.R --plot_col=g_max --output_type=pub_quality data_file1 data_file2 data_file3

It's possible to use optparse/OptionParser() to get that information (but you have an option for every argument, no --param1 X --param2 Y file1 file2 file3) but it is much more difficult to fit those arguments into the RStudio environment. I want RStudio to be able to emulate reading command line arguments (since they do not exist in RStudio). Right now, I have to check to see if there are commandArgs(), and, if not, do something else to get the information to the RStudio script.
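For comparison, the pattern being asked for looks roughly like this in Python's argparse, shown only as a sketch of the desired behaviour and not as an R solution. Options and positional file names mix freely, and the same parser accepts an explicit argument list, so an interactive session can emulate a command line. The option names mirror the invocation above; the default values are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse options plus positional data files.

    argv=None reads the real command line (pipeline use); pass a list of
    strings to emulate a command line from an interactive session.
    """
    parser = argparse.ArgumentParser(prog="my_plot_script")
    parser.add_argument("--plot_col", default="g_max")       # assumed default
    parser.add_argument("--output_type", default="draft")    # assumed default
    parser.add_argument("data_files", nargs="+")
    return parser.parse_args(argv)


# Interactive emulation of the shell invocation above.
args = parse_args([
    "--plot_col=g_max",
    "--output_type=pub_quality",
    "data_file1", "data_file2", "data_file3",
])
```

The script body then reads `args.plot_col` and `args.data_files` the same way in both settings, so no separate code path is needed for interactive use.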

(2) There needs to be an option that says STOP if something doesn't make sense. I have dozens of beautiful data plots that look great, but do not in fact plot what I think they do, because factors have not been properly assigned to colors, shapes, or linetypes. (And it can be really hard to recognize that the data has not been plotted properly.) Give me an option that says, if I did not explicitly declare a column a factor, and I did not specifically associate colors/shapes/lines with factors, then the data will not be plotted.

(1) You might want to check out https://github.com/t-kalinowski/Rapp by my colleague Tomasz

(2) I think part of that is in scope for strict (https://github.com/hadley/strict). You might also be well served by adopting some more data validation tooling, e.g. pointblank (https://rstudio.github.io/pointblank/).

On point two, can’t you just use stopifnot(condition)? Then log it etc?

Hey Hadley!! Personally, the only issue for me with integrating R is making renv play nicely in multistage docker builds. I found that I need to have my other pipeline software built in the same stage as my R env setup (building a specific version from archive, system dependencies, then R package dependencies via renv).

It’s been more than a few years since I worked in an R shop. While I loved wrangling and plotting data in the tidyverse, I did find the dependency management story in R to be even worse than Python.

Maybe that’s the problem?

This guy is the man to ask ^^^^^^