by cameronh90 821 days ago
I agree that RStudio isn't too awful, but the packaging management and reproducibility situation in R is dire, even compared to Python.

I have to deal with getting code from data scientists into production, and simply getting it to run outside of their mutant local environment can take days. Things are starting to get a bit better, with packrat initially and now renv/pak/rig and the like, but most DS haven't heard of them, and major breakages between minor library versions are still commonplace, as are undocumented system library dependencies. Then there is the whole stringsAsFactors nightmare, thankfully slowly on its way out but still around causing occasional catastrophic breakage.

There are lots of nice things about R, but it makes it very easy to shoot yourself in the foot.
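
For what it's worth, the core idea behind lockfile tools like renv is recording exact package versions alongside the code. A minimal Python sketch of that idea, using only the standard library. The function names are hypothetical, and this only illustrates pinning; it is not how renv or pip implement it:

```python
from importlib.metadata import distributions


def snapshot_versions():
    """Map each installed distribution name (lowercased) to its exact version."""
    pins = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken or missing metadata
            pins[name.lower()] = dist.version
    return pins


def to_requirements(pins):
    """Render pins as sorted 'name==version' lines, requirements.txt style."""
    return [f"{name}=={version}" for name, version in sorted(pins.items())]


if __name__ == "__main__":
    # Print the pinned versions of everything in the current environment.
    print("\n".join(to_requirements(snapshot_versions())))
```

Committing that output next to the analysis code at least tells whoever deploys it which versions the author actually ran, though it does nothing about system library dependencies.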

2 comments

Yeah, the package management situation is a big weak spot. There are some issues with renv, but it is usable. It definitely helps to keep a lid on the number of dependencies, and for God's sake never pull anything in from Bioconductor. IMO, new code should always prefer Tidyverse libs for basic stuff, and avoid relying on the ancient and warty standard library.

All that said, I still greatly prefer it over Python for DS work.

>I agree that RStudio isn't too awful, but the packaging management and reproducibility situation in R is dire, even compared to Python.

I've had exactly the opposite experience. For R, I download R and install it, and download RStudio and install it. Then when I need a new package I just install.packages("coolnewpackage") and it just works (TM). Occasionally I get info messages about packages being built in newer versions of R, and once a year or so I eventually get around to looking up how to use the updateR() function, but in five years of doing biostats in R I can't remember a single time I had a dependency issue.

Python, on the other hand, is a nightmare. Conda makes life a lot easier, but it is not easy to learn if you are not a software engineer (remember, R was made not just by statisticians, but for them as well). For many projects, my Python flow was something like...

Try creating a new conda env with the packages I think I need. Try starting the project, oops I don't have spyder-kernels installed. Oh, and my environment isn't compatible with it. How about just running it in VS Code? Well now I don't have my variable explorer. How about Jupyter? How do I get Jupyter to find my conda env again? Oh wait I need this other library, it's only on conda-forge, and then the conda environment solver fails. I guess I'll start from scratch with a new conda env, and maybe after several trial-and-error sessions of carefully composing the correct "conda create -n ..." incantation in a text editor before copy-pasting it to the command line, I might get the environment I need up and running, after conda finishes its 10-minute compatibility search and downloads 80 GB of python libraries.

And using conda is the easy way of doing it! Don't even get me started on pip and venv...
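
On the "how do I get Jupyter to find my conda env" question: one common approach, assuming ipykernel is installed inside the target environment, is to register that environment as a named kernel. A small sketch that builds the command; the helper names and the env name "myproject" are hypothetical:

```python
import subprocess
import sys


def kernel_install_command(env_name, python=sys.executable):
    """Build the command that registers an environment as a Jupyter kernel.

    `python` should be the interpreter that lives inside the target environment.
    """
    return [
        python, "-m", "ipykernel", "install",
        "--user",
        "--name", env_name,
        "--display-name", f"Python ({env_name})",
    ]


def register_kernel(env_name, python=sys.executable):
    """Run the registration; raises CalledProcessError if ipykernel is missing."""
    subprocess.run(kernel_install_command(env_name, python), check=True)


if __name__ == "__main__":
    # Show the command rather than running it, since ipykernel may not be installed.
    print(" ".join(kernel_install_command("myproject")))
```

Run from inside the activated env, this makes the env show up in Jupyter's kernel list without installing Jupyter itself into every environment.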

With R on Windows, you get some binary dependencies, but on Linux you need the system libraries for any package that uses an external library. R uses HTTP headers to determine which binary package to send you, and no roll-your-own package system (for virus scanning and the like) supports either the Conda contrib patterns or the R HTTP code binary scheme. I think Conda used to be kind of cool, but I have the same problems, and its position was always to make a ton of assumptions about what you want to do. R is like that... Sensible and automatic defaults that you can't find or aren't told about.
I have never needed anything more than pip in 8 years of development, and have always run into issues with R packages (every new version of R seems to break 30% of existing tidyverse packages).
Do you do much DS/ML in Python? I definitely agree that pip is totally fine otherwise.

At work, I've been giving out about pip to one of our DEs for a while, and when he needed to upgrade a bunch of DS packages he finally started coming around to my opinion.

Great summary of the situation. If you've ever been in the position of trying to explain to a bunch of R users why Python packaging is so much harder to deal with, you know the struggle. R/RStudio really makes it incredibly easy to get up and going for non-developers in a way that's probably hard to appreciate for many people on HN who are SWEs by trade.
Your own experience seems to disprove the claim that conda makes running analytical/numerical code easier in Python. Simple venv and pip really is the simpler choice.
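
For anyone who hasn't used it, a minimal sketch of the venv + pip flow, driven from Python's standard library. The helper names, the env directory, and "requirements.txt" are assumptions for illustration; `with_pip` defaults to off here so that creating the env needs no network access or ensurepip:

```python
import sys
import venv
from pathlib import Path


def create_env(env_dir, with_pip=False):
    """Create a virtual environment and return the path to its interpreter."""
    env_dir = Path(env_dir)
    venv.EnvBuilder(with_pip=with_pip, clear=True).create(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def pip_install_command(env_python, requirements="requirements.txt"):
    """Build the command that installs pinned packages into that environment."""
    return [str(env_python), "-m", "pip", "install", "-r", requirements]


if __name__ == "__main__":
    # Pass with_pip=True in real use so the install command below can run.
    python = create_env(".venv")
    print(" ".join(pip_install_command(python)))
```

Calling the environment's own interpreter with `-m pip` avoids the usual confusion about which pip is on the PATH.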