by mjhay 820 days ago
There's nothing about RStudio that encourages big single files or writing huge unstructured scripts. RStudio is a pretty good IDE, and R is a highly expressive functional-first [0] language. R was heavily influenced by Scheme, and has its own powerful metaprogramming [1] system - which is used to great effect in Tidyverse [2] libraries to make APIs that are nicer and more convenient than anything reasonably practical in Python.

The problem with a lot of end-user R code is that it is written by statisticians, not programmers. They'd write the same garbage and huge scripts in Python (trust me, I know).

[0] http://adv-r.had.co.nz/Functional-programming.html

[1] https://adv-r.hadley.nz/metaprogramming.html

[2] https://www.tidyverse.org/


I agree that RStudio isn't too awful, but the packaging management and reproducibility situation in R is dire, even compared to Python.

I have to deal with getting code from data scientists into production, and simply getting it to run outside of their mutant local environment can take days. Things are starting to get a bit better with packrat initially and now renv/pak/rig and the like, but most DS haven't heard of them, and major breakages between minor library versions are still commonplace, as are undocumented system library dependencies. Then there is the whole stringsAsFactors nightmare, thankfully slowly on its way out but still around causing occasional catastrophic breakage.

There are lots of nice things about R, but it makes it very easy to shoot yourself in the foot.

Yeah, the package management situation is a big weak spot. There are some issues with renv, but it is usable. It definitely helps to keep a lid on the number of dependencies, and for God's sake never pull anything in from Bioconductor. IMO, new code should always prefer Tidyverse libs for basic stuff, and avoid relying on the ancient and warty standard library.

All that said, I still greatly prefer it over Python for DS work.

>I agree that RStudio isn't too awful, but the packaging management and reproducibility situation in R is dire, even compared to Python.

I've had exactly the opposite experience. For R, I download R and install it, and download RStudio and install it. Then when I need a new package I just install.packages("coolnewpackage") and it just works (TM). Occasionally I get info messages about packages being built in newer versions of R, and once a year or so I eventually get around to looking up how to use the updateR() function, but in five years of doing biostats in R I can't remember a single time I had a dependency issue.

Python, on the other hand, is a nightmare. Conda makes life a lot easier, but it is not easy to learn if you are not a software engineer (remember, R was made not just by statisticians, but for them as well). For many projects, my Python flow was something like...

Try creating a new conda env with the packages I think I need. Try starting the project, oops I don't have spyder-kernels installed. Oh, and my environment isn't compatible with it. How about just running it in VS Code? Well now I don't have my variable explorer. How about Jupyter? How do I get Jupyter to find my conda env again? Oh wait, I need this other library and it's only on conda-forge, and then the conda environment solver fails. I guess I'll start from scratch with a new conda env, and maybe after several trial-and-error sessions of carefully composing the correct "conda create -n ..." incantation in a text editor before copy-pasting it to the command line, I might get the environment I need up and running, after conda finishes its 10-minute compatibility search and downloads 80 GB of Python libraries.

And using conda is the easy way of doing it! Don't even get me started on pip and venv...

With R on Windows, you get some binary dependencies, but on Linux you need the system libraries for any package that uses an external library. R uses the HTTP headers to determine which binary package to send you, and no roll-your-own package system for virus scanning and the like supports either the Conda contrib patterns or the R HTTP code binary scheme. I think Conda used to be kind of cool, but I have the same problems, and its position was always to make a ton of assumptions about what you want to do. R is like that... Sensible and automatic defaults that you can't find or aren't told about.

I have never needed anything more than pip in 8 years of development, and have always run into issues with R packages (every new version of R seems to break 30% of existing tidyverse packages).

Do you do much DS/ML in Python? I definitely agree that pip is totally fine otherwise.

At work, I've been giving out about pip to one of our DEs for a while, and when he needed to upgrade a bunch of DS packages he finally started coming around to my opinion.

Great summary of the situation. If you've ever been in the position of trying to explain to a bunch of R users why Python packaging is so much harder to deal with, you know the struggle. R/RStudio really makes it incredibly easy to get up and going for non-developers in a way that's probably hard to appreciate for many people on HN who are SWEs by trade.

Your own experience seems to disprove the claim that conda makes running analytical/numerical code easier in Python. Simple venv and pip really is the simpler choice.

I think a lot of the problem is that R does everything it can to prevent people from writing modular code.

It doesn't have modules or namespaces, and the current fashion is for packages to use non-standard evaluation, which adds friction to users writing their own functions.

R does have namespaces. Take a look at the NAMESPACE file found at the root of every R package, which defines the symbols and methods exported by the package.

Note for many R packages, the NAMESPACE file is autogenerated from roxygen docs: https://cran.r-project.org/web/packages/roxygen2/vignettes/n...

> which defines the symbols and methods exported by the package

Which are all dumped into the one single global namespace regardless of whether you want everything or not.

I can't remember the exact number, but the tidyverse package imports literally thousands of things into your global namespace on package load; couple that with any other dependencies and you have a hell of a time figuring out where any function or constant came from.

Calling library() is kind of an antipattern in production R code. You can either call namespaced functions (like say dplyr::mutate()), or use roxygen.

https://roxygen2.r-lib.org/articles/namespace.html

Agreed but the GP isn't wrong. It's much much nicer to import a library with an alias in Python.
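For readers less familiar with the Python side of this comparison, a minimal sketch of the aliasing pattern follows. It uses only standard-library modules; the alias names (`st`, `m`) and the sample data are arbitrary choices for illustration, not anything taken from the thread.

```python
# Illustrative only: module aliases keep every imported name behind a short
# prefix, so the origin of each function is visible at the call site.
import math as m
import statistics as st

samples = [2.0, 4.0, 4.0, 5.0]  # made-up data for the example

mean_value = st.mean(samples)    # clearly comes from the statistics module
root_value = m.sqrt(mean_value)  # clearly comes from the math module

print(mean_value, root_value)
```

The closest R habit mentioned in this thread is calling functions with an explicit package prefix, such as dplyr::mutate(), rather than attaching everything with library().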
> it is written by statisticians, not programmers. They'd write the same garbage in Python

I guess I should take offense as a statistician. But it's a fairly common complaint. The reality is, most of us statisticians are trying to compute a result. Like once. Or sometimes twice. For a paper. Or a task. If someone comes to me with a time series and asks me to test it for stationarity, or find the p lags to make it MA(p) stationary, they aren't asking me to write a program. The goal is not reproducibility. The goal is a fast answer. I've used R at trading desks & financial institutions - the goal has seldom been "run the same program again, but with this new input". If that was the case, I would write a function & stick it in a nice library with documentation. But these aren't tech firms. We aren't shipping software. The goal is to compute something fast so you can get on with life & make the trade, or draft the next paragraph in your paper, or... Like if they give me a set of bespoke mortgages with some hairy constraints & ask me to compute the value at risk, there is not much point in building some VaR function. Because it's a once-in-a-while thing. Next time it will involve a different set of args & there'd be different constraints & so forth. So just write some 10-line script & get the number & move on. Yeah, sometimes I would stash the script in some repo & write a 1-line comment on how it works - but it's kinda pointless, it doesn't get much play/reuse. We aren't programmers in that sense, we are just trying to solve problems.

My kid knocked on my office door yesterday. He's in some AoPS course where they use generating functions to count stuff. So he had a problem about the number of ways to add three odd numbers to make 1001. He had worked out the algebra & gotten some number, but before he hits Submit, he wants to double-check with me because wrong answers have a penalty. Now, I don't have the time to go back to school and learn what a generating function is. And I don't want to write lots of for loops & if statements & fight with syntax errors & so forth. So my 1-liner in R

dim(subset(expand.grid(a=seq(1,1001,2), b=seq(1,1001,2), c=seq(1,1001,2)), a+b+c==1001))

tells me there are 125250 ways. He says he got the same number with generating functions. Boom done! So that's what R is for. Quick & easy.
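As a cross-check that avoids materialising the full three-way grid, here is a rough sketch in Python (not the commenter's method). It assumes the same reading of the problem as the one-liner: ordered triples of positive odd numbers summing to 1001. The function name `count_odd_triples` is made up for this example.

```python
from math import comb


def count_odd_triples(total: int) -> int:
    """Count ordered triples (a, b, c) of positive odd integers with a + b + c == total."""
    count = 0
    for a in range(1, total + 1, 2):
        for b in range(1, total + 1, 2):
            # c is fully determined by a and b; it only has to be a positive odd number.
            c = total - a - b
            if c >= 1 and c % 2 == 1:
                count += 1
    return count


brute_force = count_odd_triples(1001)

# Closed form: write a = 2x + 1, b = 2y + 1, c = 2z + 1 with x, y, z >= 0.
# Then x + y + z = (1001 - 3) / 2 = 499, which has C(499 + 2, 2) solutions
# by the usual stars-and-bars argument.
closed_form = comb((1001 - 3) // 2 + 2, 2)

print(brute_force, closed_form)
```

Both values should agree with the 125250 reported above, while looping over only about 250,000 (a, b) pairs instead of building a grid of roughly 125 million rows.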

I have been an R "user" for a while now, and after reading your single-line approach to the problem I am reminded of the saying that goes something like this: "An idiot admires complexity, a genius admires simplicity!" Perfectly splendid!