by rrjjww 820 days ago
As someone who learned most of my initial coding abilities through R and RStudio in a data science context, and since moved on to more “standard” languages and IDEs, I’ve yet to find anything that comes close to the flexibility and integration of RStudio for hacking together data analytics.

VS Code/Python has made some major improvements in the past couple years but it’s still very clunky compared to the ease of running R code line by line without having to start up a debug instance. And now with Copilot the most frustrating parts of R (such as remembering all the Tidyverse syntax) have been abstracted away.

12 comments

My partner does a lot of biostats in RStudio and I really think it breeds terrible habits. Instead of categorizing code by files, everything is shoved into massive files. Instead of running a file top-to-bottom, code is run out-of-order which makes the code organization and flow of a program a complete disaster.

There is something to be said about running and processing large CSVs and keeping that in memory while running other parts of the program as well as having clickable access to all the dataframes loaded into memory.

There's nothing about RStudio that encourages big single files or writing huge unstructured scripts. RStudio is a pretty good IDE, and R is a highly expressive functional-first [0] language. R was heavily influenced by Scheme, and has its own powerful metaprogramming [1] system - which is used to great effect in Tidyverse [2] libraries to make APIs that are nicer and more convenient than anything reasonably practical in Python.

The problem with a lot of end-user R code is that it is written by statisticians, not programmers. They'd write the same garbage and huge scripts in Python (trust me, I know).

[0] http://adv-r.had.co.nz/Functional-programming.html

[1] https://adv-r.hadley.nz/metaprogramming.html

[2] https://www.tidyverse.org/

I agree that RStudio isn't too awful, but the packaging management and reproducibility situation in R is dire, even compared to Python.

I have to deal with getting code from data scientists into production, and simply getting it to run outside of their mutant local environment can take days. Things are starting to get a bit better with packrat initially and now renv/pak/rig and the like, but most DS haven't heard of them, and major breakages between minor library versions are still commonplace, as are undocumented system library dependencies. Then there is the whole stringsAsFactors nightmare, thankfully slowly on its way out but still around causing occasional catastrophic breakage.

There are lots of nice things about R, but it makes it very easy to shoot yourself in the foot.

Yeah, the package management situation is a big weak spot. There are some issues with renv, but it is usable. It definitely helps to keep a lid on the number of dependencies, and for God's sake never pull anything in from Bioconductor. IMO, new code should always prefer Tidyverse libs for basic stuff, and avoid relying on the ancient and warty standard library.

All that said, I still greatly prefer it over Python for DS work.

>I agree that RStudio isn't too awful, but the packaging management and reproducibility situation in R is dire, even compared to Python.

I've had exactly the opposite experience. For R, I download R and install it, and download RStudio and install it. Then when I need a new package I just install.packages("coolnewpackage") and it just works (TM). Occasionally I get info messages about packages being built in newer versions of R, and once a year or so I eventually get around to looking up how to use the updateR() function, but in five years of doing biostats in R I can't remember a single time I had a dependency issue.

Python, on the other hand, is a nightmare. Conda makes life a lot easier, but it is not easy to learn if you are not a software engineer (remember, R was made not just by statisticians, but for them as well). For many projects, my Python flow was something like...

Try creating a new conda env with the packages I think I need. Try starting the project, oops I don't have spyder-kernels installed. Oh, and my environment isn't compatible with it. How about just running it in VScode? Well now I don't have my variable explorer. How about Jupyter? How do I get Jupyter to find my conda env again? Oh wait I need this other library it's only on conda-forge, and then the conda environment solver fails. I guess I'll start from scratch with a new conda env, and maybe after several trial-and-error sessions of carefully composing the correct "conda create -n ..." incantation in a text editor before copy-pasting them to the command line, I might get the environment I need up and running, after conda finishes its 10-minute compatibility search and downloads 80 GB of python libraries.

And using conda is the easy way of doing it! Don't even get me started on pip and venv...

With R on Windows, you get some binary dependencies, but on Linux you need the system libraries for any package that uses an external library. R uses the HTTP headers to determine which binary package to send you, and no roll-your-own package system for virus scanning and the like supports either the Conda contrib patterns or the R HTTP code binary scheme. I think Conda used to be kind of cool, but I have the same problems, and its position was always to make a ton of assumptions about what you want to do. R is like that... Sensible and automatic defaults that you can't find or aren't told about.
I have never needed anything more than pip in 8 years of development, and have always run into issues with R packages (every new version of R seems to break 30% of existing tidyverse packages).
Do you do much DS/ML in Python? I definitely agree that pip is totally fine otherwise.

At work, I've been giving out about pip to one of our DEs for a while, and when he needed to upgrade a bunch of DS packages he finally started coming around to my opinion.

Great summary of the situation. If you've ever been in the position of trying to explain to a bunch of R users why Python packaging is so much harder to deal with, you know the struggle. R/RStudio really makes it incredibly easy to get up and going for non-developers in a way that's probably hard to appreciate for many people on HN who are SWEs by trade.
Your own experience seems to disprove the claim that conda makes running analytical/numerical code easier in Python. Simple venv and pip really is the simpler choice.
I think a lot of the problem is that R does everything it can to prevent people from writing modular code.

It doesn't have modules or namespaces, and the current fashion is for packages to use non-standard evaluation, which adds friction to users writing their own functions.

R does have namespaces. Take a look at the NAMESPACE file found at the root of every R package, which defines the symbols and methods exported by the package.

Note for many R packages, the NAMESPACE file is autogenerated from roxygen docs: https://cran.r-project.org/web/packages/roxygen2/vignettes/n...

> which defines the symbols and methods exported by the package

Which are all dumped into the one single global namespace regardless of whether you want everything or not.

I can't remember the exact number, but tidyverse package imports literally thousands of things into your global namespace on package load, coupled with any other dependencies and you have a hell of a time figuring out where any function or constant came from.

Calling library() is kind of an antipattern in production R code. You can either call namespaced functions (like say dplyr::mutate()), or use roxygen.

https://roxygen2.r-lib.org/articles/namespace.html
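For readers more at home in Python, here is a rough analogue of the trade-off being argued about, not a description of how R resolves names. Wildcard imports stand in for stacked library() calls, and module-qualified calls stand in for dplyr::mutate(). The `session` dict is a made-up stand-in for a global workspace.

```python
import cmath
import math

# Wildcard imports, a rough analogue of stacking library() calls:
# a later import silently replaces names brought in by an earlier one.
session = {}
exec("from math import *", session)
exec("from cmath import *", session)

print(session["sqrt"] is cmath.sqrt)  # True: math's sqrt was overwritten
print(session["sqrt"].__module__)     # cmath

# Qualified calls, a rough analogue of dplyr::mutate():
# the origin of each function is visible at the call site.
print(math.sqrt(4))    # 2.0
print(cmath.sqrt(-4))  # 2j
```

With qualified names, nobody has to guess which package a function came from, at the cost of a little more typing.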

> it is written by statisticians, not programmers. They'd write the same garbage in Python

I guess I should take offense as a statistician. But it's a fairly common complaint. The reality is, most of us statisticians are trying to compute a result. Like once. Or sometimes twice. For a paper. Or a task. If someone comes to me with a time series and asks me to test it for stationarity, or find the p lags to make it MA(p) stationary, they aren't asking me to write a program. The goal is not reproducibility. The goal is a fast answer. I've used R at trading desks & financial institutions - the goal has seldom been "run the same program again, but with this new input". If that was the case, I would write a function & stick it in a nice library with documentation. But these aren't tech firms. We aren't shipping software. The goal is to compute something fast so you can get on with life & make the trade, or draft the next paragraph in your paper, or... Like if they give me a set of bespoke mortgages with some hairy constraints & ask me to compute the value at risk, there is not much point in building some VaR function. Because it's a once in a while thing. Next time it will involve a different set of args & there'd be different constraints & so forth. So just write some 10 line script & get the number & move on. Yeah, sometimes I would stash the script in some repo & write a 1-line comment on how it works - but it's kinda pointless, it doesn't get much play/reuse. We aren't programmers in that sense, we are just trying to solve problems.

My kid knocked on my office door yesterday. He's in some AoPS course where they use generating functions to count stuff. So he had a problem about the number of ways to add three odd numbers to make 1001. He had worked out the algebra & gotten some number, but before he hits Submit, he wants to doublecheck with me because wrong answers have a penalty. Now, I don't have the time to go back to school and learn what a generating function is. And I don't want to write lots of for loops & if statements & fight with syntax errors & so forth. So my 1-liner in R

dim(subset(expand.grid(a=seq(1,1001,2), b=seq(1,1001,2), c=seq(1,1001,2)), a+b+c==1001))

tells me there are 125250 ways. He says he got the same number with generating functions. Boom done! So that's what R is for. Quick & easy.
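As a cross-check on that number, here is a minimal Python sketch that counts the same thing two ways. It assumes what the R one-liner assumes: ordered triples of positive odd numbers. The variable names are made up for illustration.

```python
from math import comb

TARGET = 1001

# Brute force: pick odd a and b; c = TARGET - a - b is then automatically odd,
# so the triple counts whenever c is at least 1.
odds = range(1, TARGET + 1, 2)
brute = sum(
    1
    for a in odds
    for b in odds
    if TARGET - a - b >= 1
)

# Closed form: writing a = 2i+1, b = 2j+1, c = 2k+1 gives i + j + k = 499,
# and stars and bars counts C(499 + 2, 2) non-negative solutions.
closed = comb((TARGET - 3) // 2 + 2, 2)

print(brute, closed)  # 125250 125250
```

Fixing c from a and b keeps the search at about 250,000 pairs instead of materialising the full 501^3 grid that expand.grid builds.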

I have been an R "user" for a while now, after reading your single line approach to the problem I am reminded of the saying which goes something like this "An idiot admires complexity, A genius admires simplicity!". Perfectly splendid!
> Instead of categorizing code by files, everything is shoved into massive files.

That's not really RStudio's fault. It is just how many people use R and were taught.

> code is run out-of-order which makes the code organization and flow of a program a complete disaster.

In my experience, with R Markdown, this is untrue. I see Jupyter Notebooks with cells run out of order much more often.

I have done a lot in R Markdown, and the project I'm currently working on has me mostly working in Databricks notebooks (which are very similar to Jupyter notebooks). My execution gets out of order a lot more often in Databricks.
This is the de facto standard way of operating, as I understand it, which is mostly just hacking at stuff in small chunks until it sort of works and leaving comments throughout it with "run this bit on Tuesdays only".

I recently had to inherit someone's R stuff and I had to learn R and fix it all. It now runs from a makefile repeatably.

Anyway it could be worse. It could be Minitab.

> Instead of running a file top-to-bottom, code is run out-of-order which makes the code organization and flow of a program a complete disaster.

That's more a REPL issue than specific to a particular language. It's the tradeoff you make. I write my R programs in Geany and then run the whole thing using Rscript. That gives me a clean environment on every run.
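The same tradeoff can be sketched in Python, for comparison. A long-lived session remembers whatever earlier snippets defined, while a fresh process per run (roughly what Rscript gives you for R) starts empty. The snippet strings and helper names below are hypothetical.

```python
import subprocess
import sys

# Hypothetical snippets a user might run interactively, in any order.
DEFINE = "threshold = 10"
USE = "print(threshold * 2)"

def run_in_session(session, snippet):
    """Run a snippet in a long-lived namespace, as a REPL would."""
    exec(snippet, session)

def run_fresh(snippet):
    """Run a snippet in a brand-new interpreter process."""
    return subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
    )

# Interactive style: USE only works because DEFINE ran earlier in the session.
session = {}
run_in_session(session, DEFINE)
run_in_session(session, USE)  # prints 20

# Fresh-process style: the same snippet fails, exposing the hidden dependency.
result = run_fresh(USE)
print(result.returncode != 0)        # True
print("NameError" in result.stderr)  # True
```

Running the whole script from the top in a new process is what turns "it worked in my session" into something repeatable.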

Emacs + ESS? Way more flexible. Maybe less integration because many of the big R package devs work for Posit. RStudio has a lot of superfluous junk in the UI I just don't need or care about.
I've used ESS for the past few years and recently tried using RStudio when I'm on Windows. For my purposes, which is just a little industrial statistics on the side, they are remarkably similar. I feel right at home in either!
I agree - I teach statistics at a university and there is really no alternative to RStudio for working with R. This is especially true considering that the vast majority of folk using R (in my field) have no prior programming experience. Downloading R, VS Code, downloading some R plugin, getting them to talk to each other, and only then starting to learn R - isn't very straightforward. It's also remarkably consistent on different operating systems - something to consider when half the students are on Windows, half on macOS...
RStudio Server on a Digital Ocean instance made my life a lot easier. Students fire up a browser, log in, and they're using R with all the packages. It was horrible when students ran R on their own machines back in the old days. Most of the questions I got were tech support rather than related to the material. And these days it has good Python support too.
This works out of the box in VSCode?

Just open a .py file, then select the snippet of code you want to run and cmd+enter

It will open a new REPL for you (using your selected interpreter) the first time, and after that all commands are run in that same one.

RStudio is just way better at choosing what code to send (if you only send the line the cursor rests on you’re gonna have a bad time; VS Code is a bit better than that but not great). Also, where do your plots get drawn when you use this? RStudio just works in this regard.
It looks like, as far as I can tell, VS Code doesn't support the interactive window for working in R, which was a bit of a surprise to me when I looked it up.

The python interactive window has pretty much fully replaced my use of jupyter, since it gives you notebook-style output without the annoyance of the notebook format. My usual workflow is highlighting lines of code and shift-enter to execute (there's also a cells syntax).
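For anyone who hasn't seen the cells syntax: the VS Code Python extension treats `# %%` comment lines in a plain .py file as cell boundaries, and each cell can be sent to the interactive window on its own. A minimal sketch, with made-up data standing in for a real CSV:

```python
# %% Load data (hypothetical values standing in for a real CSV)
measurements = [2.0, 3.0, 4.0, 7.0]

# %% Summarise
mean = sum(measurements) / len(measurements)
print(f"mean = {mean:.2f}")  # mean = 4.00

# %% Inspect the largest value
largest = max(measurements)
print(largest)  # 7.0
```

The file is still an ordinary script, so it also runs top to bottom with a plain `python` invocation and diffs cleanly in version control.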

I'm surprised by this because it _is_ possible to use R in Jupyter (although I never really liked the experience, RStudio was far superior).

?

Yes it does.

I'm specifically referring to: https://code.visualstudio.com/docs/python/jupyter-support-py

The support for R looks a bit different (to me at least?): https://code.visualstudio.com/docs/languages/r

In the screenshot the window on the right does not look comparable to the output in a jupyter notebook. It looks more like a standard terminal. e.g. does it support interactive charts, html tables etc?

The Python interactive window uses the ipykernel package to allow rich outputs like that.

I still might be wrong and would like to be corrected on this, since it would mean R support in VS Code is now better than I thought (I haven't tried it for a while).

I use R in a Jupyter notebook in VS Code via IRkernel. It's a gem.
Oh - nice, thanks - so it looks like the interactive window (which is effectively the same as the output in a jupyter notebook) is also possible, but not (yet) 'properly'/'officially' supported

https://github.com/REditorSupport/vscode-R/issues/1412

Please supply references for the audience.
An alternative in the Python world that is definitely worth looking into is the JupyterLab Desktop app, which is a standalone installer that is cross-platform and works great for beginners (no command line needed): https://github.com/jupyterlab/jupyterlab-desktop?tab=readme-...

See my other comment in the main thread with more info.

> I’ve yet to find anything that comes close to the flexibility and integration of RStudio for hacking together data analytics.

Is there a good demo or video you can point to that shows this? I have no experience with R, RStudio, or data science, but you've piqued my interest.

Any of David Robinson's (or anyone else's) Tidy Tuesday videos.

https://www.youtube.com/@safe4democracy/featured

If you work with Python, Spyder comes really, really close and is way better than jupyter
jupyter
Jupyter (ipynb) notebooks in VS Code.
cat, grep, sort and awk come pretty close :)
Came here to share that same experience. RStudio truly made me feel "close" to the data.