by kuzehanka 2614 days ago
I don't think there's any reason to learn R for anyone who is already proficient at programming. Despite being proficient with R, the only times I used it in the last two years were for ggplot. And even for data vis, I'm increasingly using Python and JS.

There's a bunch of comments below which can be summed up with 'use R because <package name> doesn't have a direct python equivalent' but they're all missing the point that the Python data science ecosystem is evolving at a much faster pace than R and will completely supersede it in a few years.

R, like SAS, is a tool for non-programmers. And there it shall remain. The only demographic where R makes sense long term are pure mathematicians/statisticians who are not proficient in programming. But that demographic is rapidly declining in size.

3 comments

> There's a bunch of comments below which can be summed up with 'use R because <package name> doesn't have a direct python equivalent' but they're all missing the point that the Python data science ecosystem is evolving at a much faster pace than R and will completely supersede it in a few years.

The point is that R is a very good language for statistics because of the packages, not for data science. Data science can do its own thing, and that's okay. It's also okay for data science to use statistical models from statistics.

> R, like SAS, is a tool for non-programmers.

I respect and love data science and machine learning, but this kind of generalization is terrible. There are many wonderful programmers who contribute to R and use R, just as I am sure there are many wonderful statisticians who use Python. They're just tools.

> And there it shall remain. The only demographic where R makes sense long term are pure mathematicians/statisticians who are not proficient in programming. But that demographic is rapidly declining in size.

What is up with these generalizations? R is not going anywhere in the statistics community. It's doing fine. Also, in my experience in academia, most math people use Matlab and little, if any, R.

It's okay to have both R and Python doing their thing.

There is no need to conflate data science and statistics or have this weird tribalism.

Everything you said sums up to 'R is a very good language for statistics because of the packages', which is pretty much in agreement with the GP comment.

R has nothing going for it except a rapidly dwindling number of packages that don't yet have a direct python equivalent. It doesn't make sense to invest time into R if one already knows python unless one specifically focusing on academia pure stats type stuff.

Even then, the incoming generation of undergrads are increasingly proficient with programming and are shying away from R the same way that they shied away from Matlab after scipy matched it for 95% of their tasks.

> R has nothing going for it except a rapidly dwindling number of packages that don't yet have a direct python equivalent.

This is not a true statement.

Here are the data that go against this statement.

1. https://www.r-bloggers.com/on-the-growth-of-cran-packages/
2. https://blog.revolutionanalytics.com/2017/01/cran-10000.html
3. https://www.r-bloggers.com/rs-remarkable-growth/

From 2015 to 2016: from ~6,200 packages to more than 8,000 in April 2016.

From 2016 to 2017: CRAN now has 10,000 R packages.

> Even then, the incoming generation of undergrads are increasingly proficient with programming and are shying away from R the same way that they shied away from Matlab after scipy matched it for 95% of their tasks.

This is a generalization.

So far you've made opinionated negative generalizations with no data.

Python is great because it learned from Matlab and took many great ideas and inspirations from it. But I'm not going to make sweeping negative statements about Matlab or pretend to know how it is doing when I don't have enough data or experience with it.

"It doesn't make sense to invest time into R if one already knows python unless one specifically focusing on academia pure stats type stuff."

Ha! I knew it! So it is familiarity with Python then!

R does have something else going for it: phenomenal documentation and consistency. Replicating R's thousands of available libraries will be a gargantuan effort. It is cheaper and more efficient to master R.

Tidyverse is not just some "<package name>" -- it's an entire workflow, centered around functional programming and tidy data (https://vita.had.co.nz/papers/tidy-data.pdf), and nothing in Python comes close. R has many warts, but its lisp roots and metaprogramming strengths have allowed the tidyverse devs, and other excellent programmers working with R, to dramatically improve the language, and spawn a whole new style of statistical programming.
Can you elaborate on what tidyverse offers you that the python ecosystem doesn't? 'Nothing comes close' is a couple degrees too strong a statement from my experience with R, but maybe you know something I don't.
Tidyverse offers a programming style based around piping dataframes through a chain of endomorphisms ("verbs"). Closest things that come to mind are SQL and d3. Pandas feels clumsy by comparison.
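
For anyone who hasn't seen the style, here is a minimal pure-Python sketch of "piping a table through a chain of verbs". It is not tidyverse or pandas code: `pipe`, `filter_rows`, `mutate` and `arrange` are hypothetical helpers made up for illustration, and the "dataframe" is just a list of dicts. Each verb takes a table and returns a table, which is what makes the chain composable.

```python
from functools import reduce

# A "table" here is just a list of row dicts; every verb maps table -> table.
def filter_rows(pred):
    return lambda rows: [r for r in rows if pred(r)]

def mutate(**new_cols):
    # Each keyword is a new column name mapped to a function of the row.
    return lambda rows: [
        {**r, **{name: fn(r) for name, fn in new_cols.items()}} for r in rows
    ]

def arrange(key, descending=False):
    return lambda rows: sorted(rows, key=lambda r: r[key], reverse=descending)

def pipe(rows, *verbs):
    # Apply the verbs left to right, in the spirit of R's %>% operator.
    return reduce(lambda acc, verb: verb(acc), verbs, rows)

# Made-up data for illustration only.
cars = [
    {"name": "a", "mpg": 21.0, "wt": 2.6},
    {"name": "b", "mpg": 33.9, "wt": 1.8},
    {"name": "c", "mpg": 15.0, "wt": 3.6},
]

result = pipe(
    cars,
    filter_rows(lambda r: r["mpg"] > 20),
    mutate(mpg_per_wt=lambda r: r["mpg"] / r["wt"]),
    arrange("mpg_per_wt", descending=True),
)
```

Because every verb returns a new table and leaves its input alone, steps can be reordered, dropped, or inspected one at a time without touching the rest of the chain.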
Uhhh but pandas is literally a chained architecture? Have you actually used it?
I have used pandas extensively, it was my main statistics environment for a couple years before I switched back to R for tidyverse. At the time chaining was not well supported or idiomatic; multi-indexing was all the rage.

I still occasionally use pandas with seaborn when it's not worth it to switch out to R. I don't think it can match the tidyverse+ggplot combo for quickly exploring and making beautiful plots. But this discussion has inspired me to do some googling and it seems like some people are using tidyverse-like workflows in pandas (https://stmorse.github.io/journal/tidyverse-style-pandas.htm...). Doesn't seem quite as smooth but I'll definitely be trying it out next time I'm working in pandas.

I've used both and have two additional comments.

Some of the dplyr elegance comes from the flexible evaluation mechanism in R, whereby mutate(data, col1+col2) works because the second arg is evaluated in an enriched environment. Python eschews this kind of macro-like extension because, my guess is, tampering with evaluation makes a lot of other things complicated (for instance, forget replacing args with their values; that doesn't work anymore). I think the author of dplyr himself has, in later work, promoted the use of the ~ operator to explicitly block eval of an argument and at least make these departures from regular eval explicit. That means dplyr is ahead for interactive use, but for programming you have to switch to a separate API (the underscore "verbs"), and that makes the transition from interactive work to coding a bit steeper. It's all trade-offs, and I am not saying that I know better than either the pandas or dplyr authors.
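
To make "evaluated in an enriched environment" concrete, here is a rough Python sketch of the idea. It is only an analogy, not how dplyr is implemented: this `mutate` is a hypothetical helper, and because Python evaluates arguments before the call, the expression has to be passed as a string so the function can evaluate it later with the row's columns in scope.

```python
def mutate(rows, **exprs):
    """Add columns computed from string expressions.

    Each expression is evaluated with the row's columns as local names,
    a rough stand-in for R evaluating an argument in an enriched environment.
    """
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for name, expr in exprs.items():
            # Builtins are disabled, so only column names resolve;
            # a name that is not a column raises NameError.
            new_row[name] = eval(expr, {"__builtins__": {}}, new_row)
        out.append(new_row)
    return out

# Made-up data for illustration only.
data = [{"col1": 1, "col2": 10}, {"col1": 2, "col2": 20}]
summed = mutate(data, total="col1 + col2")
```

The string is the price of explicitness: the caller can see that `col1 + col2` is not evaluated in their own scope. pandas takes a similar string-based route with `DataFrame.eval` and `DataFrame.query`.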

As to ggplot, if you believe the future of statistical graphics is in-browser and interactive, you should take a look at altair for python (I myself created a small extension to it called altair_recipes). It's based on vega, like ggplot's anointed (but not quite ready) successor ggvis, and uses the grammar of graphics (or an interpretation thereof) like ggplot, with extensions for interaction. Simpler than D3 by most accounts.

I cannot understand why I would use Python over R. R is designed from the ground up for massive amounts of data processing at speed and with ease. Even if Python continues accreting computational functionality, it will never be as fast or as efficient as R. Improving Python for something R is designed to do seems to me to be a huge waste of time: familiarity should not be the driving force behind replicating R's functionality. That's just so wrong.
> R is designed from the ground up for massive amounts of data processing at speed

What? The R ecosystem doesn't provide meaningful out of core capabilities, never mind the ability to handle anything approaching 'massive amounts of data'.

-- Would sure love to know why an agenda-less factual comment is getting downvoted.

In my experience, R is really fast since it was designed to store data in columnar format, which we now all know is best for data analysis. So, in most cases, scaling up computation is quite easy. To scale out, you can use Apache Spark with R; the interface I’ve worked on, sparklyr, is quite easy to use and allows you to scale out computation. Just to give you an example of what’s possible, I was playing around yesterday with a ray tracing prototype someone is building and scaled it out in Spark, see https://twitter.com/javierluraschi/status/112055769372135424... It’s a misconception that R is slow or can’t scale.
You can plug any compute kernel you want into spark, that's not a pro or con of R.

Column stores are standard in any analytics pipeline today. They make up Python's Pandas, R's dplyr, and Java's DataFrame. How or why does R stand out for 'massive amounts of data'?
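
For context on what "columnar" means here, a minimal stdlib-only sketch of the two layouts. It says nothing about how pandas, R, or Spark actually store data internally; the records and column names are made up for illustration.

```python
from array import array

# Row-oriented: each record is stored together.
rows = [
    {"id": 1, "price": 9.5, "qty": 3},
    {"id": 2, "price": 4.0, "qty": 10},
    {"id": 3, "price": 7.25, "qty": 1},
]

# Column-oriented: one contiguous, typed array per column.
columns = {
    "id": array("q", (r["id"] for r in rows)),
    "price": array("d", (r["price"] for r in rows)),
    "qty": array("q", (r["qty"] for r in rows)),
}

# Aggregating one column walks a single array...
total_price_columnar = sum(columns["price"])
# ...whereas the row layout has to visit every record to pull out one field.
total_price_rows = sum(r["price"] for r in rows)
```

Both layouts give the same answer; the difference is in memory access, since an aggregate over one column only has to touch that column's array.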

R does not have meaningful out of core compute offerings that compare with something like Dask.
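
For anyone unfamiliar with the term, "out of core" just means computing over data that doesn't fit in memory by streaming it in pieces. A bare-bones stdlib sketch of the idea follows. This is not Dask's API or how Dask works internally; `chunked` and `streaming_mean` are hypothetical helpers, and the in-memory `StringIO` stands in for a file too large for RAM.

```python
import csv
import io
from itertools import islice

def chunked(iterable, size):
    """Yield lists of at most `size` items without materializing the input."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def streaming_mean(lines, column, chunk_size=2):
    """Mean of one CSV column, holding only one chunk of rows at a time."""
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for chunk in chunked(reader, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")

# Stand-in for a huge file; an open file handle would work the same way.
fake_file = io.StringIO("x,y\n1,10\n2,20\n3,30\n4,40\n5,50\n")
mean_y = streaming_mean(fake_file, "y")
```

Only running totals survive between chunks, so memory use is bounded by the chunk size rather than the file size. Libraries like Dask generalize this to operations that don't reduce so neatly.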

R does not at all have cluster compute offerings that compare to Dask Distributed.

If you want to know what real performance looks like, check out Python's cudf which will shortly fully match the Pandas api. That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.

Whatever advantages R has, perf or scalability are definitely not amongst them.

You are arguing for Python and speed in the same breath? If you want portable speed, you better "warm up a chair" and master Fortran.

Bonus: modern Fortran is a joy to develop in, far more fun than Python. And you get to compile to machine code, either for a processor or a GPU.

> That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.

I don't see how the "GPU DataFrames" provided in cuDF would enhance a raytracer in any way.

You don’t see how a gpu accelerated numeric array would speed up ray tracing?