Hacker News
by billfruit 2610 days ago
I sometimes wonder whether there is any reason to learn R at all, since python eco system has absorbed most of its advanced statistical functionality, coupled with the fact that the Python environment is much more general, with capabilities to fetch and decode/encode data, work with binary data and databases, web frameworks for presenting results, etc.

I use both Python and R. tidyverse/ggplot2 alone are enough reason to use R, and are substantially faster for tasks that utilize those packages than the equivalent in Python (in my opinion).

I haven't had as much reason to use base R, though. For more ML-related tasks I do go back to Python.

Hear, hear. Tidyverse also provides a centralised 'this is how you do X' nexus, which really helps discoverability. World class stuff, on tap.

For example, I know the recommended pipe in R is magrittr's %>%. I have no idea what the respectable pipe library in Python is, or even if there is one.
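
For what it's worth, pandas itself ships a `DataFrame.pipe` method that plays a role similar to %>%, though it is a method call rather than an operator. A minimal sketch, with made-up data and two hypothetical helper functions (`drop_missing`, `add_ratio`) that are not part of any library:

```python
import pandas as pd

# Toy data; the column names are invented for illustration.
df = pd.DataFrame({"x": [1.0, 2.0, None, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

def drop_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: drop rows with any missing value."""
    return frame.dropna()

def add_ratio(frame: pd.DataFrame, num: str, den: str) -> pd.DataFrame:
    """Hypothetical step: add a num/den ratio column."""
    return frame.assign(ratio=frame[num] / frame[den])

# Reads top to bottom, roughly like df %>% drop_missing() %>% add_ratio(...)
result = (
    df
    .pipe(drop_missing)
    .pipe(add_ratio, num="y", den="x")
)
```

Whether this counts as "the respectable pipe" is a matter of taste, but it needs no third-party pipe library.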

I wouldn't even know where to start finding all the tidyverse equivalents in Python. It isn't as organised and obvious as the R statistics community.

On the other hand Base R is the worst. Disgusting language.

Julia has a |> operator and it works amazingly well with Queryverse.jl which is a clone of Tidyverse!
This. I’ve contributed code to popular libraries in both languages, and while I (overall) have a preference for python (mostly due to it being general purpose), I find R code unparalleled when it comes to raw data manipulation/analysis.

The overall api of tidyverse packages is such a joy, and recent improvements in purrr/tidyr allow me to construct nested data analysis workflows I couldn’t even dream of in python.

One random example I found recently is a tidyverse package called forcats that has lots of nice functions for categorical data. For example, it has a single function that merges all categories with a frequency of less than a certain threshold in the table into a new category like "other" or whatever. This is a task I often need to do, but as far as I can see it's a bit of a hack in python or pandas. It's just lots of little things like this, especially wrangling data tables.

https://forcats.tidyverse.org/reference/fct_lump.html
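
For comparison, the pandas version of this is usually a few lines built from `value_counts` and `where`. A rough sketch in the spirit of fct_lump, not a full equivalent; the helper name `lump_rare` and its parameters are made up here:

```python
import pandas as pd

def lump_rare(values: pd.Series, min_count: int, other: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with `other`."""
    counts = values.value_counts()
    keep = counts[counts >= min_count].index
    # Series.where keeps values where the condition holds, else substitutes `other`.
    return values.where(values.isin(keep), other)

colours = pd.Series(["red", "red", "red", "blue", "blue", "green", "teal"])
lumped = lump_rare(colours, min_count=2)
```

Here "green" and "teal" each appear once, so both end up as "Other" while "red" and "blue" are kept.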

There's also the data.table package for this kind of data work, which is maybe less used but seems to have better performance.

Would you have an example of that?
Seconded on all points. I do branch out to SQL for some things, and I find that R and Python both play nicely with it. But as long as ggplot exists and Python doesn't have it, R will never really leave my side.
I'm finding a fairly nice combo is rmarkdown, Python and reticulate: do the things that are easier in Python there, and produce the outputs in R. Debugging isn't where I'd like it to be yet, but there might be a way of improving that; I haven't explored it yet.
> since python eco system has absorbed most of its advanced statistical functionality

This isn't true at all...

Also, all advanced statistics books use either SAS or R. If it's R, there is always a package that the author created.

Just look at the books from publishers like Chapman & Hall/CRC or Springer.

Go here: https://www.jstatsoft.org/index

Count the number of R packages in those papers versus Python.

I don't even need a source. I'm a statistician and I'm going to get a paper there and publish an R package for my master's thesis.

I use both python and R almost every day.

Although I like R and often use R to quickly order tabulated data, there are a few things to take into account that in recent times are building a strong case for me not to use R habitually.

Development in R is frustrating. If you don't need to do dev, then on this point you are home free. Testing things that you deploy in R is not simple.

Scripting in R can be frustrating. I have a script that traverses Excel files, and using tryCatch() is just so much more complicated with it being a function. In Python, try/except is part of the language syntax.
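
To illustrate the Python side of that comparison, a loop over files with try/except might look like the sketch below. The `load` callable is a stand-in (with pandas it could be `pd.read_excel`), and `load_all` and `fake_load` are hypothetical names used only for this example:

```python
from pathlib import Path
from typing import Callable, Iterable

def load_all(paths: Iterable[Path], load: Callable[[Path], object]):
    """Try to load every path; collect failures instead of stopping."""
    loaded, failed = {}, {}
    for path in paths:
        try:
            loaded[path.name] = load(path)
        except Exception as exc:  # broad on purpose: record the error and move on
            failed[path.name] = repr(exc)
    return loaded, failed

# Stand-in loader for the example: it rejects anything that is not .xlsx.
def fake_load(path: Path) -> str:
    if path.suffix != ".xlsx":
        raise ValueError(f"not an Excel file: {path.name}")
    return f"contents of {path.name}"

loaded, failed = load_all([Path("a.xlsx"), Path("notes.txt")], fake_load)
```

In a real script the paths would more likely come from something like `Path(folder).rglob("*.xlsx")`.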

There are scenarios where R is better. If you are in actuarial science, research or academics then often you'll find R libraries that just work.

R treats tabular data with grace. Everything in R is an array.

The takeaway for me is that I should use R less and Python more. I personally can't deal with something like tryCatch() being overcomplicated, but for people who don't do dev anyway and maybe need to analyse DNA sequences for a living, R can be rewarding. For me: the ggplot2 library is great; stay away from Shiny and dev in R.

Interesting, why do you advise people to stay away from Shiny?
It tries to do HTML, but it is limited. So I'd rather use JavaScript to manipulate the frontend directly.

It tries to do functional programming, but the documentation is not satisfying. The responses and behaviour are perplexing.

I spent around 5-10 hours trying to get a Shiny GUI to work and eventually came to the conclusion that 1) if you want a big project, do all the frontend stuff in something else, like JS, and 2) if you want a small project, try something established (I am not advocating, it's just an example) like Power BI.

Regarding the limited frontend capabilities, I had a similar opinion at one point, but with Shiny's HTML templating (https://shiny.rstudio.com/articles/templates.html) functionality one can circumvent the limited HTML that Shiny has out of the box. Besides that, there is also the possibility to communicate with R using JavaScript (https://shiny.rstudio.com/articles/communicating-with-js.htm...). These two functionalities combined allow for a frontend that is much more flexible than traditional Shiny applications. Of course, there may well be better solutions out there that fit your use case, and Shiny's real use is primarily in sharing data analysis.
Not really in my experience. Really, the only place where I'd say Python has gotten more support so far than R is in deep learning. If you want any just-published statistical method, the associated implementation will almost inevitably be in R. But that's today -- I'm old enough to remember when the standard language in "The Journal of Statistical Software" was XLISP-STAT (much of the 1990s).
I use mixed effects models pretty extensively. While there is an implementation in statsmodels the implementation in lme4 is more user friendly and has a more mature ecosystem of post-hoc tests.
I don't think there's any reason to learn R for anyone who is already proficient at programming. Despite being proficient with R, the only times I used it in the last two years were for ggplot. And even for data vis, I'm increasingly using Python and JS.

There's a bunch of comments below which can be summed up with 'use R because <package name> doesn't have a direct python equivalent' but they're all missing the point that the Python data science ecosystem is evolving at a much faster pace than R and will completely supersede it in a few years.

R, like SAS, is a tool for non-programmers. And there it shall remain. The only demographic where R makes sense long term are pure mathematicians/statisticians who are not proficient in programming. But that demographic is rapidly declining in size.

> There's a bunch of comments below which can be summed up with 'use R because <package name> doesn't have a direct python equivalent' but they're all missing the point that the Python data science ecosystem is evolving at a much faster pace than R and will completely supersede it in a few years.

The point is R is a very good language for statistics because of the packages, not data science. Data science can do its own thing; that's okay. It's also okay for data science to use statistical models from statistics.

> R, like SAS, is a tool for non-programmers.

I respect and love data science and machine learning, but this kind of generalization is terrible. There are many wonderful programmers who contribute to R and use R, just as I am sure there are many wonderful statisticians who use Python. They're just tools.

> And there it shall remain. The only demographic where R makes sense long term are pure mathematicians/statisticians who are not proficient in programming. But that demographic is rapidly declining in size.

What is up with these generalizations? R is not going anywhere in the statistics community. It's doing fine. Also, in my experience in academia, most math people use Matlab and, if anything else, R.

It's okay to have both R and Python doing their thing.

There is no need to conflate data science and statistics or have this weird tribalism.

Everything you said can be summed up as 'R is a very good language for statistics because of the packages', which is pretty much in agreement with the GP comment.

R has nothing going for it except a rapidly dwindling number of packages that don't yet have a direct python equivalent. It doesn't make sense to invest time into R if one already knows python unless one is specifically focusing on academia pure stats type stuff.

Even then, the incoming generation of undergrads are increasingly proficient with programming and are shying away from R the same way that they shied away from Matlab after scipy matched it for 95% of their tasks.

> R has nothing going for it except a rapidly dwindling number of packages that don't yet have a direct python equivalent.

This is not a true statement.

Here is the data that goes against this statement.

1. https://www.r-bloggers.com/on-the-growth-of-cran-packages/
2. https://blog.revolutionanalytics.com/2017/01/cran-10000.html
3. https://www.r-bloggers.com/rs-remarkable-growth/

From 2015 to 2016: ~6,200 to more than 8,000 in April 2016.

From 2016 to 2017: CRAN now has 10,000 R packages.

> Even then, the incoming generation of undergrads are increasingly proficient with programming and are shying away from R the same way that they shied away from Matlab after scipy matched it for 95% of their tasks.

This is a generalization.

So far you've made opinionated negative generalizations with no data.

Python is great because it learned from Matlab and took many great ideas and much inspiration from it. But I'm not going to make sweeping negative statements about Matlab or pretend to know how it is doing when I don't have enough data or experience with it.

"It doesn't make sense to invest time into R if one already knows python unless one specifically focusing on academia pure stats type stuff."

Ha! I knew it! So it is familiarity with Python then!

R does have something else going for it: phenomenal documentation and consistency. Replicating R's thousands of available libraries will be a gargantuan effort. It is cheaper and more efficient to master R.

Tidyverse is not just some "<package name>" -- it's an entire workflow, centered around functional programming and tidy data (https://vita.had.co.nz/papers/tidy-data.pdf), and nothing in Python comes close. R has many warts, but its lisp roots and metaprogramming strengths have allowed the tidyverse devs, and other excellent programmers working with R, to dramatically improve the language, and spawn a whole new style of statistical programming.
Can you elaborate on what tidyverse offers you that the python ecosystem doesn't? 'Nothing comes close' is a couple degrees too strong a statement from my experience with R, but maybe you know something I don't.
Tidyverse offers a programming style based around piping dataframes through a chain of endomorphisms ("verbs"). Closest things that come to mind are SQL and d3. Pandas feels clumsy by comparison.
Uhhh but pandas is literally a chained architecture? Have you actually used it?
I have used pandas extensively, it was my main statistics environment for a couple years before I switched back to R for tidyverse. At the time chaining was not well supported or idiomatic; multi-indexing was all the rage.

I still occasionally use pandas with seaborn when it's not worth it to switch out to R. I don't think it can match the tidyverse+ggplot combo for quickly exploring and making beautiful plots. But this discussion has inspired me to do some googling and it seems like some people are using tidyverse-like workflows in pandas (https://stmorse.github.io/journal/tidyverse-style-pandas.htm...). Doesn't seem quite as smooth but I'll definitely be trying it out next time I'm working in pandas.
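
A tidyverse-like chain in pandas tends to look something like the following sketch. The data and column names are invented, and the dplyr verbs in the comments are only approximate counterparts:

```python
import pandas as pd

# Invented measurements; one outlier row is filtered out below.
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "b"],
    "mass_g": [10.0, 14.0, 30.0, 34.0, 200.0],
})

summary = (
    df
    .query("mass_g < 100")                                    # ~ filter()
    .assign(mass_kg=lambda d: d["mass_g"] / 1000)             # ~ mutate()
    .groupby("species", as_index=False)                       # ~ group_by()
    .agg(mean_kg=("mass_kg", "mean"), n=("mass_kg", "size"))  # ~ summarise()
    .sort_values("mean_kg", ascending=False)                  # ~ arrange(desc())
)
```

The lambda inside `assign` is what lets a later step refer to the intermediate frame, which is the part that feels least like dplyr.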

I cannot understand why I would use Python over R. R is designed from the ground up for massive amounts of data processing at speed and with ease. Even if Python continues accreting computational functionality, it will never be as fast or as efficient as R. Improving Python for something R is designed to do seems to me to be a huge waste of time: familiarity should not be the driving force behind replicating R's functionality. That's just so wrong.
> R is designed from the ground up for massive amounts of data processing at speed

What? The R ecosystem doesn't provide meaningful out-of-core capabilities, never mind the ability to handle anything approaching 'massive amounts of data'.

-- Would sure love to know why an agenda-less factual comment is getting downvoted.

In my experience, R is really fast, since it was designed to store data in columnar format, which we now all know is best for data analysis. So, in most cases, scaling up computation is quite easy. To scale out, you can use Apache Spark with R; sparklyr, the interface I've worked on, is quite easy to use and allows you to scale out computation. Just to give you an example of what's possible, I was playing around yesterday with a ray tracing prototype someone is building and scaled it out in Spark, see https://twitter.com/javierluraschi/status/112055769372135424... It's a misconception that R is slow or can't scale.
You can plug any compute kernel you want into spark, that's not a pro or con of R.

Column stores are standard in any analytics pipeline today. They make up Python's Pandas, R's dplyr, and Java's DataFrame. How or why does R stand out for 'massive amounts of data'?

R does not have meaningful out of core compute offerings that compare with something like Dask.
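
For readers unfamiliar with the term, "out of core" means working through data that does not fit in memory one piece at a time. Dask automates this; the sketch below shows only the bare idea using plain pandas chunking rather than Dask, with a tiny in-memory CSV standing in for a large file:

```python
import io
import pandas as pd

# Stand-in for a file too large to load at once.
csv = io.StringIO("key,value\na,1\nb,2\na,3\nb,4\na,5\n")

totals = None
# chunksize makes read_csv yield DataFrames of at most 2 rows each.
for chunk in pd.read_csv(csv, chunksize=2):
    partial = chunk.groupby("key")["value"].sum()
    # Combine partial sums; fill_value covers keys missing from one side.
    totals = partial if totals is None else totals.add(partial, fill_value=0)
```

Only one chunk plus the running totals is held in memory at any point, which is the property the term refers to.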

R does not at all have cluster compute offerings that compare to Dask Distributed.

If you want to know what real performance looks like, check out Python's cudf which will shortly fully match the Pandas api. That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.

Whatever advantages R has, perf or scalability are definitely not amongst them.

You are arguing for Python and speed in the same breath? If you want portable speed, you better "warm up a chair" and master Fortran.

Bonus: modern Fortran is a joy to develop in, far more fun than Python. And you get to compile to machine code, either for a processor or a GPU.

> That raytracing example you linked would run at interactive rates with cudf, I really don't see any basis for perf arguments in R's favour, and 'massive data' arguments are laughable here.

I don't see how the "GPU DataFrames" provided in cuDF would enhance a raytracer in any way.

Is there anything comparable to tidyverse and ggplot2 in python? If so I will switch immediately.
To answer your question:

ggplot2: plotnine is quite a good ggplot2 clone based on matplotlib. I feel like ggplot2 is a bit better and more complete, but if you want to do something that isn't supported, it's harder for me to hack than matplotlib.

Tidyverse: To me, ggplot2 is the only essential part of the tidyverse. Lubridate is also good. Most of the others seem like semantics and syntactic sugar. I prefer data.table, which is similar to Pandas. data.table is super fast, but imho Pandas has a more intuitive and consistent API (and if you want a speedup for large N then dask might work).

I use both R and Python on a regular basis. I choose Python for lower-level stuff, automation, and parallelism/concurrency, and R for bespoke statistics. I use both for everyday statistics and plotting, but I feel that R has slight advantages. I feel like if you're comfortable switching languages there are good reasons to use both. It's also important for me because I work with different teams that have different practices and preferences.

I don't know if it is still a thing, but if you are working with SAP HANA (in-memory database) there is a good chance you would like to learn R as they integrated it into their database.
Related: tidyverse's dbplyr lets you write tidyverse code querying almost any remote database, leveraging the DB's computation while writing code (almost) as you would for a local data frame. In my old job I got to a point where I would barely ever need to write SQL anymore because of this. https://cran.r-project.org/web/packages/dbplyr/vignettes/dbp...
Vertica did as well.
You can do all those general things in R as well. Maybe not as well developed and widely used as the Python counterparts, but definitely there.