by ploika 3115 days ago
I've been a heavy R user for about 7 years, and I only slightly disagree with one of your points.

(In my opinion) R is best for traditional statistics, as opposed to AI, machine learning, predictive analytics, data science, data analysis or any other variant thereof.

If you're more concerned with Chi-squared tests than unit tests, or if you need to teach a mathematician or a biologist how to fit regression models and analyse residuals, goodness-of-fit statistics, p-values etc, then R is the best language for the job.

If you need to build a program (as opposed to just do a thing), or if you're more interested in accuracy than inference (as per most machine learning tasks), then Python with sklearn and pandas blows R out of the water.


> Python with sklearn and pandas blows R out of the water.

For some things yes, but for others the reverse is true. I'm also a heavy R and Python user and find the two ecosystems extremely complementary. For building pipelines and web apps, Python has an edge. For statistics, graphics, and data management, R is IMO superior. You can do everything in either language, but you have to jump through hoops in some cases. Sometimes the best solution is to use both!

For example, I run an internal web app for A/B testing using Django and rpy2. Doing it all in Python would have been sub-optimal because dataset management is so much simpler in R. Plots that were easy to do in ggplot2 were impossible to get right in matplotlib. The big drawback to this method is R's single-threaded architecture. Embedding R in a web server process is not easy (ask me!), and won't scale as well as a multi-threaded environment can.

All my data exploration and prototyping happens in R. Even basic report scripting can be done better in R than in Python because of the ease of data management. Consider a typical case of 1) run database query, 2) munge data around to produce a table, and 3) email or save to html. If you can't get exactly what you want from the database in one query and you have to do a lot of munging in step 2, then R is going to be more flexible than Python. If I need to merge, aggregate, or recode variables, I would much rather use R. Doing all this with a list of lists "dataset" in Python is convoluted at best, and means recreating a lot of the functionality that base R gives you.

Do you not use pandas?
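
For anyone following along, here is a minimal sketch of what the merge / aggregate / recode steps from the parent comment might look like with pandas instead of a list of lists. The tables, column names and the country-to-region mapping are all made up for illustration; nothing here comes from the parent's actual app.

```python
import pandas as pd

# Hypothetical data standing in for the result of a database query.
visits = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "variant": ["A", "B", "A", "B", "A", "B"],
    "converted": [1, 0, 0, 1, 0, 1],
})
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "country": ["US", "GB", "US", "DE", "GB", "US"],
})

# Merge: join the two tables on their shared key.
merged = visits.merge(users, on="user_id", how="left")

# Recode: collapse country into a coarser (made-up) region variable.
region_map = {"US": "Americas", "GB": "Europe", "DE": "Europe"}
merged["region"] = merged["country"].map(region_map)

# Aggregate: number of users, conversions and conversion rate per variant.
summary = merged.groupby("variant", as_index=False).agg(
    n=("converted", "size"),
    conversions=("converted", "sum"),
    rate=("converted", "mean"),
)

print(summary)
```

Whether this reads better or worse than the dplyr or data.table equivalent is exactly what the rest of the thread argues about.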
pandas is absolutely terrible compared to dplyr, data.table, or even base R for data manipulation. And while you would have been right about Python being better for machine learning a couple of years ago, these days basically every popular machine learning library in Python (TensorFlow, Keras, etc.) now has an API in R.

I also don't know why you are separating "traditional statistics", "predictive analytics", and "data analysis". They often are the exact same thing. In fact, it makes me wonder how much experience you have with statistics if you are under the impression that it is somehow different from data analysis "or any other variant thereof".

You are right on exactly one count: Python is superior for putting data analytics into production. And that isn't an insignificant advantage. A lot of data science today involves packaging an analysis into some larger program or product, and Python is absolutely better suited to that task.

But in virtually every other case (including lots of machine learning problems), R is either as good as or greatly superior to Python.

I did start my post with the words "in my opinion". I am not right or wrong about anything, and neither are you. We're mostly talking about syntax preferences here.

I'm separating out traditional statistics as an alias for statistical inference - make distributional assumptions, test them, estimate the effect of X on y and put a 95% confidence interval around it. That sort of stuff.
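
To make "estimate the effect of X on y and put a 95% confidence interval around it" concrete, here is a rough sketch using only the Python standard library. The data are invented, the helper name slope_with_ci is hypothetical, and the interval uses a normal (large-sample) quantile as a simplifying assumption; with a sample this small a t quantile would be the more careful choice, which is what R's confint() on an lm fit uses.

```python
from statistics import NormalDist, mean


def slope_with_ci(x, y, level=0.95):
    """Least-squares slope of y on x with an approximate confidence interval.

    Assumes the usual linear-model conditions and uses a normal quantile
    in place of a t quantile (a large-sample approximation).
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance with n - 2 degrees of freedom, then the slope's standard error.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se = (sigma2 / sxx) ** 0.5

    z = NormalDist().inv_cdf(0.5 + level / 2)
    return slope, (slope - z * se, slope + z * se)


# Made-up data with a true slope of roughly 2.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

slope, (lower, upper) = slope_with_ci(x, y)
print(f"slope = {slope:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```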

It's the stuff that absolutely does not matter if you're assessing the overall effectiveness of a classifier, and certainly isn't needed in a lot of data analysis tasks where all you need are variations of counts and percentages.

For the record, my academic background is maths and statistics. I've picked up any software development experience on the job.

> pandas is absolutely terrible compared to the dplyr, data.table, or even base R for data manipulation.

I would really like to hear a bit more about this, because this would greatly increase my motivation to learn more R. Specifically I've fiddled around with dplyr and it definitely feels more DSL-y but I didn't see a crazy benefit there. What are some of your favourite things about dplyr / data.table?

Took me a while to get back to you, but essentially dplyr is fantastic for readability and reproducibility. Reading through someone else's analysis, or even my own long after the fact, is typically far easier with dplyr than with base R, data.table, or pandas.

data.table's advantage lies in its speed. It is by far the fastest of the three options. In just about every benchmark it either is significantly faster than pandas or at the very least is approximately equal.

Pandas is lauded by people who strictly use Python, and it really is fantastic considering how ridiculous data manipulation would be in Python without it. But it's also the only option a Python user really has, so they've become married to the idea that it is best.

Basically, if you are using Python, use pandas. If you have an option, go for data.table for speed, dplyr for clarity, or a mix of the two if desired.

What I really like about dplyr is how simple it is. It essentially provides an SQL-like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you. As an example, these two statements are equivalent:

mydf$newvar <- with(mydf, oldvar1/oldvar2)

mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the results of one operator into the next.

The real advantages are that you can easily build up a sequence of functions which can be read from left to right (rather than right to left, as in summary(coef(mylm))), and that you need fewer temporary variables.
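
For readers who know Python better than R, here is a rough sketch of the same left-to-right style using pandas method chaining. The data frame and the "group" column are invented for illustration, and the comments map each step to the dplyr verb it loosely corresponds to; the mapping is approximate, not exact.

```python
import pandas as pd

# Made-up data reusing the column names from the R example above.
mydf = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "oldvar1": [10.0, 20.0, 30.0, 40.0],
    "oldvar2": [2.0, 4.0, 5.0, 4.0],
})

result = (
    mydf
    .assign(newvar=lambda d: d["oldvar1"] / d["oldvar2"])  # roughly mutate()
    .loc[:, ["group", "newvar"]]                           # roughly select()
    .groupby("group", as_index=False)                      # roughly group_by()
    .agg(mean_newvar=("newvar", "mean"))                   # roughly summarise()
    .sort_values("mean_newvar", ascending=False)           # roughly arrange(desc())
    .reset_index(drop=True)
)

print(result)
```

Each step returns a new data frame, so no temporary variables are needed and the steps read top to bottom.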

Pandas, on the other hand, looks like base R (which is fine, but not as nice as dplyr).

However, the niceness of pipes all falls apart when you have an error in the middle and you need to start deleting things in order to debug.

So in pandas it's kinda similar:

> df["newvar"] = df["oldvar1"] / df["oldvar2"]

And instead of the pipe, we have chaining, which is super straightforward and readable:

> df["newvar"] = (df["oldvar1"] / df["oldvar2"]).abs().rank().astype(str).str[:4]

and for more complex or non-chainable functions we have .pipe:

https://pandas.pydata.org/pandas-docs/stable/generated/panda...

which looks super similar to dplyr to me!
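
A small sketch of what .pipe looks like in practice, in case the link above is hard to follow. The helper functions add_ratio and top_n are hypothetical, written just for this example; only DataFrame.pipe, assign and nlargest are actual pandas methods.

```python
import pandas as pd


def add_ratio(df, num, den, out):
    """Return a copy of df with a new column out = df[num] / df[den]."""
    return df.assign(**{out: df[num] / df[den]})


def top_n(df, col, n):
    """Return the n rows with the largest values in col."""
    return df.nlargest(n, col)


# Made-up data reusing the column names from the examples above.
df = pd.DataFrame({
    "oldvar1": [10.0, -20.0, 30.0],
    "oldvar2": [2.0, 4.0, 5.0],
})

# .pipe passes the data frame as the first argument of each function,
# so ordinary functions can sit in the middle of a method chain.
result = (
    df
    .pipe(add_ratio, "oldvar1", "oldvar2", out="newvar")
    .pipe(top_n, "newvar", 2)
)

print(result)
```

Because add_ratio uses assign, the original df is left untouched, which is closer to how dplyr::mutate behaves than the in-place df["newvar"] = ... assignment.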

the data.table way:

mydt[, newvar := oldvar1/oldvar2]

I could not resist.