Hacker News (HN Mirror)

by thousandautumns 3118 days ago

pandas is absolutely terrible compared to dplyr, data.table, or even base R for data manipulation. And while you would have been right about Python being better for machine learning a couple of years ago, these days basically every popular machine learning library in Python (TensorFlow, Keras, etc.) has an API in R.

I also don't know why you are separating "traditional statistics", "predictive analytics", and "data analysis". They often are the exact same thing. In fact, it makes me wonder how much experience you have with statistics if you are under the impression that it is somehow different from data analysis "or any other variant thereof".

You are right on exactly one count: Python is superior for putting data analytics into production. And that isn't an insignificant advantage. A lot of data science today involves packaging an analysis into some larger program or product, and Python is absolutely better suited to that task.

But in virtually every other case (including lots of machine learning problems), R is either as good as or greatly superior to Python.

2 comments

ploika 3118 days ago

I did start my post with the words "in my opinion". I am not right or wrong about anything, and neither are you. We're mostly talking about syntax preferences here.

I'm separating out traditional statistics as an alias for statistical inference - make distributional assumptions, test them, estimate the effect of X on y and put a 95% confidence interval around it. That sort of stuff.

It's the stuff that absolutely does not matter if you're assessing the overall effectiveness of a classifier, and certainly isn't needed in a lot of data analysis tasks where all you need are variations of counts and percentages.

For the record, my academic background is maths and statistics. I've picked up any software development experience on the job.
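For readers less familiar with the distinction, here is a rough sketch of what "estimate the effect of X on y and put a 95% confidence interval around it" can look like in code. The data are made up purely for illustration, and the interval uses a normal approximation (z of about 1.96) rather than the t distribution a statistics package would normally use, so treat it as a sketch under those assumptions and not as a recommended procedure.

```python
from math import sqrt
from statistics import NormalDist, mean

# Toy data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Ordinary least squares estimates for y = intercept + slope * x.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual variance and the standard error of the slope.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sigma2 = sum(r * r for r in residuals) / (n - 2)
se_slope = sqrt(sigma2 / sxx)

# Assumption: normal approximation instead of a t quantile with n - 2 df.
z = NormalDist().inv_cdf(0.975)
ci = (slope - z * se_slope, slope + z * se_slope)

print(f"slope = {slope:.3f}, approx. 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

None of this machinery is needed to report counts and percentages or to score a classifier on held-out data, which is the contrast being drawn above.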


makmanalp 3118 days ago

> pandas is absolutely terrible compared to the dplyr, data.table, or even base R for data manipulation.

I would really like to hear a bit more about this, because this would greatly increase my motivation to learn more R. Specifically I've fiddled around with dplyr and it definitely feels more DSL-y but I didn't see a crazy benefit there. What are some of your favourite things about dplyr / data.table?


thousandautumns 3117 days ago

Took me a while to get back to you, but essentially dplyr is fantastic for readability and reproducibility. Reading through someone else's analysis, or even my own long after the fact, is orders of magnitude easier than it typically is with base R, data.table, or pandas.

data.table's advantage lies in its speed. It is by far the fastest of the three options. In just about every benchmark it is either significantly faster than pandas or at the very least roughly on par with it.

Pandas is lauded by people who strictly use Python, and it really is fantastic considering how ridiculous data manipulation would be in Python without it. But it's also the only option a Python user really has, so they've become married to the idea that it is best.

Basically, if you are using Python, use pandas. If you have an option, go for data.table for speed, dplyr for clarity, or a mix of the two if desired.


disgruntledphd2 3117 days ago

What I really like about dplyr is how simple it is. It essentially provides an SQL-like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you. As an example, these two statements are equivalent:

mydf$newvar <- with(mydf, oldvar1/oldvar2)

mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the result of one operation into the next.

The real advantages are that you can easily build up a sequence of functions which reads from left to right (rather than inside-out, as in summary(coef(mylm))), and that you need fewer temporary variables.

Pandas, on the other hand, looks like base R (which is fine, but not as nice as dplyr).

However, the niceness of pipes does all fall apart when you have an error in the middle and you need to start deleting things in order to debug.
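To make the left-to-right point concrete outside R, here is a minimal Python sketch. The `pipe` helper and the three step functions are hypothetical names invented for this example; they are not part of dplyr or pandas. It only shows how a pipeline reads in application order, where nested calls read inside-out.

```python
from functools import reduce


def pipe(value, *steps):
    """Apply each step to the result of the previous one, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical steps standing in for verbs like filter / mutate / summarise.
def drop_missing(rows):
    return [r for r in rows if r is not None]


def square(rows):
    return [r * r for r in rows]


def total(rows):
    return sum(rows)


data = [1, None, 2, 3]

nested = total(square(drop_missing(data)))       # reads inside-out
piped = pipe(data, drop_missing, square, total)  # reads left to right

# Dropping the last step exposes the intermediate result, which is
# roughly what debugging a broken pipeline involves.
partial = pipe(data, drop_missing, square)
```

Both forms compute the same value; the difference is purely in how the steps read, and in having to trim the chain to inspect what happens in the middle.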


makmanalp 3117 days ago

So in pandas it's kinda similar:

> df["newvar"] = df["oldvar1"] / df["oldvar2"]

And instead of the pipe, we have method chaining, which is super straightforward and readable:

> df["newvar"] = (df["oldvar1"] / df["oldvar2"]).abs().rank().astype(str).str[:4]

and for more complex or non-chainable functions we have .pipe:

https://pandas.pydata.org/pandas-docs/stable/generated/panda...

which looks super similar to dplyr to me!
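For anyone who wants to try this, here is a small self-contained sketch of both styles. The toy values and the `add_ratio` helper are assumptions made up for the example; `DataFrame.pipe`, `DataFrame.assign`, and the Series methods are the standard pandas ones.

```python
import pandas as pd

# Toy frame, invented for illustration.
df = pd.DataFrame({"oldvar1": [1.0, -4.0, 9.0], "oldvar2": [2.0, 2.0, 3.0]})

# Method chaining on a Series, as in the one-liner above.
chained = (df["oldvar1"] / df["oldvar2"]).abs().rank().astype(str).str[:4]


# Hypothetical helper: a plain function that takes a frame and returns
# a new frame, so it can be dropped into a chain with .pipe.
def add_ratio(frame, num, den, out):
    return frame.assign(**{out: frame[num] / frame[den]})


result = (
    df
    .pipe(add_ratio, "oldvar1", "oldvar2", "newvar")
    .assign(abs_rank=lambda d: d["newvar"].abs().rank())
    .sort_values("abs_rank")
)

print(result)
```

Because `assign` and `pipe` return new frames, the original `df` is left untouched, which is about as close as pandas gets to the dplyr style of passing a data frame from verb to verb.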


vhhn 3116 days ago

the data.table way:

mydt[, newvar := oldvar1/oldvar2]

I could not resist.
