Hacker News
by makmanalp 3118 days ago
> pandas is absolutely terrible compared to the dplyr, data.table, or even base R for data manipulation.

I would really like to hear a bit more about this, because this would greatly increase my motivation to learn more R. Specifically I've fiddled around with dplyr and it definitely feels more DSL-y but I didn't see a crazy benefit there. What are some of your favourite things about dplyr / data.table?

2 comments

Took me a while to get back to you, but essentially dplyr is fantastic for readability and reproducibility. Reading through someone else's analysis, or even my own long after the fact, is orders of magnitude easier than base R, data.table, or pandas typically are.

data.table's advantage lies in its speed. It is by far the fastest of the three options. In just about every benchmark it either is significantly faster than pandas or at the very least is approximately equal.

Pandas is lauded by people who strictly use Python, and it really is fantastic considering how ridiculous data manipulation would be in Python without it. But it's also the only option a Python user really has, so they've become married to the idea that it is the best.

Basically, if you are using Python, use pandas. If you have an option, go for data.table for speed, dplyr for clarity, or a mix of the two if desired.

What I really like about dplyr is how simple it is. It essentially provides an SQL-like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you. As an example, these two statements are equivalent:

mydf$newvar <- with(mydf, oldvar1/oldvar2)

mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the result of one operation into the next.

The real advantages are that you can easily build up a sequence of functions which reads from left to right (rather than right to left, as in summary(coef(mylm))), and that you need fewer temporary variables.

Pandas, on the other hand, looks like base R (which is fine, but not as nice as dplyr).

However, the niceness of pipes does all fall apart when you have an error in the middle and you need to start deleting things in order to debug.

So in pandas it's kinda similar:

> df['newvar'] = df['oldvar1'] / df['oldvar2']

And instead of the pipe, we have method chaining, which is super straightforward and readable:

> df['newvar'] = (df['oldvar1'] / df['oldvar2']).abs().rank().astype(str).str[:4]

and for more complex or non-chainable functions we have .pipe:

https://pandas.pydata.org/pandas-docs/stable/generated/panda...

which looks super similar to dplyr to me!
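For anyone who wants to try it, here's a minimal sketch of chaining plus .pipe. The toy data and the add_ratio helper are made up for illustration (they're not from pandas or from this thread); the column names just follow the example above.

```python
import pandas as pd

# Hypothetical toy data; column names follow the thread's example.
df = pd.DataFrame({"oldvar1": [1.0, -4.0, 9.0], "oldvar2": [2.0, 2.0, 3.0]})


def add_ratio(frame, num, den, out):
    """Hypothetical helper: return a copy of `frame` with `out` = num / den."""
    return frame.assign(**{out: frame[num] / frame[den]})


result = (
    df
    # .pipe passes the frame as the first argument of an ordinary function,
    # roughly like `df %>% mutate(newvar = oldvar1 / oldvar2)` in dplyr.
    .pipe(add_ratio, "oldvar1", "oldvar2", "newvar")
    # .assign is the chainable way to add a column without mutating df.
    .assign(rank=lambda d: d["newvar"].abs().rank())
    # roughly `arrange(rank)`
    .sort_values("rank")
)

print(result)
```

Nothing here modifies df in place, so the whole thing reads top to bottom with no temporary variables, which is about as close to the %>% style as I know how to get in pandas.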

the data.table way:

mydt[, newvar := oldvar1/oldvar2]

I could not resist.