Hacker News new | ask | show | jobs
by disgruntledphd2 3113 days ago
What I really like about dplyr is how simple it is. It essentially provides an SQL like selection of verbs (select, mutate, summarise, arrange) and handles lots of things for you. As an example, these two statements are equivalent:

mydf$newvar <- with(mydf, oldvar1/oldvar2)

mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the results of one operator into the next.

The real advantages is that you can easily build up a selection of functions which can be read from left to right (rather than right to left in summary(coef(mylm))) and the reduction in temporary variables.

Pandas, on the other hand looks like base R (which is fine, but not as nice as dplyr).

However, the niceness of pipes does all fall apart when you have an error in the middle and you need to start deleting things in order to debug.

2 comments

So in pandas it's kinda similar:

> df[newvar] = df[oldvar1] / df[oldvar2]

And instead of the pipe, we have chaining for which is super straightforward and readable:

> df[newvar] = (df[oldvar1] / df[oldvar2]).abs().rank().astype(str).str[:4]

and for more complex or non-chainable functions we have .pipe:

https://pandas.pydata.org/pandas-docs/stable/generated/panda...

which looks super similar to dplyr to me!

the data.table way:

mydt[, newvar := oldvar1/oldvar2]

I could not resist.