by disgruntledphd2
3113 days ago
What I really like about dplyr is how simple it is. It essentially provides an SQL-like set of verbs (select, mutate, summarise, arrange) and handles lots of things for you.
As an example, these two statements are equivalent:

  mydf$newvar <- with(mydf, oldvar1/oldvar2)
  mydf <- dplyr::mutate(mydf, newvar=oldvar1/oldvar2)

You can then use the pipe operator %>% to funnel the result of one function into the next. The real advantages are that you can easily build up a sequence of functions that reads from left to right (rather than right to left, as in summary(coef(mylm))), and that you need fewer temporary variables. Pandas, on the other hand, looks like base R (which is fine, but not as nice as dplyr). However, the niceness of pipes does fall apart when you have an error in the middle and need to start deleting steps in order to debug.
|
> df["newvar"] = df["oldvar1"] / df["oldvar2"]
And instead of the pipe, we have method chaining, which is super straightforward and readable:
> df["newvar"] = (df["oldvar1"] / df["oldvar2"]).abs().rank().astype(str).str[:4]
and for more complex or non-chainable functions we have .pipe:
https://pandas.pydata.org/pandas-docs/stable/generated/panda...
which looks super similar to dplyr to me!
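As a rough sketch of how that reads in practice (the toy data, the extra absvar column and the add_ratio helper are all made up for illustration, not anything from pandas itself), a chain mixing .pipe with ordinary methods might look like this:

```python
import pandas as pd

# Toy data; the column names mirror the ones used above.
df = pd.DataFrame({"oldvar1": [9.0, -4.0, 1.0], "oldvar2": [3.0, 2.0, 2.0]})


def add_ratio(frame, num, den, out="newvar"):
    """Hypothetical helper: return a copy of frame with out = num / den."""
    return frame.assign(**{out: frame[num] / frame[den]})


result = (
    df
    .pipe(add_ratio, "oldvar1", "oldvar2")        # roughly mutate(newvar = oldvar1 / oldvar2)
    .assign(absvar=lambda d: d["newvar"].abs())   # the lambda sees the intermediate frame
    .sort_values("absvar")                        # roughly arrange(absvar)
)
```

Each step returns a new DataFrame, so df itself is left untouched, and you can comment out a single line of the chain when debugging instead of unpicking nested calls.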