by closed 3349 days ago
You're not alone! I think pandas made some design decisions around their transformation functions that make it a lot more cumbersome to use than R's dplyr. It's not obvious from the documentation, though.

As an example from the pandas docs [1], in dplyr you can do

> gdf <- group_by(df, col1)

> summarise(gdf, avg=mean(col1))

In pandas this is similar to

> df.groupby('col1').agg({'col1': 'mean'})

But dplyr's summarise is much more flexible than agg, as you can apply arbitrary expressions to any number of columns. E.g.

> summarise(gdf, some_name = f1(col1) + f2(col2))

But with agg in pandas, each function is applied to a single column.

[1] http://pandas.pydata.org/pandas-docs/stable/comparison_with_...

4 comments

Swings and roundabouts. I'm a big fan of dplyr, and R definitely does some things better than pandas, but I've never found anything as flexible as pd.pivot_table for cross tabulations. For instance, the lack of multiindexing in R is a big drawback.
you can supply a map, ie:

gdf = df.groupby('col1').agg({'col2': np.mean, 'col3': np.std, 'col4': lambda x: np.mean(x) / np.std(x)})

once you've got your grouped dataframe, go nuts

gdf['some_name'] = gdf['col1'].apply(f1) + gdf['col2'].apply(f2)

The summarise example I gave creates a single new column (some_name) that is a function of two columns from the grouped data frame. Passing a map to agg just creates multiple columns, each a function of at most a single column in the dataframe.

(I should have used a better example, like summarise(gdf, some_col = f(cola / colb)))

Totally a preference thing. I strongly prefer pandas to dplyr having worked with both.
You can supply a map to agg, something like {col1:'sum', col2: 'mean',...}
Yes, but it is still applying each function to a single column. The summarise example I gave could aggregate multiple columns into a single result, e.g. sum(col1 / col2).
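
For what it's worth, here is a rough sketch of how the sum(col1 / col2) case could be written in pandas. The toy data, the grouping column `key`, and the helper `ratio_sum` are made up for illustration; this is one possible approach, not necessarily the idiomatic one. It uses groupby(...).apply, which hands the function the whole sub-frame rather than one column, and then the alternative of building the combined column first and aggregating it.

```python
import pandas as pd

# Hypothetical toy data; "key" is an assumed grouping column.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [2.0, 4.0, 1.0, 2.0],
})

def ratio_sum(group):
    # Receives a sub-DataFrame per group, so it can combine
    # several columns into one scalar.
    return (group["col1"] / group["col2"]).sum()

# Option 1: apply on the grouped frame; selecting the columns
# explicitly keeps the grouping key out of the function's input.
via_apply = df.groupby("key")[["col1", "col2"]].apply(ratio_sum)

# Option 2: compute the combined column first, then aggregate it.
via_assign = (
    df.assign(ratio=df["col1"] / df["col2"])
      .groupby("key")["ratio"]
      .sum()
)

print(via_apply)
print(via_assign)
```

Both give one value per group. Option 2 works here only because the division is row-wise and can be done before grouping; an expression like f1(col1) + f2(col2), where f1 and f2 are themselves aggregates, would need the apply form or two separate aggregations combined afterwards.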