by topper-123
2072 days ago
I'm a pandas core developer and this is very interesting to me. That `groupby.apply` is a lot slower than `groupby.agg` does not surprise me at all: `groupby.apply` can do a lot of things that `groupby.agg` can't, at the cost of being potentially a lot slower. In general, `groupby.apply` should only be used when `groupby.agg` can't do the job. However, are you saying that pandas's `groupby.agg` is faster than R's data.table, Julia, and Clojure? That surprises me a lot.
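To make the `apply` vs. `agg` distinction concrete, here is a minimal sketch on made-up data (the column names `key` and `value` and the sizes are assumptions for illustration, not the benchmark's dataset). Both calls compute the same per-group sum; the difference is in how the work is dispatched.

```python
import numpy as np
import pandas as pd

# Synthetic data, only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, size=10_000),
    "value": rng.random(10_000),
})

# agg with a named reduction can use pandas' built-in grouped implementation
via_agg = df.groupby("key")["value"].agg("sum")

# apply calls the Python function once per group: more flexible,
# but typically slower for simple reductions like this one
via_apply = df.groupby("key")["value"].apply(lambda s: s.sum())

# Both approaches give the same per-group sums
assert np.allclose(via_agg.to_numpy(), via_apply.to_numpy())
```

Timing the two lines (for example with `timeit`) on a larger frame is a quick way to see how big the gap is on a given machine.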
One possible explanation I can think of is that pandas's support for Parquet is pretty good compared to that of data.table and Julia. I've been asked to separate the read/write part from the groupby-agg part to give a more complete picture. I'll be sure to work on that in the coming weeks.
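A rough sketch of how the load step and the groupby-agg step could be timed separately. The helper `time_stages` and the stand-in loader are hypothetical names for illustration; in the actual benchmark the loader would read the benchmark's own Parquet file (for example via `pd.read_parquet`).

```python
import time
import pandas as pd

def time_stages(load, aggregate):
    """Run load() then aggregate(df), timing each stage separately (seconds)."""
    t0 = time.perf_counter()
    df = load()
    t1 = time.perf_counter()
    result = aggregate(df)
    t2 = time.perf_counter()
    return {"load_s": t1 - t0, "agg_s": t2 - t1}, result

# Stand-in loader with tiny in-memory data (assumption for illustration);
# a real run would load the benchmark's Parquet file here instead.
def load():
    return pd.DataFrame({"key": [1, 1, 2, 2], "value": [1.0, 2.0, 3.0, 4.0]})

timings, result = time_stages(
    load,
    lambda df: df.groupby("key")["value"].agg("sum"),
)
print(timings)
```

Reporting the two numbers side by side would show whether the difference between tools comes from I/O or from the aggregation itself.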
Another hypothesis, from u/joinr, about why pandas performs better on the smaller dataset:
"I wonder if there's some default column size allocation that happens up front for the 2^6 case that helps prevent growth in pandas, and maybe the heuristic falls down a little as the dataset gets larger leading to more resizing."