by akhong
2072 days ago
For this particular data and on my machine, that was certainly the case! I've been shown other benchmark results (such as this one: https://h2oai.github.io/db-benchmark/) that demonstrate otherwise. I'm not really sure what to make of it; maybe I should try more cases. One possible explanation I can think of is that Pandas' support for Parquet is pretty good compared to data.table's and Julia's. I've been asked to split the read/write part from the groupby-agg part to give a more complete picture, and I'll be sure to work on that in the coming weeks. Another hypothesis, from u/joinr, about why Pandas performs better on the smaller dataset: "I wonder if there's some default column size allocation that happens up front for the 2^6 case that helps prevent growth in pandas, and maybe the heuristic falls down a little as the dataset gets larger leading to more resizing."
|
Maybe one-factor groupbys are faster in pandas, while two-factor groupbys (as in https://h2oai.github.io/db-benchmark/) are slower?
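
One rough way to check this is to time both shapes of groupby on data that is already in memory, so file reading and writing stay out of the measurement. The sketch below is only an illustration: the synthetic data, the column names (`id1`, `id2`, `v1`), the row and group counts, and the helpers `make_frame` and `best_time` are all assumptions of mine, not the setup from the original post or from db-benchmark.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows, n_groups, seed=0):
    """Build a synthetic frame with two integer key columns and one value column."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "id1": rng.integers(0, n_groups, size=n_rows),
        "id2": rng.integers(0, n_groups, size=n_rows),
        "v1": rng.random(n_rows),
    })


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time (in seconds) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Data is generated in memory, so only the groupby-agg step is timed.
df = make_frame(n_rows=200_000, n_groups=100)

one_key = best_time(lambda: df.groupby("id1")["v1"].sum())
two_key = best_time(lambda: df.groupby(["id1", "id2"])["v1"].sum())

print(f"one-factor groupby: {one_key:.4f} s")
print(f"two-factor groupby: {two_key:.4f} s")
```

The timings will vary with machine, pandas version, and the number of distinct groups, so a single run like this can suggest a direction but not settle the question.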