

by akhong 2072 days ago

For this particular data and on my machine, that was certainly the case! I've been shown other benchmark results (such as this one: https://h2oai.github.io/db-benchmark/) that demonstrate otherwise. I'm not really sure what to make of it - maybe try more cases?

One possible explanation I can think of is that Pandas' support for Parquet is pretty good compared to that of data.table and Julia. I've been asked to split the read/write part and the groupby-agg part for a more complete picture. I'll be sure to work on that in the coming weeks.
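
A minimal sketch of what timing the two parts separately could look like in pandas. The helper name `time_phases`, the column names, and the `sum`/`mean` aggregations are hypothetical stand-ins, not taken from the original benchmark; the loader is passed in as a callable so the read step can be swapped out.

```python
import time

import pandas as pd


def time_phases(read_fn, group_col, value_col):
    """Time the load step and the groupby-agg step separately.

    read_fn is any zero-argument callable that returns a DataFrame.
    """
    t0 = time.perf_counter()
    df = read_fn()
    t1 = time.perf_counter()
    # Hypothetical aggregation; the original benchmark may use different ones.
    result = df.groupby(group_col)[value_col].agg(["sum", "mean"])
    t2 = time.perf_counter()
    timings = {"read_s": t1 - t0, "groupby_s": t2 - t1}
    return timings, result


if __name__ == "__main__":
    # Stand-in data so the sketch runs without any files on disk.
    demo = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
    timings, result = time_phases(lambda: demo.copy(), "key", "value")
    print(timings)
    print(result)
```

For an actual Parquet run, `read_fn` could be something like `lambda: pd.read_parquet(path)` with your own `path`, which needs a Parquet engine such as pyarrow installed.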

Another hypothesis, from u/joinr, about why Pandas performs better on the smaller dataset:

"I wonder if there's some default column size allocation that happens up front for the 2^6 case that helps prevent growth in pandas, and maybe the heuristic falls down a little as the dataset gets larger, leading to more resizing."


topper-123 2071 days ago

I can't imagine that `.to_parquet` takes any time at all, relative to `groupby.agg`. But yeah, it would be nice to get separate benchmarks for the two parts of your benchmark.

Maybe one-factor groupbys are faster in pandas, while two-factor groupbys (as in https://h2oai.github.io/db-benchmark/) are slower?


akhong 2071 days ago

Yes, I agree with you in Pandas' case. However, for other libraries, a good chunk of the run time comes from reading the Parquet files and concatenating the partial datasets. Pandas and Spark are particularly good at reading a directory of 12 Parquet files with no noticeable performance penalty.
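
To illustrate the read-then-concatenate step, here is a rough sketch of loading each partial file and stitching the pieces together by hand. The function name `read_partitioned` and the `*.parquet` file pattern are assumptions for illustration, not how the benchmark or any of the libraries actually do it.

```python
from pathlib import Path

import pandas as pd


def read_partitioned(directory, reader=pd.read_parquet, pattern="*.parquet"):
    """Read every matching file in a directory and concatenate the parts."""
    files = sorted(Path(directory).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {directory}")
    # One DataFrame per partial file, then a single concat at the end.
    parts = [reader(f) for f in files]
    return pd.concat(parts, ignore_index=True)
```

With the pyarrow engine, `pd.read_parquet` can also be pointed at the directory itself, which skips the explicit loop and concat; timing both paths would show how much of the run time the manual version accounts for.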
