by marsupialtail_2 1258 days ago
I think the blog post should point out very early that Onehouse is a Hudi company. There are some other recent benchmarks published in CIDR by Databricks that might paint a different picture: https://petereliaskraft.net/res/cidr_lakehouse.pdf

Thanks for the link. I'd be interested to see a perf comparison using a popular processing engine other than Spark, given the obvious potential for Delta Lake to be better tuned for Spark workloads by default.
Me too. Trino, for one, would be a good start. Adding support for those data lakes is really hard, though, if you want good performance.
In a Databricks-published benchmark, of course Delta is the fastest. I have also seen companies that use Iceberg publish benchmarks showing that Iceberg is the fastest.

Vendor-published benchmarks are worthless.

I think vendor-published benchmarks are fine if the dataset is open / accessible, the benchmark code is published, all software versions are disclosed, and the exact hardware is specified. I definitely wouldn't consider an audited TPC benchmark that's based on industry-standard datasets / queries worthless in the data space. Disclosure: I work for Databricks.
FWIW, the lead authors on that linked paper are all grad students not employed at Databricks. That being said, they're advised by Databricks people.
It looks like the benchmarks used the latest versions of Delta and Iceberg, but chose a version of Hudi that is over 6 months old. Hudi v0.12.2, which the benchmark did not consider, is more advanced than v0.12.0. As the Databricks CIDR paper states, and as mentioned in the Onehouse article, Hudi by default is optimized for UPSERTs rather than INSERTs, and switching that is a 1-line config change that is appropriate for a true apples-to-apples comparison. See both: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-trans... and https://github.com/brooklyn-data/delta/pull/2
Hah, I could tell because in the "feature matrix" the Hudi column was mostly green compared to the others. Immediately made me suspicious so I looked it up and sure enough, not exactly an unbiased source.

Feature matrices are extremely easy to game depending on your choice of rows.

I recently evaluated these frameworks and went through all the links they have for each of those rows when the matrix was first published a few months ago. FWIW, I did not find any inaccuracies or wrong pointers.
Thank you for that, I’m sure, entirely unbiased assessment from your newly created account.

Feature matrices are fundamentally flawed for the reason that the GP gave.

It's funny how, on one hand, you argue for objectivity, but on the other you fundamentally distrust and write off the chance that someone could have created a Hacker News account today and commented here, without a shred of evidence. Maybe now I am getting trained in the HN ways.