| HN Mirror



	by marsupialtail_2 1258 days ago
	I think the blog post should point out very early that Onehouse is a Hudi company. There are some other recent benchmarks published in CIDR by Databricks that might paint a different picture: https://petereliaskraft.net/res/cidr_lakehouse.pdf

4 comments

anonymousDan 1258 days ago

Thanks for the link. I'd be interested to see a perf comparison using a popular processing engine other than Spark, given the obvious potential for Delta Lake to be better tuned for Spark workloads by default.


marsupialtail_2 1258 days ago

Me too. Trino, for one, would be a good start. Adding support for those data lakes is really hard, though, if you want good performance.


glogla 1258 days ago

In a Databricks-published benchmark, of course Delta is the fastest. I have also seen a company that uses Iceberg publish benchmarks showing that Iceberg is the fastest.

Vendor-published benchmarks are worthless.


MrPowers 1258 days ago

I think vendor-published benchmarks are fine if the dataset is open / accessible, the benchmark code is published, all software versions are disclosed, and the exact hardware is specified. I definitely wouldn't consider an audited TPC benchmark that's based on industry-standard datasets / queries worthless in the data space. Disclosure: I work for Databricks.


mostdataisnice 1258 days ago

FWIW, the lead authors on that linked paper are all grad students not employed at Databricks. That being said, they're advised by Databricks people.


sla99 1255 days ago

It looks like the benchmarks used the latest versions of Delta and Iceberg, but chose a version of Hudi that is over 6 months old. Hudi v0.12.2 is more advanced than v0.12.0, which the benchmark did not consider. As the Databricks CIDR paper states, and as mentioned in the Onehouse article, Hudi is optimized by default for UPSERTs rather than INSERTs, and changing that is a 1-line config change that is appropriate for a true apples-to-apples comparison. See both: https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-trans... and https://github.com/brooklyn-data/delta/pull/2


lukev 1258 days ago

Hah, I could tell because in the "feature matrix" the Hudi column was mostly green compared to the others. Immediately made me suspicious so I looked it up and sure enough, not exactly an unbiased source.

Feature matrices are extremely easy to game depending on your choice of rows.


cloud8bits 1257 days ago

I recently evaluated these frameworks and went through all the links they have for each of those rows when it was first published a few months ago. FWIW, I did not find any inaccuracies or wrong pointers.


seanhunter 1257 days ago

Thank you for that, I’m sure, entirely unbiased assessment from your newly created account.

Feature matrices are fundamentally flawed for the reason that the GP gave.


cloud8bits 1257 days ago

It's funny how, on the one hand, you argue for objectivity, but on the other, without a shred of evidence, you fundamentally distrust and write off the chance that someone could have created a Hacker News account today and commented here. Maybe now I am getting trained in the HN ways.
