| HN Mirror


by sakras 1406 days ago

> datafusion

> blazingly fast

I’m going to need to see a citation for that. Last I checked, it was being beaten by Apache Spark in non-memory-constrained scenarios [0]. This may be “blazingly fast” compared to Pandas or something, but it’s still leaving a TON of performance on the table. There’s a reason why Databricks found it necessary to redirect their Spark backend to a custom native query engine [1].

[0] https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/

[1] https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf


houqp 1406 days ago

DataFusion outperforms Spark by a large margin. In my experience it is on par with Photon; see the benchmarks at https://github.com/blaze-init/blaze.

sakras 1406 days ago

Ah nice, thank you for sharing that. I hadn’t seen it before, and congrats on beating out Spark that hard, I hope it continues to improve!

As an aside, maybe it would make sense to publish a new blog post somewhere so that the top hit on Google for “DataFusion benchmark” isn’t that post I linked.

houqp 1406 days ago

Haha, yeah, we should definitely put a little more effort into SEO :) Everyone is so focused on the hard-core engineering right now. I think Matthew from the community is actually working on a new comprehensive benchmark for us, which I hope will be published soon.

andygrove 1405 days ago

I will update these old pages on my blog and redirect them!
