I’m going to need to see a citation for that. Last I checked, it was being beaten by Apache Spark in non-memory constrained scenarios [0]. This may be “blazingly fast” compared to Pandas or something, but it’s still leaving a TON of room on the table performance-wise. There’s a reason why Databricks found it necessary to redirect their Spark backend to a custom native query engine [1].
Datafusion out performs spark by a large margin. It is on par with photon based on my experiences, see benchmarks at https://github.com/blaze-init/blaze.
Ah nice, thank you for sharing that. I hadn’t seen it before, and congrats on beating out Spark that hard, I hope it continues to improve!
As an aside, maybe it would make sense to publish a new blog post somewhere so that the top hit on Google for “DataFusion benchmark” isn’t that post I linked.
Haha, yeah, we should definitely put a little bit more efforts into SEO :) Everyone is so focused on the hard-core engineering at the moment. I think Matthew from the community is actually working on a new comprehensive benchmark for us at the moment, which I hope will be published soon.