by cube2222 1366 days ago
This looks really cool! Using datafusion underneath, in particular, means it's probably blazingly fast.

If you like this, I recommend taking a look at OctoSQL[0], which I'm the author of.

It's plenty fast, and it's easier to add new data sources to it as external plugins.

It can also handle endless streams of data natively, so you can do running groupings on e.g. tailed JSON logs.
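
To make "running groupings" concrete, here is a minimal Python sketch of the idea. It is not how OctoSQL implements it; the function name `running_counts` and the `level` field are made up for illustration. The point is that each incoming record updates the per-group state and an updated result is emitted right away, so the input never has to end.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_counts(lines: Iterable[str], key: str) -> Iterator[Tuple[str, int]]:
    """Yield (group, count_so_far) after every record.

    `lines` can be any iterable of JSON lines, including an endless one
    such as a tailed log file, because nothing here waits for the end.
    """
    counts: Dict[str, int] = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Hypothetical convention: records without the key share one group.
        group = str(record.get(key, "<missing>"))
        counts[group] += 1
        yield group, counts[group]
```

Feeding it `sys.stdin` while piping in `tail -f` on a JSON log would print an updated count per `level` as each line arrives.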

Additionally, it's able to push down predicates to the database below, so if you're selecting 10 rows from a 1 billion row table, it'll just get those 10 rows instead of getting them all and filtering in memory.
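
A rough illustration of what predicate pushdown buys you, using Python's built-in `sqlite3` as a stand-in for "the database below". This is only a sketch of the general technique, not OctoSQL's code; the `events` table and the function names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, int]


def make_db(n_rows: int) -> sqlite3.Connection:
    """Build an in-memory table of (id, user_id) rows for the comparison."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO events (id, user_id) VALUES (?, ?)",
        ((i, i % 1000) for i in range(n_rows)),
    )
    return conn


def filter_in_memory(conn: sqlite3.Connection, user_id: int) -> Tuple[List[Row], int]:
    """No pushdown: fetch every row, then filter on the client side."""
    rows = conn.execute("SELECT id, user_id FROM events").fetchall()
    matching = [r for r in rows if r[1] == user_id]
    return matching, len(rows)  # second value: rows transferred


def filter_pushed_down(conn: sqlite3.Connection, user_id: int) -> Tuple[List[Row], int]:
    """Pushdown: the predicate travels to the database as a WHERE clause."""
    rows = conn.execute(
        "SELECT id, user_id FROM events WHERE user_id = ?", (user_id,)
    ).fetchall()
    return rows, len(rows)  # only matching rows are transferred
```

Both functions return the same rows, but the first one moves the whole table to the client while the second moves only the matches.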

[0]: https://github.com/cube2222/octosql

> datafusion

> blazingly fast

I’m going to need to see a citation for that. Last I checked, it was being beaten by Apache Spark in non-memory-constrained scenarios [0]. This may be “blazingly fast” compared to Pandas or something, but it’s still leaving a TON of performance on the table. There’s a reason why Databricks found it necessary to redirect their Spark backend to a custom native query engine [1].

[0] https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/

[1] https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf

DataFusion outperforms Spark by a large margin. In my experience it is on par with Photon; see the benchmarks at https://github.com/blaze-init/blaze.

Ah nice, thank you for sharing that. I hadn’t seen it before, and congrats on beating out Spark that hard, I hope it continues to improve!

As an aside, maybe it would make sense to publish a new blog post somewhere so that the top hit on Google for “DataFusion benchmark” isn’t that post I linked.

Haha, yeah, we should definitely put a little more effort into SEO :) Everyone is so focused on the hard-core engineering right now. I think Matthew from the community is actually working on a new comprehensive benchmark for us, which I hope will be published soon.

I will update these old pages on my blog and redirect them!

Ok, I have now actually benchmarked this roapi CLI on the Amazon Review Dataset and it's over 20x slower than OctoSQL.

A simple aggregate query

  time columnq sql --table books_10m.ndjson "SELECT AVG(overall) FROM books_10m"

takes 66 seconds.

The equivalent in OctoSQL takes less than 3 seconds.

I retract my statement about this project being blazingly fast, though I imagine it's just the JSON parser that requires optimization.