by cube2222
1366 days ago
This looks really cool! Using DataFusion underneath probably means it's blazingly fast. If you like this, I recommend taking a look at OctoSQL[0], which I'm the author of. It's plenty fast, and it's easier to add new data sources to it as external plugins. It can also handle endless streams of data natively, so you can do running groupings on, e.g., tailed JSON logs. Additionally, it's able to push predicates down to the underlying database, so if you're selecting 10 rows from a 1-billion-row table, it'll fetch just those 10 rows instead of fetching them all and filtering in memory.
[0]: https://github.com/cube2222/octosql
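To make the pushdown point concrete, here's a minimal sketch of the general idea. It is not OctoSQL code: it uses Python's built-in sqlite3 as a stand-in for the underlying database, and the table `events` and the helpers `make_table`, `fetch_without_pushdown`, and `fetch_with_pushdown` are hypothetical names made up for illustration.

```python
import sqlite3


def make_table(n_rows):
    # Hypothetical in-memory table standing in for the "database below".
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO events (id, user_id) VALUES (?, ?)",
        ((i, i % 1000) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def fetch_without_pushdown(conn, wanted_ids):
    # No pushdown: pull every row out of the database, then filter in memory.
    wanted = set(wanted_ids)
    rows = conn.execute("SELECT id, user_id FROM events").fetchall()
    matching = sorted(r for r in rows if r[0] in wanted)
    return matching, len(rows)  # second value: rows transferred


def fetch_with_pushdown(conn, wanted_ids):
    # Pushdown: the predicate travels with the query, so the database
    # only returns the matching rows.
    ids = list(wanted_ids)
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, user_id FROM events WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()
    return rows, len(rows)  # second value: rows transferred
```

Both functions return the same matching rows. The difference is the second value: the number of rows that had to leave the database, which is the whole table in the first case and only the matches in the second.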
> blazingly fast
I’m going to need to see a citation for that. Last I checked, it was being beaten by Apache Spark in non-memory-constrained scenarios [0]. This may be “blazingly fast” compared to Pandas or something, but it’s still leaving a TON of performance on the table. There’s a reason why Databricks found it necessary to redirect their Spark backend to a custom native query engine [1].
[0] https://andygrove.io/2019/04/datafusion-0.13.0-benchmarks/
[1] https://cs.stanford.edu/~matei/papers/2022/sigmod_photon.pdf