| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cube2222 1366 days ago

SPyQL is really cool and its design is very smart, with it being able to leverage normal Python functions!

As far as similar tools go, if you're interested, I recommend taking a look at DataFusion[0], dsq[1], and OctoSQL[2].

DataFusion is a very (very very) fast command-line SQL engine but with limited support for data formats.

dsq is based on SQLite which means it has to load data into SQLite first, but then gives you the whole breath of SQLite, it also supports many data formats, but is slower at the same time.

OctoSQL is faster, extensible through plugins, and supports incremental query execution, so you can i.e. calculate and display a running group by + count while tailing a log file. It also supports normal databases, not just file formats, so you can i.e. join with a Postgres table.

[0]: https://github.com/apache/arrow-datafusion

[1]: https://github.com/multiprocessio/dsq

[2]: https://github.com/cube2222/octosql

Disclaimer: Author of OctoSQL

3 comments

mytherin 1366 days ago

You may also want to have a look at the DuckDB command line client [1]. The shell itself is based on the SQLite client, and DuckDB can be used to natively query CSV and Parquet files. Using extensions, DuckDB can also query SQLite and Postgres databases, and query files over HTTPS and S3.

The command line client also has some nifty features like syntax highlighting, and context-aware auto-complete that is coming in the next release.

[1] https://duckdb.org/docs/installation/

Disclaimer: working on DuckDB

link

Beltalowda 1366 days ago

One thing I really miss in the DuckDB is that it doesn't insert the entire query if you use C-p/Up arrow for multi-line queries (it just cycles through the lines of the query). This behaviour is inherited from SQLite, and it trips me up every time even after years of SQLite CLI usage.

link

dmoura 1366 days ago

DuckDB is great! I love what you guys are building. The main gap for me is native support of JSON (lines), like you have for CSV and Parquet.

link

zX41ZdbW 1366 days ago

Here is a comparison of various SQL engines in command line: https://github.com/dcmoura/spyql/blob/master/notebooks/json_...

link

mildbyte 1366 days ago

And if you're looking for a similar experience (very fast analytical SQL queries) but over HTTP, for example, to power a public dashboard or a visualization, you can try ROAPI [0] or Seafowl [1], also built on top of DataFusion (disclaimer: working on Seafowl):

[0]: https://github.com/roapi/roapi

[1]: https://github.com/splitgraph/seafowl

link

cube2222 1366 days ago

That's what I thought about ROAPI as well, until I benchmarked it, and it ended up being very slow[0].

[0]: https://news.ycombinator.com/item?id=32970495

link

mildbyte 1366 days ago

It could be the NDJSON parser (DF source: [0]) or could be a variety of other factors. Looking at the ROAPI release archive [1], it doesn't ship with the definitive `columnq` binary from your comment (EDIT: it does, I was looking in the wrong place! https://github.com/roapi/roapi/releases/tag/columnq-cli-v0.3...), so it could also have something to do with compilation-time flags.

FWIW, we use the Parquet format with DataFusion and get very good speeds similar to DuckDB [2], e.g. 1.5s to run a more complex aggregation query `SELECT date_trunc('month', tpep_pickup_datetime) AS month, COUNT(*) AS total_trips, SUM(total_amount) FROM tripdata GROUP BY 1 ORDER BY 1 ASC)` on a 55M row subset of NY Taxi trip data.

[0]: https://github.com/apache/arrow-datafusion/blob/master/dataf...

[1]: https://github.com/roapi/roapi/releases/tag/roapi-v0.8.0

[2]: https://observablehq.com/@seafowl/benchmarks

link

cube2222 1366 days ago

Yes, DataFusion itself is definitely fast, no denying that.

link