Hacker News new | ask | show | jobs
Show HN: Querying JSON documents using SQL-like language in Scala (github.com)
70 points by mmalek06 2310 days ago
9 comments

As people always do in such threads, I’d too like to chime in with alternatives I’ve been a co-author of, as this is an area I’m very interested in.

The first one, which is definitely close in spirit to this project is OctoSQL[1]. We too are trying to query all the things with SQL, including JSON, though we extended the idea to supporting more data sources and file formats, with the ability to mix and match in joins and subqueries. Curious what direction this project will take though, as we’re now heading for streaming sources (like Kafka) in pure SQL!

A totally different one is jql[2] where I’ve been exploring more of a lispy continuation passing style based approach to a JSON query language, as I didn’t really feel like the current ones are ergonomic.

If you like this, make sure to check them out too! (And all the others people will be posting, as I think they all are fascinating! This is a big area with lots of space for innovation left.)

[1]: https://github.com/cube2222/octosql

[2]: https://github.com/cube2222/jql

OctoSQL looks very nice and I’ll check it out soon. Do you have Scala bindings for it?
No, though people have contributed R and Python bindings already. So one may hope :) though Go is fairly unfriendly to binding from other languages (because of the runtime overhead).

We ourselves mainly work on the command line interface and concentrate on it.

Ah, I get it - it’s an uber ORM!
Nah, really just a command line sql query tool. Though could probably be used as an uber orm.
This is cool, thanks for sharing. Octosql reminds me of Prestosql but as a CLI.
Indeed, presto has been one of our main inspirations!
Have you explored supporting Parquet?
There’s a WIP pull request for it, so it should be here soon.
I've been experimenting with the combo of the "jq" JSON query language and SQL via a custom SQLite function. I bundled the result as a plugin for my Datasette project: https://github.com/simonw/datasette-jq

Here's an example query: https://datasette-jq-demo.datasette.io/demo?sql=select%0D%0A...

Apache Drill [1] is alternative, using ANSI SQL. Beside json, it support JDBC, csv, parquet, file system traversal, HBase, Mongo, S3... I believe it also support joins across different sources, although I haven't tried that.

EDIT: to add context to the story - Apache Drill is running on JVM and can be embedded, so it can be run from Scala code as well.

[1] https://drill.apache.org/docs/querying-json-files/

That looks nice. Can it be used in a program? I mean, not from cli, but inside code?
I recently published a github project with the purpose mentioned in the post's title. Obviously if you just want to parse JSON, there are much better Scala libraries, like Circe. However in my case, I had to filter responses from a very dynamic api that I couldn't change in any way (for example to give me prefiltered results). My first choice was to use JsonXPath Java lib, but I didn't like that XPath-like notation, so I figured that some people may benefit from using what most of us already know - SQL - to create them queries. In my project, it was supposed to be configurable - the architect wanted to enable the not so technical business bunch to query json documents almost without programmers help and the easiest way to do it was to put their queries into the database and them allow them to match a query against an api. I left the project some time ago, but the idea I had back them kept me awake at night and here it is. Hopefully not entirely very bad ;) If you like it, I could really appreciate some help with development of the rest of some even cooler functionalities.
IMHO nothing beats jq[1]. There's Pyjq[2] for Python.

[1] https://stedolan.github.io/jq/ [2] https://github.com/doloopwhile/pyjq

I personally find jq to be extremely complicated once you move beyond simple queries.
After having used jq for some heavy data processing (building an entire Markdown document from a huge slab of JSON with no external tools), I am inclined to agree.

However, I think this may be indicative of the task you are trying to complete rather than the tool itself. At a certain level of complexity your resulting code will just look bad regardless of which language you use. One thing jq does to make this easier is providing the pipe operator, so you can break a complex task into stages: A | B | (C .... | C.5) | D | E...

You could try my Xidel, based on XPath and JSONiq ( http://www.videlibri.de/xidel.html ).

It is more verbose for simple queries, but with named variables and functions, larger queries do not become complicated

Xidel is my immediate pick for querying HTML. I use it in several scripts. Thanks!

I still prefer jq for JSON processing though. I'm hooked to pipelines. EDIT: The great power of jq is in transforming documents, not querying.

You should try out GROQ (https://groq.dev) then!
You can use a separate file to define your jq program, and even define functions to make it more readable.
I figured, I'll give an update to anyone who's still interested. I managed to fix a few issues and and some major functionalities (eg: * symbol to select all) over the last days. The next thing I'll work on will be type queries, so that the user will be able to use queries as strings but also as a chain of method calls with a type parameter representing the model returned from the query. Something like: from(jsonObject).where(list of filter functions).select[CaseClassType]() . Stay tuned!
SQL isn't the good part about relational databases, its the relational model that is the good part. The ability to query your data from multiple different angles.

In saying that, I guess its still better than having yet another custom query language a la mongo.

I don't agree :). SQL nicely abstracts what would have been more difficult or technical to do otherwise. Yep, it was create in order to enable non strictly technical people to query their data, but does it make it bad? I don't think so. Also authors of great many NoSql databases (and other technologies, ever heard about the widely loved LINQ technology in C#?) think, giving their users the possibility to query data using what they already know. Also: can you honestly, from the bottom of your heart, say that you'd rather filter json documents using some form of XPath? Besides: I'm not trying to criticize relational model, as well as I'm not saying that JSON is the way to go when storing data :D I just wanted to create a tool for people that are trying to implement a very specific business case I described and, due to lack of tools, are having some issues with it.
> Also: can you honestly, from the bottom of your heart, say that you'd rather filter json documents using some form of XPath?

Come one read the second line of my reply before strawmanning.

Ok, maybe I got too defensive and assigned to much weight to the first part of your first sentence. No strawmanning intended :)
Also: please look at my first comment, from 13 hours ago - it should shed some light on the "why" part.
This stringified approach loses type safety and IDE assistance for what benefit? Why do we want it to look like SQL? A Slick/Quill-like approach seems much safer.
Please read my description (found somewhere in this discussion). That being said: there are scenarios when being able to save a query into a database or config file, then load and run it is useful. In general I agree though and that’s why one of the enhabcements I want to implement is to create a method based syntax with projecting selected fields to a case class.
As for the „why do we want it to look like sql” - I like the syntax and I think it’s much more readable than any other language used to process data. But that’s just a matter of taste :)
Pretty cool. You can do this in apache spark already but this looks a lot quicker to get going with.
Didn’t know about it. Is it possible to use only that subset of apache spark lib?
I imagine you could reuse Catalyst to generate queries against JSON derived Datasets with some work. Starting point for some reading maybe? https://github.com/apache/spark/blob/e2d3983de78f5c80fac066b...
If you need to parse semi-structured documents using Spark, Quenya-DSL is much better than SQL, specially if you want to flatten the data.

https://github.com/music-of-the-ainur/quenya-dsl

Yes, you can run apache spark as a single node really easily. Then once its running you can fire up the Scala or Python shell that it comes with. After that, it is just a matter of issuing the statements to setup the data set then issues queries against it.
Not many people know this but Databricks offers https://community.cloud.databricks.com/ for free which allows you to run simple spark notebooks.

Disclosure: works for Databricks but not on spark

Still, it seems like a lot of work. My case was that I just wanted to feed JSON document into some mechanism that will give me back what I want. I wouldn't like to embed whole Spark framework into my app. I will read about it though :)