Show HN: Querying JSON documents using SQL-like language in Scala | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	Show HN: Querying JSON documents using SQL-like language in Scala (github.com)
	70 points by mmalek06 2310 days ago

9 comments

cube2222 2310 days ago

As people always do in such threads, I’d too like to chime in with alternatives I’ve been a co-author of, as this is an area I’m very interested in.

The first one, which is definitely close in spirit to this project is OctoSQL[1]. We too are trying to query all the things with SQL, including JSON, though we extended the idea to supporting more data sources and file formats, with the ability to mix and match in joins and subqueries. Curious what direction this project will take though, as we’re now heading for streaming sources (like Kafka) in pure SQL!

A totally different one is jql[2] where I’ve been exploring more of a lispy continuation passing style based approach to a JSON query language, as I didn’t really feel like the current ones are ergonomic.

If you like this, make sure to check them out too! (And all the others people will be posting, as I think they all are fascinating! This is a big area with lots of space for innovation left.)

[1]: https://github.com/cube2222/octosql

[2]: https://github.com/cube2222/jql

mmalek06 2310 days ago

OctoSQL looks very nice and I’ll check it out soon. Do you have Scala bindings for it?

cube2222 2310 days ago

No, though people have contributed R and Python bindings already. So one may hope :) though Go is fairly unfriendly to binding from other languages (because of the runtime overhead).

We ourselves mainly work on the command line interface and concentrate on it.

mmalek06 2310 days ago

Ah, I get it - it’s an uber ORM!

cube2222 2310 days ago

Nah, really just a command line sql query tool. Though could probably be used as an uber orm.

ecnahc515 2310 days ago

This is cool, thanks for sharing. Octosql reminds me of Prestosql but as a CLI.

cube2222 2310 days ago

Indeed, presto has been one of our main inspirations!

eatonphil 2310 days ago

Have you explored supporting Parquet?

cube2222 2310 days ago

There’s a WIP pull request for it, so it should be here soon.

simonw 2310 days ago

I've been experimenting with the combo of the "jq" JSON query language and SQL via a custom SQLite function. I bundled the result as a plugin for my Datasette project: https://github.com/simonw/datasette-jq

Here's an example query: https://datasette-jq-demo.datasette.io/demo?sql=select%0D%0A...

dig1 2310 days ago

Apache Drill [1] is alternative, using ANSI SQL. Beside json, it support JDBC, csv, parquet, file system traversal, HBase, Mongo, S3... I believe it also support joins across different sources, although I haven't tried that.

EDIT: to add context to the story - Apache Drill is running on JVM and can be embedded, so it can be run from Scala code as well.

[1] https://drill.apache.org/docs/querying-json-files/

mmalek06 2310 days ago

That looks nice. Can it be used in a program? I mean, not from cli, but inside code?

mmalek06 2310 days ago

I recently published a github project with the purpose mentioned in the post's title. Obviously if you just want to parse JSON, there are much better Scala libraries, like Circe. However in my case, I had to filter responses from a very dynamic api that I couldn't change in any way (for example to give me prefiltered results). My first choice was to use JsonXPath Java lib, but I didn't like that XPath-like notation, so I figured that some people may benefit from using what most of us already know - SQL - to create them queries. In my project, it was supposed to be configurable - the architect wanted to enable the not so technical business bunch to query json documents almost without programmers help and the easiest way to do it was to put their queries into the database and them allow them to match a query against an api. I left the project some time ago, but the idea I had back them kept me awake at night and here it is. Hopefully not entirely very bad ;) If you like it, I could really appreciate some help with development of the rest of some even cooler functionalities.

inshadows 2310 days ago

IMHO nothing beats jq[1]. There's Pyjq[2] for Python.

[1] https://stedolan.github.io/jq/ [2] https://github.com/doloopwhile/pyjq

bachmeier 2310 days ago

I personally find jq to be extremely complicated once you move beyond simple queries.

nexuist 2310 days ago

After having used jq for some heavy data processing (building an entire Markdown document from a huge slab of JSON with no external tools), I am inclined to agree.

However, I think this may be indicative of the task you are trying to complete rather than the tool itself. At a certain level of complexity your resulting code will just look bad regardless of which language you use. One thing jq does to make this easier is providing the pipe operator, so you can break a complex task into stages: A | B | (C .... | C.5) | D | E...

benibela 2310 days ago

You could try my Xidel, based on XPath and JSONiq ( http://www.videlibri.de/xidel.html ).

It is more verbose for simple queries, but with named variables and functions, larger queries do not become complicated

inshadows 2310 days ago

Xidel is my immediate pick for querying HTML. I use it in several scripts. Thanks!

I still prefer jq for JSON processing though. I'm hooked to pipelines. EDIT: The great power of jq is in transforming documents, not querying.

kmelve 2310 days ago

You should try out GROQ (https://groq.dev) then!

ecnahc515 2310 days ago

You can use a separate file to define your jq program, and even define functions to make it more readable.

mmalek06 2307 days ago

I figured, I'll give an update to anyone who's still interested. I managed to fix a few issues and and some major functionalities (eg: * symbol to select all) over the last days. The next thing I'll work on will be type queries, so that the user will be able to use queries as strings but also as a chain of method calls with a type parameter representing the model returned from the query. Something like: from(jsonObject).where(list of filter functions).select[CaseClassType]() . Stay tuned!

collyw 2310 days ago

SQL isn't the good part about relational databases, its the relational model that is the good part. The ability to query your data from multiple different angles.

In saying that, I guess its still better than having yet another custom query language a la mongo.

mmalek06 2310 days ago

I don't agree :). SQL nicely abstracts what would have been more difficult or technical to do otherwise. Yep, it was create in order to enable non strictly technical people to query their data, but does it make it bad? I don't think so. Also authors of great many NoSql databases (and other technologies, ever heard about the widely loved LINQ technology in C#?) think, giving their users the possibility to query data using what they already know. Also: can you honestly, from the bottom of your heart, say that you'd rather filter json documents using some form of XPath? Besides: I'm not trying to criticize relational model, as well as I'm not saying that JSON is the way to go when storing data :D I just wanted to create a tool for people that are trying to implement a very specific business case I described and, due to lack of tools, are having some issues with it.

collyw 2310 days ago

> Also: can you honestly, from the bottom of your heart, say that you'd rather filter json documents using some form of XPath?

Come one read the second line of my reply before strawmanning.

mmalek06 2310 days ago

Ok, maybe I got too defensive and assigned to much weight to the first part of your first sentence. No strawmanning intended :)

mmalek06 2310 days ago

Also: please look at my first comment, from 13 hours ago - it should shed some light on the "why" part.

danellis 2310 days ago

This stringified approach loses type safety and IDE assistance for what benefit? Why do we want it to look like SQL? A Slick/Quill-like approach seems much safer.

mmalek06 2308 days ago

Please read my description (found somewhere in this discussion). That being said: there are scenarios when being able to save a query into a database or config file, then load and run it is useful. In general I agree though and that’s why one of the enhabcements I want to implement is to create a method based syntax with projecting selected fields to a case class.

mmalek06 2308 days ago

As for the „why do we want it to look like sql” - I like the syntax and I think it’s much more readable than any other language used to process data. But that’s just a matter of taste :)

anon176 2310 days ago

Pretty cool. You can do this in apache spark already but this looks a lot quicker to get going with.

mmalek06 2310 days ago

Didn’t know about it. Is it possible to use only that subset of apache spark lib?

EdwardDiego 2310 days ago

I imagine you could reuse Catalyst to generate queries against JSON derived Datasets with some work. Starting point for some reading maybe? https://github.com/apache/spark/blob/e2d3983de78f5c80fac066b...

yami-sama 2310 days ago

If you need to parse semi-structured documents using Spark, Quenya-DSL is much better than SQL, specially if you want to flatten the data.

https://github.com/music-of-the-ainur/quenya-dsl

anon176 2310 days ago

Yes, you can run apache spark as a single node really easily. Then once its running you can fire up the Scala or Python shell that it comes with. After that, it is just a matter of issuing the statements to setup the data set then issues queries against it.

khc 2310 days ago

Not many people know this but Databricks offers https://community.cloud.databricks.com/ for free which allows you to run simple spark notebooks.

Disclosure: works for Databricks but not on spark

mmalek06 2310 days ago

Still, it seems like a lot of work. My case was that I just wanted to feed JSON document into some mechanism that will give me back what I want. I wouldn't like to embed whole Spark framework into my app. I will read about it though :)