| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by knowtheory 4808 days ago

I am mystified as to when this tool would be useful (can anybody else think of a practical use case?).

You can run SQL queries on individual JSON files... but you have to have PostgreSQL installed already.

And you can't run it against multiple JSON files, only one, with a single JSON object at the root.

5 comments

ozgune 4808 days ago

(Ozgun at Citus Data)

You're right that you need to have Postgres installed. For running SQL over multiple JSON files, we wanted to keep the blog post short and noted several different ways to go over multiple files in our GitHub Readme.

1. You can create a partitioned PostgreSQL table, and declare one child table per JSON file. You can also declare constraints on the child table to filter out irrelevant files to the query.

2. You can create a distributed foreign table on CitusDB, even if you'd like to run on one node. In this case, we'll collect statistics automatically for you, and filter out irrelevant files.

3. If neither of these fit, you could change the source code to scan over one directory instead of a file. We didn't go down this path to be compatible with file_fdw's semantics.

link

tracker1 4808 days ago

Just an idea... I had a script doing record dumps at a prior job... (there were about 30 joins for getting a single record (denormalized) into json), so having each one in a separate .json.gz file was a useful backup solution... it was about zero cost when doing mirroring to another system for query/display purposes.

In the file system, to prevent too many records in a single directory it was split up per 1000 records... base/00001000/(1000-1999).json.gz ... this was mainly for being able to navigate this structure via a gui. I would suggest if your system can't do "basepath//*.json.gz" that you consider it.

link

neilk 4808 days ago

Okay, so if I understand correctly, the benefit here is that I can run some kinds of SQL queries on a large collection of heterogenous JSON documents. Is that right? You're still speaking Postgres-internals-geek, not I-want-to-get-something-done geek.

That is:

- you can take heterogenous JSON documents, and then make them appear to the Postgres world as separate tables. (Which they have to be, because they're not uniform)

- you can then take some kinds of SQL queries and distribute them among these tables, as if they were one big SQL table. If some pseudo-tables have missing or extra columns, no problem.

link

ozgune 4808 days ago

Thanks for the comment. Yes, that's exactly the benefit.

You could in fact have heterogeneous JSON documents within one file, and json_fdw could still handle them. If a field you declared in the table schema doesn't appear in a JSON document, json_fdw would just consider that field to be Null.

The issue is, json_fdw is built such that one table can only associated with one JSON file, and users typically have many JSON files lying around. For example, you'd have one JSON file for each hour's worth of website error logs.

If that's the case, you could create a distributed table. That way, you could run the query one distributed table and have the query run on all your JSON files instead of just one.

That said, it's probably best to associate one type of JSON logs (say your website error logs) with one distributed table. That way, your queries go over your website error logs or your mobile application logs, but not both at the same time.

link

knowtheory 4808 days ago

It should be said I think that it's super cool that you guys are putting Postgres to broader use. More stuff can and should be done with it, i'm just not sure what to do with this particular tool :)

link

chaz 4808 days ago

Keep in mind that this blog post was written by a company builds a product that sits on top of PostgreSQL. So for the intended audience, it's completely relevant.

A good use case is exploring external APIs, many of which come back as JSON. This might be a faster awy to figure out what you're dealing with: how large, how many, unique values, mins, maxes, etc. It could be especially useful if you need to match it to existing keys or metadata that you already have in the same database. To solve this same problem before, I've written lots of one-off scripts that convert the JSON into a CSV so that I can view it in Excel. This may or may not help me there, but good to see another tool.

link

malkia 4808 days ago

It could be. In game development for example, often a game asset is nothing more than a bunch of key/values, and they are slapped either into individual file per asset, or grouping several such asset into one file. The text format is not important (now), but I've seen "ini" style ones, or using lua, xml, or nowadays json.

It's important for these files to be kept this way, because they can be submitted to a SCM (perforce, svn, etc.), and this is the typical workflow. An SQL/noSQL db won't cut it here, since there is no revision number, or even if there is, it introduces a problem of syncing what's in the DB vs. what's in SCM.

So such tool would become interresting, if after syncing your latest data of the depot (repo), you run a tool that imports them into the DB to be used by other tools. And when you are done, export is done to the same files, and submitted.

Some automation could be done as well.

And having such tools would be beneficial, since you now don't have to write another tool to be ran manually, or being resident as service to get the latest file changes and update the DB, but could be make as an PostgreSQL worker that imports on the fly changes and updates the DB at the same time.

E.g. - if you use Perforce: - Someone syncs - The DB automatically picks up the changes, informed by the file system, and updates the DB.

The more of such things the better.

link

danielrhodes 4808 days ago

It's a good way to do adhoc querying of JSON log files, which among other things, can be useful for debugging.

link

masklinn 4808 days ago

Can also be useful to mix internal SQL data from JSON from an external source without having to manually sync/import it no?

link

seppo0010 4808 days ago

The only valid use case I can think of is for preloaded data for automatized tests.

link

knowtheory 4808 days ago

Oh, that might be useful. Although, most test frameworks at this point have mechanisms for pretty easily load data, but it'd be interesting to figure out if any corners could be conveniently cut with this.

link