by ellimilial 1521 days ago
If it fits on a single machine: jq, flat files, and JSON Lines / Avro if the data is relatively flat. Switch to a tabular format if nesting is not required.

Postgres JSONB works, but it requires maintaining a heavy server process. So does Lucene/Elasticsearch.

I have been yearning for an embeddable store (along the lines of SQLite) with JSON support that both works and keeps the data compressed like JSONB. I know there were some attempts; I tried some of those, and they were mostly monstrosities.

2 comments

JSONB is incredibly awesome, and should be extracted from PG and made usable on its own.

For those who don't know, JSONB is a binary JSON encoding that is specifically optimized for data at rest and compression thereof.

The key feature in JSONB is that most internal pointers [from arrays and objects] to values are stored as lengths, with every 32nd pointer being an offset. This comes from the observation that offsets never repeat and are therefore difficult to compress with off-the-shelf compression algorithms, whereas length values are often the same and thus compress well. The cost is that iterating an array (say) requires 31 additions for every 32 elements to recover the offsets of those 31 elements' values.
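To make the lengths-vs-offsets trade-off concrete, here is a minimal Python sketch. It is not PG's actual on-disk layout: the `(is_offset, value)` tuples, the function names, and the choice to put the offset on entries 0, 32, 64, ... are assumptions for illustration only.

```python
from typing import List, Tuple

STRIDE = 32  # per the comment above: every 32nd entry holds an offset

Entry = Tuple[bool, int]  # (is_offset, value); hypothetical representation


def encode(lengths: List[int], stride: int = STRIDE) -> List[Entry]:
    """Store each value's length, except every `stride`-th entry,
    which stores the running end offset instead."""
    entries: List[Entry] = []
    end = 0
    for i, n in enumerate(lengths):
        end += n
        if i % stride == 0:
            entries.append((True, end))   # absolute end offset
        else:
            entries.append((False, n))    # plain length
    return entries


def start_offsets(entries: List[Entry]) -> List[int]:
    """Recover each value's start offset by summing lengths,
    resynchronising whenever a stored offset is encountered."""
    starts: List[int] = []
    end = 0
    for is_offset, value in entries:
        starts.append(end)                # a value starts where the previous one ended
        end = value if is_offset else end + value
    return starts


# 40 values of 5 bytes each: an offsets-only scheme would store 40 distinct
# numbers, while the mixed scheme stores only a couple of distinct numbers.
lengths = [5] * 40
entries = encode(lengths)
distinct_values = {value for _, value in entries}
```

The repeated small lengths are what a general-purpose compressor can exploit; the periodic offsets bound how many additions a lookup needs.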

The story of how they came to this optimization for compression is fascinating. IIRC they implemented an offsets-only JSONB and were very happy with it until they discovered that that form of JSONB did not compress anywhere near as well as expected, and since PG was close to shipping, a feverish hunt for the cause ensued that culminated in the fix of mostly-using-lengths-instead-of-offsets.

I really wish it preserved key order ... it is quite annoying losing this at the storage layer ...

So preserving key order is... nice for some things, but what's nice about JSONB is that it's optimized for reading and querying.

I am curious to know an example where key order would matter.

Had this exact issue. The UBL [1] standard has a primary XML representation where the order of elements is enforced in the schema. It also has a JSON representation, so when going from JSON to XML the exact order is needed to obtain valid XML.

[1] https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=...
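
A rough Python illustration of why order matters in that JSON-to-XML case. The element names and the flat mapping are made up for the example and are not claimed to produce valid UBL; the point is only that the child elements come out in whatever order the JSON keys arrive in, so a store that reorders keys changes the XML.

```python
import json
import xml.etree.ElementTree as ET


def object_to_xml(tag: str, obj: dict) -> ET.Element:
    """Emit one child element per key, in the dict's iteration order."""
    root = ET.Element(tag)
    for key, value in obj.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return root


# json.loads keeps keys in document order (dicts are insertion-ordered in Python 3.7+).
doc = json.loads('{"ID": "42", "IssueDate": "2020-01-01", "Note": "hello"}')
in_order = ET.tostring(object_to_xml("Invoice", doc), encoding="unicode")

# Simulate a storage layer that does not preserve key order (here: reverse sorted).
reordered = dict(sorted(doc.items(), reverse=True))
out_of_order = ET.tostring(object_to_xml("Invoice", reordered), encoding="unicode")
```

A schema that enforces element sequence would accept at most one of those two outputs, even though both came from the "same" JSON object.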

I cannot describe how much I love jq. Best new (to me) tool I discovered in all of 2020.

Once you get the hang of it the syntax feels extremely powerful. The only other thing it reminded me of is the first time I learned enough SQL to be dangerous.

You really, really are going to want to check jq out at least a little if you want to improve the state of the art in this area. It has an excellent manual btw.

Edit: you ask about "metaformats" such as one object per newline. jq handles this well too.
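
For anyone unfamiliar with the object-per-newline format, this is roughly what consuming it looks like. It is a Python sketch of the format itself, not of how jq works internally; the function name and sample records are invented for the example.

```python
import io
import json
from typing import Iterator, TextIO


def read_json_lines(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line.
    Assumes every line holds a complete JSON object."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample data standing in for a .jsonl file on disk.
sample = io.StringIO('{"user": "a", "n": 1}\n\n{"user": "b", "n": 2}\n')
records = list(read_json_lines(sample))
total = sum(rec["n"] for rec in records)
```

Because each line parses on its own, the file can be streamed, appended to, or split without loading everything into memory.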