by snidane 1704 days ago
Json is a good format to represent results of aggregation queries (group by in sql) using nesting and storing data in a single file.
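To make the nesting point concrete, here is a minimal sketch of a group-by result stored as one nested JSON document. The table contents and the `nest_by` helper are made up for illustration and are not from any particular system.

```python
import json
from itertools import groupby
from operator import itemgetter

# Hypothetical flat table, one dict per row.
rows = [
    {"country": "US", "city": "Boston", "sales": 10},
    {"country": "US", "city": "Austin", "sales": 7},
    {"country": "DE", "city": "Berlin", "sales": 4},
]

def nest_by(rows, key):
    """Group rows by `key` and nest the member rows under each group."""
    ordered = sorted(rows, key=itemgetter(key))  # groupby needs sorted input
    result = []
    for value, group in groupby(ordered, key=itemgetter(key)):
        # The grouping value is stored once per group, not once per row.
        members = [{k: v for k, v in r.items() if k != key} for r in group]
        result.append({
            key: value,
            "total_sales": sum(m["sales"] for m in members),
            "rows": members,
        })
    return result

nested = nest_by(rows, "country")
print(json.dumps(nested, indent=2))
```

The aggregate and the detail rows live in the same file, so nothing needs to be joined at read time.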

Without that you would need to either

  1. store multiple non-nested (tabular, e.g. csv) files and join them at the time of use.
  2. denormalize all these csvs into a single big csv, duplicating the same values over and over. Compression should handle this at storage time, but you still pay the cost when reading.
  3. store values by columns, not by rows, adding various RLE and dict encodings to compress repeated values in columns, making the files not human friendly
  4. once you store it in columns and make it unreadable, just store it as binary instead of text. You get parquet
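As a rough illustration of step 3, this is what dictionary plus run-length encoding of a single column could look like. It is a toy sketch, not how parquet actually lays out its pages; the function names and the sample column are invented for the example.

```python
def dict_encode(column):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = []   # code -> value
    index = {}        # value -> code
    codes = []
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into a flat list."""
    return [v for v, n in runs for _ in range(n)]

# Hypothetical column taken from a denormalized table.
country = ["US", "US", "US", "DE", "DE", "US"]
dictionary, codes = dict_encode(country)
runs = rle_encode(codes)
restored = [dictionary[c] for c in rle_decode(runs)]
```

The encoded form is compact but no longer something you would read or edit by hand, which is the trade-off described in the list.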
Json and csv are simple, and for that reason they won and will stay with us no matter how hard you try to add features to them.

That said I think adding a trailing comma and comments to json wouldn't be a big stretch.

The battle will be for the best columnar binary format. Parquet is the closest to a standard, but it seems to be used only as a storage standard. Big data systems still decompress it and work with their own representation. The holy grail is a columnar format good enough that big data systems use it as their underlying data representation instead of coming up with their own. I suspect such a format will come from something like an open-sourced Snowflake, Clickhouse, Chaossearch or similar, with battle-tested, performant algorithms built on it, rather than from something designed by committee, such as parquet.

2 comments

> That said I think adding a trailing comma and comments to json wouldn't be a big stretch.

Sadly, json's designers suffered from the same hubris as the designers of markdown and gemini, when they decided to not include a version number in the file format. So you are kind of hosed if you want to make a change like that.

Before json there was xml (ugh), but before xml there were Lisp S-expressions, which seem to have handled all these issues perfectly well 50 years ago. Yet we keep re-inventing them. Greenspun's tenth law is still with us.

It's just a matter of parser implementation. These changes are backwards compatible. If python decided to add support for comments and trailing commas in json.loads, that would become the new standard, at least for data scientists, not for web devs. All the other ones would then follow.
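For what it's worth, a lenient front end to the existing parser is not much code. The sketch below only handles trailing commas (not comments), and `strip_trailing_commas` is a hypothetical helper, not something in Python's `json` module.

```python
import json

def strip_trailing_commas(text):
    """Drop commas that are followed only by whitespace and then ] or }.

    Commas inside string literals are left alone. This is a lenient
    sketch: it would also turn the invalid "[,]" into "[]".
    """
    out = []
    in_string = False
    escaped = False
    n = len(text)
    for i, ch in enumerate(text):
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch == ",":
            # Look past whitespace to see what follows the comma.
            j = i + 1
            while j < n and text[j] in " \t\r\n":
                j += 1
            if not (j < n and text[j] in "]}"):
                out.append(ch)  # an ordinary separator, keep it
        else:
            out.append(ch)
    return "".join(out)

relaxed = '{"a": [1, 2,], "b": "x,]",}'
parsed = json.loads(strip_trailing_commas(relaxed))
```

Documents without trailing commas pass through unchanged, which is the sense in which the change is backwards compatible on the reading side.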
Now whatever generates your data has to know what parser is going to read the data. The parser can't tell right away whether the data has those trailing commas. They are optional, so they might not start appearing until after gigabytes of output have gone by. So you can't count on a quick error message in the event of a version mismatch.
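A small sketch of that failure mode, using Python's strict `json.loads` as the reader. The newline-delimited stream and the record shape are assumptions made to keep the example short; the point is only that nothing fails until the first trailing comma actually shows up.

```python
import json

def first_bad_record(lines):
    """Return the index of the first line json.loads rejects, or None."""
    for i, line in enumerate(lines):
        try:
            json.loads(line)
        except json.JSONDecodeError:
            return i
    return None

# Hypothetical output of a writer that is allowed to emit trailing commas:
# a thousand ordinary records, then one that uses the optional comma.
stream = ['{"id": %d}' % i for i in range(1000)]
stream.append('{"id": 1000,}')

bad = first_bad_record(stream)
```

With no version marker at the top of the stream, the strict reader processes every earlier record happily and only learns about the mismatch at the very end.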
If you have gigabytes of handwritten JSON (if it's not handwritten, trailing vs non-trailing commas surely don't matter), then I feel like you're doing something wrong.

Though I'm sure someone's going to step in and say "Have you not heard of [stupendously niche use case]? Are you living under a rock!?" etc etc ;)

It's silly to write your software to handle only the inputs you think are likely, based on some predictions about humans, instead of every possible input. Failure to handle every input is why YAML is so broken.

JSON isn't a format conducive to handwriting, even if it probably should have made more accommodations for that at the start. Right now it can't even handle trailing newlines. But if you want to fix that, call it something different (maybe even JSON2), for heaven's sake.

I doubt anyone would handwrite an entire gigabyte JSON document, but they might hand-edit a machine-generated one to make a change someplace in it, end up putting in a trailing comma, and have the document pass their local tests but crash a remote parser.

> Though I'm sure someone's going to step in and say "Have you not heard of [stupendously niche use case]?"

> they might hand-edit a machine-generated one to make a change someplace in it

Ah, there's [stupendously niche use case] ;)

Seriously, though, I do agree with your point that good software should handle every edge case. I'm not arguing that.

But the case for having trailing commas does seem to be generally predicated on handwritten JSON, so I'm saying it's _unlikely_ it would be used in that way, and therefore that such failures would be rare and thus not a very grave counterargument.

You mean, Apache Arrow?
Partially.

The problem with Apache Arrow and Parquet is that you have two - one for storage and one for computation - but in the end you only want one for both. You want to run fast algorithms on memory mapped compressed columns. Not doing this stupid deserialization from parquet to arrow.
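A toy version of "compute directly on the mapped column", using only the standard library. It assumes an uncompressed column of native 64-bit integers written to a throwaway temp file, so it skips the compression part entirely; it is not how Arrow or Parquet work internally.

```python
import mmap
import os
import tempfile
from array import array

# Write a hypothetical int64 column to disk as raw native-endian bytes.
values = array("q", range(1_000))
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(values.tobytes())
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    raw = memoryview(mm)       # view over the mapped bytes, no copy
    column = raw.cast("q")     # reinterpret the same bytes as int64
    total = sum(column)        # compute straight off the mapping
    count = len(column)
    # Views must be released before the mapping can be closed.
    column.release()
    raw.release()
    mm.close()

os.remove(path)
```

The on-disk layout and the in-memory layout are the same bytes here, so there is no separate deserialization step between storage and computation.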

Parquet and arrow are designed by committee and try to accomplish too much for that matter. While that's good for some cases, my prediction is that there will exist a data processing system in the future whose file format will support that and be good enough for most data-intensive applications. Like json, it will not be feature complete, but it will be good enough. Some devs from then on will complain and ask for this or that feature to be added to the format, but the majority will be happy, as they are now with json. Such a format can only come from industry, not from a committee.

Right. That's why I am more interested in arrow than parquet. Going from a pure compressed storage format to one that incorporates computation would be more difficult than going from a memory-mapped / computation format to long-term storage. Arrow already made some good choices regarding data exchange over the wire, and these are translatable to data exchange over time.

Of course, I am only dealing with a few hundred GiB of data, so I'm not sure whether arrow fails at larger scale.