by ornxka 1866 days ago
I'm not really sure why it has to be that hard, like, why don't we just use base64-encoded JSON or something?

> why don't we just use base64-encoded JSON or something?

Base64 would be counterproductive, increasing both space and parsing time... presumably out of fear that JSON would contain a naughty byte for a greenfield file format. It would be much better to just design the format to not have any naughty bytes.
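To put a rough number on the space cost, here is a minimal sketch. The record is made up for illustration and isn't taken from any real format; the only point is the 4-bytes-out-per-3-bytes-in ratio of base64.

```python
import base64
import json

# Hypothetical record standing in for some file metadata (not a real format).
record = {"name": "libexample.so", "symbols": ["init", "run", "shutdown"], "flags": 3}

raw = json.dumps(record, separators=(",", ":")).encode("utf-8")
encoded = base64.b64encode(raw)

# Base64 emits 4 output bytes per 3 input bytes, rounded up to a whole group.
expected_len = 4 * ((len(raw) + 2) // 3)
overhead = len(encoded) / len(raw)

print(len(raw), len(encoded), f"{overhead:.2f}x")
```

So you pay roughly a third more bytes on top of the JSON, plus a decode pass before the JSON parser can even start.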

Parsing time for executables and libraries is definitely on the critical startup path. You really want a length-delimited format, or better yet, one where offsets to various structures are stored at fixed offsets so you can find everything in O(1) time with a tiny constant factor.
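As a sketch of what "offsets stored at fixed offsets" means, assume a toy layout (this is not ELF, and the field names are invented): a 16-byte header at offset 0 that records where a single table lives and how long it is.

```python
import struct

# Hypothetical toy layout: magic, version, table offset, table length.
# "<" means little-endian with no padding, so the header is exactly 16 bytes.
HEADER = struct.Struct("<4sIII")

def build(names_blob: bytes) -> bytes:
    # The table is placed immediately after the header.
    header = HEADER.pack(b"TOY1", 1, HEADER.size, len(names_blob))
    return header + names_blob

def read_table(image: bytes) -> bytes:
    # One fixed-size read at a known position, then a direct slice.
    magic, version, offset, length = HEADER.unpack_from(image, 0)
    if magic != b"TOY1":
        raise ValueError("bad magic")
    return image[offset:offset + length]
```

Finding the table costs one header read regardless of file size; there is no scanning for delimiters or closing brackets.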

Compiler writers and tool authors are perfectly comfortable working with binary file formats. There's nothing more inherently future-compatible about JSON than a forward-compatible binary format like flatbuffers. Having to escape and then unescape naughty bytes is a huge downside for text-based formats that are hardly ever read by humans.

On a side note, zlib DEFLATE / gzip / LZMA etc. aren't magic for getting rid of space overheads. Try running gzip -9 on your system's wordlist, then convert it to UTF-16 and run gzip -9 on that. You'll see a size increase of several percent, despite an entropy change of at most a small constant number of bits (-log2(P(UTF-16)/P(UTF-8))). I've frequently seen big proponents of JSON use hand-wavy arguments that gzip will reduce any size differences to zero.
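If you'd rather run the experiment from a script, something like the following should do. The wordlist path is an assumption (common on Linux and macOS, but not guaranteed), and the fallback text is made up, so the exact numbers will vary with the input.

```python
import gzip
import os

def compressed_sizes(text: str) -> dict:
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return {
        "utf8_raw": len(utf8),
        "utf16_raw": len(utf16),
        "utf8_gz": len(gzip.compress(utf8, compresslevel=9)),
        "utf16_gz": len(gzip.compress(utf16, compresslevel=9)),
    }

# Assumed wordlist location; falls back to a small synthetic list if absent.
path = "/usr/share/dict/words"
if os.path.exists(path):
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
else:
    text = "\n".join(["apple", "banana", "cherry", "damson"] * 1000)

sizes = compressed_sizes(text)
print(sizes)
```

Compare `utf8_gz` against `utf16_gz` on a real wordlist to see how much of the encoding overhead survives compression.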

It's also nice if the file format is very close to being able to just be mmap()ed into the process's address space and only require minimal patching to a minimum number of pages in order to be an optimized in-memory representation.
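A rough illustration of the mmap() idea, using an invented layout (a little-endian u32 count followed by that many u32 values) written to a temporary file. Real loaders do this at the OS level, so treat this only as a sketch of the access pattern.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: u32 count at offset 0, then `count` u32 values.
values = [10, 20, 30, 40]
payload = struct.pack("<I", len(values)) + struct.pack(f"<{len(values)}I", *values)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload)
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    (count,) = struct.unpack_from("<I", m, 0)
    # Read the third entry straight from its offset: no parse pass, and only
    # the pages actually touched need to be brought into memory.
    (third,) = struct.unpack_from("<I", m, 4 + 2 * 4)

os.remove(path)
print(count, third)
```

With a text format you'd have to parse the whole thing into a separate in-memory structure before you could do the equivalent lookup.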

Also, there's a huge amount of momentum behind executable formats. Incremental improvements, such as adding new features in new segment types or appending new fields to old data structures (where there's no ambiguity), are much preferred to wholly new formats.

So, creating a new debugging symbol section that's just flatbuffers is workable. Replacing the whole ecosystem with base64-'d JSON would have way more downsides than upsides.

On another side note, you need to be very careful with JSONifying floating point values. Many libraries don't give you bit-perfect round-tripping of IEEE-754 double precision values.
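Here's a small sketch of how the bits get lost. The 15-significant-digit formatter stands in for a hypothetical serializer with a lossy default; CPython's own json module writes floats with repr(), which does round-trip.

```python
import json
import struct

def bits(x: float) -> bytes:
    # Raw IEEE-754 double-precision bit pattern, big-endian.
    return struct.pack(">d", x)

x = 0.1 + 0.2  # 0.30000000000000004 as a double

# Hypothetical lossy serializer: only 15 significant digits are written.
lossy = float(format(x, ".15g"))

# 17 significant digits are always enough to round-trip a double.
exact = float(format(x, ".17g"))

# CPython's json module uses repr() for floats, which round-trips.
via_json = json.loads(json.dumps(x))

print(bits(lossy) == bits(x), bits(exact) == bits(x), bits(via_json) == bits(x))
# prints: False True True
```

Whether you're safe depends entirely on which library wrote the file and which one reads it. NaN and the infinities are a separate problem, since the JSON grammar has no way to spell them.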

YHBT. HAND.
Heh. Hopefully some of the production systems I work with are also well-played trollings.
That's a good point about the critical path - I was thinking it would be a bit bigger and slower since you'd have to decode it, but I hadn't realized what an impact that would probably have. No bit-perfect round trips is also absolutely horrifying and I would never have even thought that would be a thing.

>Compiler writers and tool authors are perfectly comfortable working with binary file formats. There's nothing more inherently future-compatible about JSON than a forward-compatible binary format like flatbuffers. Having to escape and then unescape naughty bytes is a huge downside for text-based formats that are hardly ever read by humans.

Well, the problem isn't binary or non-binary, the problem is that these formats like ELF are, apparently, really annoying to deal with, have weird limitations, and are difficult to extend. The reason I thought of JSON in particular is because it doesn't really need to be extended to encode anything (unless you include escaping or base64 encoding binary data you want to put inside of a JSON document as "extending" it). You can encode all of the fields in ELF (or any other format) inside of JSON, while it doesn't make sense to consider the converse because ELF has fixed fields with fixed meanings.

That's the problem with these bespoke binary formats like ELF - they're not designed to encode arbitrary schemas of data, they're designed for very specific tasks and then when they get used outside of their intended environment, we get problems like have been described in this thread. Nobody has ever had these problems with a JSON document - maybe with something that consumed one, but the file format itself simply does not have the same kind of limitations like ELF does. It has different limitations, but they're not of a fundamental and semantic nature like they are in a more rigid format.

You're right that it would be a problem to have to escape/unescape every section every time you wanted to run something because that's very slow, but I think that's basically the only problem that these bespoke binary formats solve. If that's the case, I wonder why something like Matroska wouldn't work for binaries? My understanding is that it's essentially binary XML and allows for a completely arbitrary dictionary structure. It doesn't have nice tooling like JSON or XML do, but there are no weird restrictions on things like field length that I'm aware of. I guess it doesn't exactly have any "momentum", though, but maybe the NixOS people will get sick enough of ELF to consider such a drastic solution :P

> That's the problem with these bespoke binary formats like ELF - they're not designed to encode arbitrary schemas of data, they're designed for very specific tasks and then when they get used outside of their intended environment, we get problems like have been described in this thread. Nobody has ever had these problems with a JSON document - maybe with something that consumed one, but the file format itself simply does not have the same kind of limitations like ELF does. It has different limitations, but they're not of a fundamental and semantic nature like they are in a more rigid format

This is nothing specific to binary formats, but specific to insufficiently extensible formats. Note that I specifically mentioned flatbuffers, which provide for extensibility while keeping parsing latency low.

Also, ELF was designed to be extensible by adding new sections. You could totally add functionality by adding a new section holding JSON data.

Don't confuse JSON with extensibility. I've seen plenty of headaches with poorly designed JSON schemas where forward compatibility wasn't sufficiently thought through. There are also tons of elegantly extensible binary formats. ELF is just old; much older than JSON. A new binary format would probably be more elegantly extensible.
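As one illustration of an extensible binary layout, here is a toy tag-length-value sketch. The tags and record shape are invented, and this is not how flatbuffers does it (flatbuffers uses vtables); it just shows that an old reader can skip fields it doesn't know about.

```python
import struct

# Hypothetical record header: tag (u16) and payload length (u32), little-endian.
RECORD = struct.Struct("<HI")

def write_record(tag: int, payload: bytes) -> bytes:
    return RECORD.pack(tag, len(payload)) + payload

def read_known(blob: bytes, known_tags: set) -> dict:
    out = {}
    pos = 0
    while pos < len(blob):
        tag, length = RECORD.unpack_from(blob, pos)
        pos += RECORD.size
        if tag in known_tags:
            out[tag] = blob[pos:pos + length]
        # Unknown tags are skipped via the length prefix, so records added by
        # newer writers don't break older readers.
        pos += length
    return out
```

For example, a reader that only knows tags 1 and 2 will step over a tag-99 record written by a newer tool and still recover the fields it understands.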

> why don't we just use base64-encoded JSON or something?

You should try it and let us know how that works.