Hacker News
by misframer 3630 days ago
Well-designed text formats too?

Not necessarily. (I'd actually argue "not at all.") Presumably you have a text format in the first place because you want your representation to be human-readable [with common tools like a text editor], and very likely human-writable as well. Those are the real constraints of a (useful) text format, and they tend to be in direct conflict with high-performance parsing, or partial parsing.

For an arbitrary example, a binary format could have an index table of objects at the start of the file, and then you could perform partial reads to access only the subset of objects you care about. That's something you could do in a text format too, but if the file is edited in a text editor you can't guarantee that the user remembered or bothered to update the index when they added a new object. The parser would effectively not be able to trust the index, and have to parse the entire file. (I suppose you could use CRCs or something to enforce this, but then you'd end up with a very brittle format that people get frustrated when trying to edit.)
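To make the index-table idea concrete, here is a minimal sketch in Python. The layout is purely hypothetical (a little-endian u32 object count, then one offset/length pair per object, then the payloads), and `pack_objects` and `read_object` are names invented for this illustration, not any real format's API.

```python
import io
import struct

# Hypothetical layout: u32 count, then count x (u32 offset, u32 length), then payloads.
HEADER = struct.Struct("<I")
ENTRY = struct.Struct("<II")

def pack_objects(objects):
    """Serialize a list of bytes objects with an index table up front."""
    data_start = HEADER.size + ENTRY.size * len(objects)
    index = bytearray()
    offset = data_start
    for obj in objects:
        index += ENTRY.pack(offset, len(obj))
        offset += len(obj)
    return HEADER.pack(len(objects)) + bytes(index) + b"".join(objects)

def read_object(f, i):
    """Read only object i: one seek into the index, one seek into the payload."""
    f.seek(0)
    (count,) = HEADER.unpack(f.read(HEADER.size))
    if not 0 <= i < count:
        raise IndexError(i)
    f.seek(HEADER.size + ENTRY.size * i)
    offset, length = ENTRY.unpack(f.read(ENTRY.size))
    f.seek(offset)
    return f.read(length)

blob = pack_objects([b"alpha", b"bravo!", b"c"])
second = read_object(io.BytesIO(blob), 1)  # never touches b"alpha" or b"c"
```

The reader only works because the writer computed every offset. Insert a few bytes into the first payload by hand and every later index entry silently points at the wrong place.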

Really, the true advantage of a binary format is you generally assume that nobody messed with the data behind your back, so you can have duplicate data (like an index) if you want without worrying that it's out of sync. This pretty much goes hand in hand with the fact that you can't just open it in a text editor and fiddle with stuff.

TL;DR: Human-writability and high performance are arguably mutually exclusive features.

> Really, the true advantage of a binary format is you generally assume that nobody messed with the data behind your back

I would rephrase that a bit and say the true advantage is flexibility, as you're not subject to the constraints of textual data.

The integrity of the data is a separate matter, and should be carefully verified rather than trusted implicitly. A huge number of security vulnerabilities, and program crashes in general, come from wrongly assuming that user-supplied data is correct.
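As a rough sketch of "verify rather than trust", assuming a hypothetical index entry that stores an offset, a length, and a CRC-32 of the payload (the entry layout and the name `checked_read` are made up for this example):

```python
import struct
import zlib

ENTRY = struct.Struct("<III")  # hypothetical: offset, length, crc32 of payload

def checked_read(blob, entry_bytes):
    """Return the payload an index entry points at, or raise if the entry is bogus."""
    offset, length, crc = ENTRY.unpack(entry_bytes)
    # Bounds check first. Python slicing would not crash on a bad range, but it
    # would silently hand back truncated data, which is just as wrong.
    if offset > len(blob) or length > len(blob) - offset:
        raise ValueError("index entry points outside the file")
    payload = blob[offset:offset + length]
    if zlib.crc32(payload) != crc:
        raise ValueError("payload checksum mismatch")
    return payload

blob = b"....hello"
ok_entry = ENTRY.pack(4, 5, zlib.crc32(b"hello"))
bad_entry = ENTRY.pack(4, 500, 0)  # length runs far past the end of the data

payload = checked_read(blob, ok_entry)
```

Note that a CRC only catches accidental corruption. Anyone deliberately modifying the file can recompute it, so it is no substitute for the bounds check.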

Indeed. Another example is that you could have a length field in a text format that precedes a string, so you can skip over it without parsing it. But humans will forget to update it, or update it incorrectly.
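Here is a small sketch of that failure mode, using a made-up `<length>:<payload>,` text encoding loosely modeled on netstrings. `skip_field` is a hypothetical helper, and it counts characters rather than bytes to keep the example short.

```python
def skip_field(text, pos):
    """Skip one '<length>:<payload>,' field without looking at the payload.

    Returns the position just past the field.
    """
    colon = text.index(":", pos)
    length = int(text[pos:colon])
    end = colon + 1 + length
    # The trailing comma is the only sanity check we get for free.
    if text[end:end + 1] != ",":
        raise ValueError("length prefix does not match payload")
    return end + 1

good = "5:hello,5:world,"
after_first = skip_field(good, 0)  # jumps straight to the second field

# Someone edits the payload in a text editor and forgets the length prefix.
edited = "5:hello!!,5:world,"
try:
    skip_field(edited, 0)
    stale_detected = False
except ValueError:
    stale_detected = True
```

The trailing comma happens to catch this particular edit. Without a terminator like that, the parser would just resume in the middle of the payload and misread everything after it.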