Hacker News new | ask | show | jobs
by daurnimator 3694 days ago
FWIW the protobuf varint encoding was seen as a mistake by the inventor. https://groups.google.com/d/msg/capnproto/Qxw7YSoPP18/HS06Kc...
3 comments

First of all, I believe that Kenton Varda is not the inventor of protobuffs, but rather a Google engineer who did a bunch of work on Protobuffs v2 (I believe on both of the v2 specs for protobuffs, because there are two different v2s). He is certainly quite knowledgeable, but not "the inventor".

Second, while he does characterize it as a "huge mistake" for encoding/decoding performance reasons, there are a few caveats that are obvious here.

1) Half of his remarks here explain why varints are inappropriate for CapnProto, a separate project of his, inspired by his experience with protobuffs. CP has different design goals than PB, and in that context varints are inappropriate. Fine.

2) His argument here against varints in protobuffs (as contrasted with in capnproto) is only that they cost CPU time (and thus latency). That's surely true, but it's not at all clear to me that he's right that this is, in general, the wrong performance tradeoff.

3) In general, if you want to represent a long-tailed distribution of numbers, some variable-length encoding has merits, because the alternative (that protobuffs promote, IIUC) are fixed-width ints, which means you waste a lot of space to accommodate your long tail, and even so you might not have allocated enough space for the largest integer you ever need to handle.

FWIW, before I even joined Google, there were notes from Sanjay (the actual inventor) saying that varint was a poorly-chosen format due to excessive branching. IIRC he was arguing for a format where you can tell the length based on the first byte although I don't completely remember. In theory you could design such a format that is just as space-efficient as varint, by re-arranging bits.

Protobuf -- like most successful systems -- was not so much "designed" as it was "evolved" as a series of decisions made out of expediency rather than because they were the best possible answer. Overall it has obviously worked extremely well, but I would not assume that just because protobuf does something, it is good. (But I'd say that about just about everything, not just protobuf.)

The mistake wasn't that particular varint encoding, but that it was variable length. That means you need to perform more complex decoding which is not the tradeoff chosen for Cap N Proto.
The protobuf varint format itself actually is pretty expensive to decode-- the way it's encoded pretty much mandates a branch per seven bit chunk (of the resulting integer); you can't tell in advance how long an integer is (as opposed to a length-prefixed varint, where you can).

That's in addition to the encoding-independent overhead inherent to any variable-length integer scheme, which is what mandates the encoding/decoding step in protobuf. Combined, this makes protobuf pretty expensive to decode (protobuf varints are everywhere; they're used for field keys and length delimiters for length-delimited field types).

That's unrelated to ZigZag encoding of negative numbers, though, which is fairly straightforward and computationally inexpensive (it's a couple bitshifts and an XOR).

Funny thing is, despite all this, protobuf beats almost everything in benchmarks, in part due to excessive micro-optimization. Almost all varints in practice are 1-byte, so the larger ones don't really matter.

That said, the fact that you can't skip forward in a protobuf without parsing through all the previous fields is unfortunate in some use cases, hence Cap'n Proto.

Yeah, we've been pretty happy with protobuf in the projects where we've used it.

Cap'n Proto would've been a pretty strong contender if it had existed when we started using protobuf, though. Serialization/deserialization was on the critical path for our server-side code, so we most likely would have been able to get at least a bit of a performance boost out of Cap'n Proto.

I'm not the inventor of protobuf, just a guy who rewrote and open sourced it. My rewrite did not change the underlying encoding.