Hacker News new | ask | show | jobs
by sa46 1848 days ago
Funny timing, I've just written most of a TypeScript generator for protobufs. I learned about some fun corners of protobufs I didn't expect trying to pass the protouf conformance tests [1] (which this one passes, that's no mean feat!).

- If you write the same message multiple times, protobuf implementations should merge fields with a last write wins policy (repeated fields are concatenated). This includes messages in oneofs.

- For a boolean array, you're better off using a packed, repeated int64 (if wire size matters a lot). Protobuf bools use varint encoding meaning you need at least 2 bytes for every boolean, 1+ for the tag and type and 1 byte for the 0 or 1 value. With a repeated int64, you'd encode the tag and length in 2 varints, and then you get 64 bools per 8 bytes.

- Fun trivia: Varints take up a max of 10 bytes but could be implemented in 9 bytes. You get 7 bits per varint byte, so 9 bytes gets you 63 bits. Then you could use the most significant bit of the last byte to indicate if the last bit is 0 or 1. Learned by reading the Go varint implementation [2].

- Messages can be recursive. This is easy if you represent messages as pointers since you can use nil. It's a fair bit harder if you want to always use a value object for each nested message since you need to break cycles by marking fields as `T | undefined` to avoid blowing the stack. Figuring out the minimal number of fields to break cycles is an NP hard problem called the minimum feedback arc set[3].

- If you're writing a protobuf implementation, the conformance tests are a really nice way to check that you've done a good job. Be wary of implementations that don't implement the conformance tests.

[1]: https://github.com/protocolbuffers/protobuf/tree/master/conf...

[2]: https://github.com/golang/go/blob/master/src/encoding/binary...

[3]: https://en.wikipedia.org/wiki/Feedback_arc_set#Minimum_feedb...

1 comments

The varint format also isnt as dense on average as it could be and allows for non-canonical encodings. I.e. you can encode any integer in multiple ways (up to 9 or 10 bytes)

The solution for this is to subtract 1 from the integer every time you encode a byte (since the existence of the next byte you're adding already indicates that the intermediate value isn't 0)