Hacker News
by nevir 727 days ago
> because really it comes down to the frequency of data parsing into and out of the protobuf format.

Protobuf is intentionally designed to NOT require any parsing at all. Data is serialized over the wire (or stored on disk) in the same format/byte order that it is stored in memory

(Yes, that also means that it's not validated at runtime)

Or are you referencing the code we all invariably write before/after protobuf to translate into a more useful format?

4 comments

You're likely thinking of Cap'n Proto or FlatBuffers. Protobuf definitely requires parsing. Zero values can be omitted on the wire, so there's no fixed layout and you can't seek to a field. To find a field's value, you must traverse the entire message and decode each tag number, since the last occurrence of a tag wins.

> Data is serialized over the wire (or stored on disk) in the same format/byte order that it is stored in memory

That's just not true. You can read about the wire format over here, and AFAIK no mainstream language stores things in memory like this: https://protobuf.dev/programming-guides/encoding

I've had to debug protobuf messages, which is not fun at all, and it's absolutely parsed.
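
To give a feel for what that parsing involves, here is a rough Python sketch of looking up one field in a raw message by hand. It is not how any real protobuf library is implemented; `read_varint`, `find_field`, and the sample bytes in `msg` are made up for illustration, and it ignores groups, packed repeated fields, and malformed input. The point is only that the reader has to walk every tag from the start, and that for a non-repeated scalar field the last occurrence wins.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:  # continuation bit clear: this was the last byte
            return result, pos
        shift += 7


def find_field(buf, wanted):
    """Scan the whole message; return the last raw value seen for `wanted`."""
    pos = 0
    found = None
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # VARINT
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # I64: fixed 8 bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # LEN: varint length, then that many bytes
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # I32: fixed 4 bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field_number == wanted:
            found = value  # keep going: a later occurrence overrides this one
    return found


# Hypothetical message: field 1 = 150, field 2 = "hi", then field 1 again = 1.
msg = bytes.fromhex("089601" "1202" "6869" "0801")
print(find_field(msg, 1))  # 1
print(find_field(msg, 2))  # b'hi'
```

Note that there is no way to jump straight to field 2: its offset depends on how many bytes field 1 happened to take.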

> Protobuf is intentionally designed to NOT require any parsing at all.

As others have mentioned, this is simply not the case, and the VARINT encoding is a trivial counterexample.
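
For anyone who hasn't looked at it, a minimal Python sketch of the varint scheme described in the encoding guide (7 payload bits per byte, high bit set when more bytes follow, least significant group first). The function names are mine, and this only handles non-negative integers:

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)         # final byte, continuation bit clear
            return bytes(out)


def decode_varint(buf, pos=0):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


print(encode_varint(1).hex())    # 01
print(encode_varint(150).hex())  # 9601
print(decode_varint(bytes.fromhex("9601")))  # (150, 2)
```

The encoded length varies with the value, so the bytes on the wire cannot double as a fixed-width integer in memory; something has to decode them.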

It is this required decoding/parsing that (largely) distinguishes protobuf from Google's flatbuffers:

https://github.com/google/flatbuffers

https://flatbuffers.dev/

Cap'n Proto (developed by Kenton Varda, the former Google engineer who, while at Google, rewrote/refactored Google's protobuf so it could later be open-sourced as the library we all know today) is another example of zero-copy (de)serialization.

> Protobuf is intentionally designed to NOT require any parsing at all

This is not true at all. If you have a language-specific class codegen'd by protoc then the in-memory representation of that object is absolutely not the same as the serialized representation. For example:

1. Integer values are varint-encoded in the wire format but obviously not in the in-memory format.

2. Variable-length fields are stored inline (and length-prefixed) in the wire format, while the in-memory representation will typically use some heap-allocated type, so that field holds a pointer instead of the data itself. The details depend on the language, of course.
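
A rough way to see both points side by side, using Python's `ctypes` to stand in for a C-style in-memory object. `PersonInMemory` and `wire_bytes` are hypothetical and not what protoc actually generates; the assumed message has an integer field 1 (`id`) and a string field 2 (`name`), and the hand-rolled encoder only handles values small enough that every varint fits in one byte:

```python
import ctypes


class PersonInMemory(ctypes.Structure):
    # Hypothetical layout: a fixed-width integer plus a pointer to the
    # string data, which lives elsewhere on the heap.
    _fields_ = [("id", ctypes.c_int64), ("name", ctypes.c_char_p)]


def wire_bytes(id_, name):
    """Hand-encode field 1 (varint) and field 2 (length-prefixed bytes)."""
    if not (0 <= id_ < 128 and len(name) < 128):
        raise ValueError("sketch only handles single-byte varints")
    # 0x08 = field 1, wire type VARINT; 0x12 = field 2, wire type LEN
    return bytes([0x08, id_, 0x12, len(name)]) + name


short = PersonInMemory(7, b"Ada")
long_ = PersonInMemory(7, b"A" * 100)

# In memory, the struct is the same size whatever the string length,
# because it only holds a pointer to the string.
print(ctypes.sizeof(short) == ctypes.sizeof(long_))  # True

# On the wire, the string bytes sit inline after a length prefix,
# so the message grows with the string.
print(wire_bytes(7, b"Ada").hex())      # 08071203416461
print(len(wire_bytes(7, b"A" * 100)))   # 104
```

Going from one representation to the other means decoding the varint into a fixed-width slot and copying the string bytes into a separate allocation, which is exactly the parsing step.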