Not to derail but what’s more efficient in your view? We compared messagepack, standard http/json and protobufs for an internal service and protobufs came out tops on every measure we had.
The gold standard is a purpose-built protocol for each message, usually coming in ~20x faster and ~2-8x smaller than a comparable proto. It's perhaps obvious why Google doesn't do this: the developer workload goes up for every message even in a single language, it grows linearly with the number of languages you support, you can't shove most of the bugginess questions onto a single shared library, and backwards compatibility is complicated with custom protocols. They really do want you to be able to link against most g3 code without interop concerns. I've had a lot of success in my career with custom protocols in performance-sensitive applications, and I wouldn't hesitate to do it again.
Barring that though, capnproto and flatbuffers (perhaps with compression on slow networks) are usually faster than protos. Other people have observed that performance deficit on many occasions and made smaller moderately general-purpose libraries before too (like SBE). They all have their own flavors of warts, but they're all often much faster for normal use cases than protos.
As a hybrid, each project defining its own (de)serializer library can work well too. I've done that a few times, and it's pretty easy to squeeze out 10x-20x throughput for the serialization features your project actually needs while still only writing the serialization crap once and reusing it for all your data types.
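A rough sketch of what that hybrid can look like, assuming fixed-width little-endian fields are all the project needs. The `make_codec` helper and the `Tick`/`Heartbeat` types are hypothetical names invented for illustration; the only real API used is Python's `struct` module.

```python
import struct
from typing import NamedTuple


def make_codec(record_type, fmt):
    """Build (encode, decode, size) for a NamedTuple with a fixed struct layout.

    The layout is compiled once; every record type in the project reuses
    this one helper instead of hand-writing its own (de)serializer.
    """
    packer = struct.Struct(fmt)

    def encode(record):
        return packer.pack(*record)

    def decode(buf, offset=0):
        return record_type._make(packer.unpack_from(buf, offset))

    return encode, decode, packer.size


# Hypothetical project data types.
class Tick(NamedTuple):
    symbol_id: int    # u32
    price_cents: int  # i64
    quantity: int     # u32


class Heartbeat(NamedTuple):
    node_id: int   # u16
    uptime_s: int  # u32


encode_tick, decode_tick, TICK_SIZE = make_codec(Tick, "<IqI")
encode_hb, decode_hb, HB_SIZE = make_codec(Heartbeat, "<HI")

tick = Tick(symbol_id=42, price_cents=-1999, quantity=100)
wire = encode_tick(tick)
```

There are no tags, no varints and no optional fields here, which is exactly the trade: the library only supports what the project actually uses.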
Recapping on a few reasons why protos are slow:
- There's a data dependency built into the wire format which is very hard to work around. It blocks nearly all attempts at CPU pipelining and vectorization.
- Lengths are prefixed (and the data is variable-length), requiring (recursively) you to serialize a submessage before serializing its header -- either requiring copies or undersized syscalls.
- Fields are allowed to appear in any order, preventing any sort of code which might make the branch predictor happy.
- Some non-"zero-copy" protocols are still quite fast since you can get away with a single allocation. Since several decisions make walking the structure slow, that's way more expensive than it should be for protos, requiring either multiple (slow) walks or recursive allocations.
- The complexity of the format opens up protos to user error. Nonsense like using a 10-byte slow-to-decode-varint for the constant -1 instead of either 1, 4, or 8 fast-to-decode bytes (which _are_ supported by the wire format, but in the wild I see a lot of poorly suited proto specs).
- The premise in the protocol that you'll decode the entire type exactly as the proto defines prevents a lot of downstream optimizations. If you want a shared data language (the `.proto` file), you have to modify that language to enforce, e.g., non-nullability constraints (you'd prefer to quickly short-circuit those as parse errors, but instead you need extra runtime logic to parse the parsed proto). You start having to trade off reusability for performance.
And so on. It's an elegant format that solves some real problems, but there are precious few cases where it's a top contender for performance (those cases tend to look like bulk data in some primitive type protos handle well, as opposed to arbitrary nesting of 1000 unrelated fields).
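Two of the points above are easy to see at the byte level. The sketch below hand-rolls the protobuf varint and length-delimited encodings (the helper names are made up for this example; it does not use any protobuf library) to show the 10-byte -1 and the fact that a submessage's header can't be written until its body has been sized.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def encode_varint(value: int) -> bytes:
    """Base-128 varint; negative inputs are treated as 64-bit two's complement."""
    value &= MASK64
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def zigzag64(value: int) -> int:
    """ZigZag mapping used by sint32/sint64: small magnitudes become small codes."""
    return ((value << 1) ^ (value >> 63)) & MASK64


def encode_submessage(field_number: int, payload: bytes) -> bytes:
    """Length-delimited field: tag, then varint length, then the payload.

    The length comes before the payload, so the payload has to be fully
    serialized (or sized in a separate walk) before the header is written.
    """
    tag = (field_number << 3) | 2  # wire type 2 = length-delimited
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# The constant -1 under three different field types.
as_int64 = encode_varint(-1)               # int32/int64: sign-extended varint
as_sint64 = encode_varint(zigzag64(-1))    # sint32/sint64: ZigZag varint
as_sfixed32 = struct.pack("<i", -1)        # sfixed32: always 4 bytes

# Field 1 = 150 (varint), nested as field 3 of an outer message.
inner = encode_varint((1 << 3) | 0) + encode_varint(150)
outer = encode_submessage(3, inner)
```

Here `as_int64` is ten bytes, `as_sint64` is one, and `as_sfixed32` is four with no continuation-bit loop to decode.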
Specific languages might have (of course) failed to optimize other options so much that protos still win. It sounds like you're using golang, which I've not done much with (coming from other languages, I'm mildly surprised that messagepack didn't win any of your measurements), and by all means you should choose tools based on the data you have. My complaints are all about what the CPU is capable of for a given protocol, and how optimization looks from a systems language perspective.
What does a 'purpose-built protocol for each message' look like? You avoid type/tagging overhead, but other than that I'd expect a "sufficiently smart" generic protocol to be able to achieve the same level of e.g. data layout optimization. Obviously ProtoBuf in particular is pessimising for the reasons you describe, but I'm thinking of other protocols (e.g. Flatbuffers, Cap'n Proto, etc.)
The problem is that "sufficiently smart" does a lot of heavy lifting.
One way to look at the problem is to go build a sufficiently smart generic protocol and write down everything that's challenging to support in v1. You have tradeoffs between size (slow for slow networks), data dependencies (slow for modern CPUs), lane segmentation (parallel processing vs cache-friendly single-core access vs code complexity), forward/backward compatibility, how much validation the protocol should do, and so on. Any specific data serialization problem usually has some outside knowledge you can use to remove or simplify a few of those "requirements," and knowledge of the surrounding system can further guide you to have efficient data representations on _both_ sides of the transfer. Code that's less general-purpose tends to have more opportunities for being small, fast, and explainable.
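As a toy illustration of using outside knowledge, suppose (purely as an assumption for this sketch) the message is a batch of sensor readings where timestamps are sorted, consecutive readings are never more than 65535 ms apart, and values fit in a signed 16-bit count of centi-degrees. None of these names or constraints come from a real system.

```python
import struct

_HEADER = struct.Struct("<QH")   # base timestamp in ms, number of readings
_READING = struct.Struct("<Hh")  # ms since previous reading, value in centi-degrees


def encode_batch(readings):
    """readings: non-empty list of (timestamp_ms, centi_degrees), sorted by time."""
    base = readings[0][0]
    out = bytearray(_HEADER.size + _READING.size * len(readings))  # one allocation
    _HEADER.pack_into(out, 0, base, len(readings))
    prev = base
    offset = _HEADER.size
    for ts, value in readings:
        # struct.error is raised here if the assumed ranges are violated.
        _READING.pack_into(out, offset, ts - prev, value)
        prev = ts
        offset += _READING.size
    return bytes(out)


def decode_batch(buf):
    ts, count = _HEADER.unpack_from(buf, 0)
    end = _HEADER.size + count * _READING.size
    readings = []
    for delta, value in _READING.iter_unpack(buf[_HEADER.size:end]):
        ts += delta
        readings.append((ts, value))
    return readings


readings = [
    (1_700_000_000_000, 2150),
    (1_700_000_000_250, 2148),
    (1_700_000_001_000, -30),
]
blob = encode_batch(readings)
```

Each reading costs 4 bytes at a fixed stride, there are no tags to branch on, and the total size is known before anything is written. The price is that every one of those assumptions is now part of the protocol.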
A common source of inefficiencies (protobuf is not unique in this) is the use of a schema language in any capacity as a blunt weapon to bludgeon the m x n problem between producers and consumers. The coding pattern of generating generic producers/consumers doesn't allow for fine-tuning of any producer/consumer pair.
Picking on flatbuffers as an example (I _like_ the project, but I'll ignore that sentiment for the moment), the vtable approach is smart and flexible, but it's poorly suited (compared to a full "parse" step) to data you intend to access frequently, especially when doing narrow operations. It's an overhead (one that reduces the ability for the CPU to pipeline your operations) you incur precisely by trying to define a generic format which many people can produce and consume, especially when the tech that produces that generic format is itself generic (operating on any valid schema file). Fully generic code is hard enough to make correct, much less fast, so in the aim of correctness and maintainability you usually compromise on speed somewhere.
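The access-pattern difference can be sketched with a deliberately simplified offset table. This is not the real FlatBuffers layout, just a stand-in with the same shape of indirection, and all names in it are hypothetical. In Python it says nothing about actual speed; it only shows where the extra dependent load sits.

```python
import struct

FIELD_COUNT = 3
_VTABLE = struct.Struct("<3H")  # one byte offset per field, 0 means absent
_U16 = struct.Struct("<H")
_U32 = struct.Struct("<I")


def build(values):
    """values: list of FIELD_COUNT ints or None. Offset table, then present fields."""
    offsets, body = [], bytearray()
    for v in values:
        if v is None:
            offsets.append(0)
        else:
            offsets.append(_VTABLE.size + len(body))
            body += _U32.pack(v)
    return _VTABLE.pack(*offsets) + bytes(body)


def get_lazy(buf, index, default=0):
    """Every access does two dependent reads: the offset, then the value."""
    (offset,) = _U16.unpack_from(buf, _U16.size * index)
    if offset == 0:
        return default
    return _U32.unpack_from(buf, offset)[0]


def parse(buf, default=0):
    """Pay the indirection once; later accesses are plain tuple indexing."""
    return tuple(get_lazy(buf, i, default) for i in range(FIELD_COUNT))


buf = build([7, None, 9])
record = parse(buf)
```

If a field is read once, `get_lazy` is the cheaper path. If it's read in a hot loop, the parsed form avoids redoing the offset lookup and the absent-field branch on every access.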
For that (slightly vague) flatbuffers example, the "purpose-built protocol" could be as simple as almost anything else with a proper parse step. That might even be cap'n proto, though that also has problems in certain kinds of nested/repeated structures because of its arena allocation strategy (better than protobuf, but still more allocations and wasted space than you'd like).