| The gold standard is a purpose-built protocol for each message, usually coming in ~20x faster and ~2-8x smaller than a comparable proto (it's perhaps obvious why Google doesn't do this, since the developer workload is increased for every message even in a single language, and it's linear in the number of languages you support, without the ability to shove most of the bugginess questions to a single shared library, and backwards compatibility is complicated with custom protocols -- they really do want you to be able to link against most g3 code without interop concerns). I've had a lot of success in my career with custom protocols in performance-sensitive applications, and I wouldn't hesitate to do it again. Barring that though, capnproto and flatbuffers (perhaps with compression on slow networks) are usually faster than protos. Other people have observed that performance deficit on many occasions and made smaller moderately general-purpose libraries before too (like SBE). They all have their own flavors of warts, but they're all often much faster for normal use cases than protos. As a hybrid, each project defining its own (de)serializer library can work well too. I've done that a few times, and it's pretty easy to squeeze out 10x-20x throughput for the serialization features your project actually needs while still only writing the serialization crap once and reusing it for all your data types. Recapping on a few reasons why protos are slow: - There's a data dependency built into the wire format which is very hard to work around. It blocks nearly all attempts at CPU pipelining and vectorization. - Lengths are prefixed (and the data is variable-length), requiring (recursively) you to serialize a submessage before serializing its header -- either requiring copies or undersized syscalls. - Fields are allowed to appear in any order, preventing any sort of code which might make the branch predictor happy. 
- Some non-"zero-copy" protocols are still quite fast since you can get away with a single allocation. Since several decisions make walking the structure slow, that's way more expensive than it should be for protos, requiring either multiple (slow) walks or recursive allocations. - The complexity of the format opens up protos to user error. Nonsense like using a 10-byte, slow-to-decode varint for the constant -1 instead of either 1, 4, or 8 fast-to-decode bytes (which _are_ supported by the wire format, but in the wild I see a lot of poorly suited proto specs). - The premise in the protocol that you'll decode the entire type exactly as the proto defines prevents a lot of downstream optimizations. If you want a shared data language (the `.proto` file), you have to modify that language to enforce, e.g., non-nullability constraints (you'd prefer to quickly short-circuit those as parse errors, but instead you need extra runtime logic to parse the parsed proto). You start having to trade off reusability for performance. And so on. It's an elegant format that solves some real problems, but there are precious few cases where it's a top contender for performance (those cases tend to look like bulk data in some primitive type protos handle well, as opposed to arbitrary nesting of 1000 unrelated fields). Specific languages might have (of course) failed to optimize other options so much that protos still win. It sounds like you're using golang, which I've not done much with (coming from other languages, I'm mildly surprised that messagepack didn't win any of your measurements), and by all means you should choose tools based on the data you have. My complaints are all about what the CPU is capable of for a given protocol, and how optimization looks from a systems language perspective. |
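
As a rough illustration of the varint point above, here is a minimal Python sketch of the encodings involved. It is hand-rolled for illustration and does not use the protobuf library; the helper names (`encode_varint`, `decode_varint`, `zigzag64`) are hypothetical. It assumes the "1, 4, or 8 bytes" alternatives correspond to a zigzag-encoded varint and to fixed-width 32-bit and 64-bit little-endian fields.

```python
import struct

MASK64 = (1 << 64) - 1


def encode_varint(value: int) -> bytes:
    """Base-128 varint of a 64-bit two's-complement value (illustrative helper)."""
    value &= MASK64  # a negative number becomes a large unsigned 64-bit value
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0) -> tuple:
    """Return (unsigned value, next position).

    Each iteration must inspect the current byte's continuation bit before it
    knows whether the next byte belongs to the same value.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def zigzag64(value: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return ((value << 1) ^ (value >> 63)) & MASK64


plain = encode_varint(-1)              # plain varint of -1
zigzag = encode_varint(zigzag64(-1))   # zigzag first, then varint
fixed32 = struct.pack("<i", -1)        # fixed-width 4 bytes, little-endian
fixed64 = struct.pack("<q", -1)        # fixed-width 8 bytes, little-endian

print(len(plain), len(zigzag), len(fixed32), len(fixed64))
```

The loop in `decode_varint` is one plausible reading of the data-dependency complaint: where the next field starts is unknown until every byte of the current varint has been examined. The fixed-width forms, by contrast, can be read with a single load at a known offset.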