HN Mirror

by kentonv 2555 days ago

Hi, I wrote Protobuf v2 (the version everyone uses) and Cap'n Proto.

I don't know if I'd say Protobuf has "awful" performance. It's certainly much better than text-based formats like JSON. But the format is rather branch-y. You have to process it byte-by-byte, because e.g. integers are encoded in a variable-width encoding where each byte contains 7 bits of data plus 1 bit to indicate if this is the last byte. This results in a compact encoding, but takes a lot of cycles to encode and decode. Moreover, since everything is variable-width, in order to find any one field of the message, you must scan through all previous fields, parsing them one by one.
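To make the byte-by-byte cost concrete, here is a minimal Python sketch of decoding one variable-width integer of the kind described above. The function name `decode_varint` is made up for illustration, and this is not the code any Protobuf runtime actually uses; it only shows that every byte needs a mask, a shift, and a test of the continuation bit.

```python
def decode_varint(buf, pos=0):
    """Decode one base-128 varint starting at buf[pos].

    Returns (value, next_pos). Illustrative only; no bounds or overflow checks.
    """
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least-significant group first.
        result |= (byte & 0x7F) << shift
        # The high bit is clear on the last byte of the integer.
        if not (byte & 0x80):
            return result, pos
        shift += 7


value, next_pos = decode_varint(b"\x96\x01")
```

The loop has a data-dependent exit on every byte, and the position of the next field is only known once the current one has been fully decoded.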

Cap'n Proto, FlatBuffers, and SBE all use "zero-copy" encodings, meaning the data is laid out on the wire in a format that is easy for a CPU to use directly. This means, for example, that integers are fixed-width, and fields are located at fixed offsets. This is much faster to parse (or even use in-place without parsing at all), but does result in somewhat larger encodings. (But then, you can always layer on independent compression when bandwidth matters more than CPU.)
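As a rough illustration of fixed-width fields at fixed offsets, the sketch below reads one field straight out of a buffer without touching the fields before it. The record layout here is hypothetical and is not the real wire format of Cap'n Proto, FlatBuffers, or SBE; the names `RECORD` and `read_timestamp` are assumptions made for the example.

```python
import struct

# Hypothetical little-endian layout, invented for this example:
#   offset 0: uint32 id
#   offset 4: uint32 count
#   offset 8: uint64 timestamp
RECORD = struct.Struct("<IIQ")
TIMESTAMP_OFFSET = 8


def read_timestamp(buf, base=0):
    """Read only the timestamp field; earlier fields are never parsed."""
    return struct.unpack_from("<Q", buf, base + TIMESTAMP_OFFSET)[0]


buf = RECORD.pack(7, 3, 1234567890123)
timestamp = read_timestamp(buf)
```

The cost is size: the record above always takes 16 bytes, even when every value would fit in a single byte under a variable-width encoding.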

My understanding is that Thrift is closer to Protobuf and contemporaneous with it, so I don't know why GP included it in the list.

shereadsthenews 2555 days ago

For simple protocols protobuf decoding has no taken branches. I.e. if you only use the first 15 field numbers (all your tags are 1 byte) and if all the types are the expected types, and if all the variable-length items are < 128 bytes long then you can decode the message without taking any branches. In C++. Most of the other languages have simpler and slower codecs.
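The "first 15 field numbers" condition follows from how Protobuf builds a tag: the field number shifted left by 3 bits, combined with a 3-bit wire type, then written as a varint. A small Python sketch of that arithmetic follows; the helper names `tag_value` and `varint_size` are made up for illustration.

```python
def tag_value(field_number, wire_type):
    """Protobuf tag: field number in the upper bits, wire type in the low 3 bits."""
    return (field_number << 3) | wire_type


def varint_size(n):
    """Number of bytes needed to encode n as a base-128 varint."""
    size = 1
    while n >= 0x80:
        n >>= 7
        size += 1
    return size


LENGTH_DELIMITED = 2  # wire type used for strings, bytes, and nested messages

one_byte = varint_size(tag_value(15, LENGTH_DELIMITED))
two_bytes = varint_size(tag_value(16, LENGTH_DELIMITED))
```

Field 15 gives a tag value below 128, so it fits in one byte; field 16 gives a value of at least 128 and needs a second byte. The same 128 threshold explains the length condition on variable-length items.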

This is the hot path in C++[1]. A really large amount of work has gone into protobuf C++ performance in the last 3 years or so.

1: https://github.com/protocolbuffers/protobuf/blob/master/src/...

kentonv 2555 days ago

And all your integer fields must be < 128, right?

Yes, I suppose the branches in Protobuf can be pretty predictable. Still, you do generally have to examine each byte individually.

shereadsthenews 2555 days ago

Sure. In this specific case of a kv store it's hard to imagine how to simplify it dramatically from protobuf. As a proto you might have: tag-length-key-tag-length-value. Instead you could store the key and value lengths in host format using 8-16 bytes: length-length-key-value. It's not _dramatically_ faster to decode this, and you traded away extensibility to get a marginal speedup.
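For reference, the length-length-key-value alternative can be sketched in a few lines of Python. This assumes two 32-bit lengths in host byte order (8 bytes of header, the low end of the range mentioned above); the names `HEADER`, `encode_kv`, and `decode_kv` are hypothetical.

```python
import struct

# Two unsigned 32-bit lengths in host byte order: 8 bytes, no tags.
HEADER = struct.Struct("=II")


def encode_kv(key: bytes, value: bytes) -> bytes:
    """Layout: key length, value length, key bytes, value bytes."""
    return HEADER.pack(len(key), len(value)) + key + value


def decode_kv(buf, pos=0):
    """Return (key, value, next_pos) for the record starting at buf[pos]."""
    key_len, value_len = HEADER.unpack_from(buf, pos)
    key_start = pos + HEADER.size
    value_start = key_start + key_len
    end = value_start + value_len
    return bytes(buf[key_start:value_start]), bytes(buf[value_start:end]), end


record = encode_kv(b"k", b"value")
key, value, next_pos = decode_kv(record)
```

With no tags there is no way to add a third field later without breaking existing readers, which is the extensibility trade-off described above, and host byte order also ties the data to machines with the same endianness.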

kentonv 2555 days ago

Sure, I was speaking in general, not specifically about the key-value case.

I think most serialization frameworks are likely to be overkill for such a use case, spending more time on setup than actual parsing.

Also note that storing the value (and maybe the key) with proper alignment might make it easier to use the data in-place, saving a copy.

yencabulator 2550 days ago

Hit 'y' before copying the link; the line numbers have already shifted.
