Hacker News new | ask | show | jobs
by shereadsthenews 2555 days ago
I have some real beefs with this paper. Their number about how long it takes to encode or decode a protobuf is wrong and misleading. It seems to be based on this benchmark[1] which encodes a huge repeated int32 stuffed with random numbers. This does not resemble a key-value workload at all. In a KV system you would have something like key and value as bytes fields. It would be extremely simple. By contrast this "benchmark" is the worst possible case for protobuf because encoding random data as varint guarantees that the average field takes 9 bytes instead of 8 and hits the slowest possible path in the codec. The whole paper rests on this number, so the conclusions are crap. They are not even consistent with the practical performance of memcacheg which the authors should have been very familiar with. 1: https://github.com/hq6/ProtobufBenchmark/blob/master/Benchma...
4 comments

> In a KV system you would have something like key and value as bytes fields.

Isn't that kind of custom interchange format what they're complaining about when they say that remote stores just push the complexity to the client?

In the kv stores I’ve used the client and the server are mostly the same process, or started out that way.

I don’t know many sane people who want to use the kv store as a system of record, and even the people who expect it to be exhaustive (all possible keys) make me doubt their sanity for that and other reasons.

So far, everyone has a bit of code that looks for a key and, if it is missing, performs the work necessary to build the payload. The code that creates the payload is never more than a couple function calls away from the retrieval code.

I don't know but how does complaining that it takes a microsecond to encode 256 random numbers lead to the conclusion that remote KV caches are unworkable? That's just a non sequitur.
> a huge repeated int32 stuffed with random numbers.

Is this not exactly the behavior you'd see if you used UUIDs as your keys? Asking honestly.

If you knew your keys were uniformly distributed you would never use a varint encoding to store them because you'd stand a very high chance of encoding them into a field longer than a primitive integer. A varint can only hold 28 bits of number in the first four bytes, so your odds of getting 5-byte output is 15/16 i.e. very likely. If you really had to encode them I would use either two fixed64 fields, the experimental fixed128 type, or 'bytes' with the exactly-36-byte-long string representation of a UUID. In no case could I imagine packing a huge vector of random numbers into a protobuf int32 field.
The point is that Protobuf has variable length ints by default. That’s an optimization for many common use cases, but slower and larger for random data, including GUIDs. Use Protobuf’s fixed ints for those.
Not to mention protobufs have awful performance compared to more modern alternatives in use today like Flatbuffers, Thrift, Cap'n Proto, SBE.

In the case of Google's own Flatbuffers, the layout is going to be far more performant.

I think it's irrelevant. In fact the protobuf might be the best choice. If it was just defined as so:

  bytes key = 1;
  bytes value = 2;
... your overhead can be as little as 4 bytes and you can alias the memory of the key and value (using a type like std::string_view) instead of copying it. It takes a few nanoseconds to decode a message like this.
> protobufs have awful performance compared to more modern alternatives in use today like Flatbuffers, Thrift, Cap'n Proto, SBE

Do you have a source on that? Genuinely curious.

Hi, I wrote Protobuf v2 (the version everyone uses) and Cap'n Proto.

I don't know if I'd say Protobuf has "awful" performance. It's certainly much better that text-based formats like JSON. But the format is rather branch-y. You have to process it byte-by-byte, because e.g. integers are encoded in a variable-width encoding where each byte contains 7 bits of data plus 1 bit to indicate if this is the last byte. This results in a compact encoding, but takes a lot of cycles to encode and decode. Moreover, since everything is variable-width, in order to find any one field of the message, you must scan through all previous fields, parsing them one by one.

Cap'n Proto, FlatBuffers, and SBE all use "zero-copy" encodings, meaning the data is laid out on the wire in a format that is easy for a CPU to use directly. This means, for example, that integers are fixed-width, and fields are located at fixed offsets. This is must faster to parse (or even use in-place without parsing at all), but does result in somewhat larger encodings. (But then, you can always layer on independent compression when bandwidth matters more than CPU.)

My understanding is that Thrift is closer to Protobuf and contemporaneous with it, so I don't know why GP included it the list.

For simple protocols protobuf decoding has no taken branches. I.e. if you only use the first 15 field numbers (all your tags are 1 byte) and if all the types are the expected types, and if all the variable-length items are < 128 bytes long then you can decode the message without taking any branches. In C++. Most of the other languages have simpler and slower codecs.

This is the hot path in C++[1]. A really large amount of work has gone into protobuf C++ performance in the last 3 years or so.

1: https://github.com/protocolbuffers/protobuf/blob/master/src/...

And all your integer fields must be < 128, right?

Yes, I suppose the branches in Protobuf can be pretty predictable. Still, you do generally have to examine each byte individually.

Sure. In this specific case of a kv store it's hard to imagine how to simplify it dramatically from protobuf. As a proto you might have: tag-length-key-tag-length-value. Instead you could store the key and value lengths in host format using 8-16 bytes: length-length-key-value. It's not _dramatically_ faster to decode this, and you traded away extensibility to get a marginal speedup.
Hit 'y' before copying the link; the line numbers have already shifted.
Which version of Thrift? Apache Thrift looks roughly the same as Protobuf on our benchmarks. Perhaps this is fbthrift?
> It seems to be based on this benchmark[1] which encodes a huge repeated int32 stuffed with random numbers

However, if your entire world consists of pushing around images to GPUs/TPUs, that's pretty much the only workload you ever need to optimize from your larger ops system.

Edit: not sure that this particular use case is exactly what's going on in that benchmark, now that I look more closely. Also, I wouldn't want to assert that it's a use case that matters a ton.

Encoding an image as a repeated sequence of variable-length integers would obviously be a poor choice.
Except.. that's pretty much how raw images are represented. Substituting "integers" for "bytes" if you want channel resolution, or leaving integers if you have 32bit pixel depth.

https://github.com/nothings/stb/blob/master/stb_image.h#L120

https://wiki.libsdl.org/SDL_Surface

You overlooking the critical term: variable-length.
is anybody talking to GPU through memcached GETs/PUTs?