by duality 2576 days ago
"Basically any fundamentally correct buffer encoded as message A will decode successfully as message B for any B."

This is incorrect. I suspect you're overextending proto3's treatment of unknown fields to include discarding incorrectly typed fields too. If A has field 1 typed as an int, and B has field 1 typed as a string, an A message with field 1 set will not parse as a B message. However, if the A message has no fields set, or sets a field number unknown to B, that could parse successfully with "leftover" unknown fields.

> If A has field 1 typed as an int, and B has field 1 typed as a string, an A message with field 1 set will not parse as a B message.

In the C++ reference implementation, which I wrote, this is not true. The field 1 with the wrong wire type would be treated as an unknown field, not an error.

It's possible that implementations in other languages have different behavior, but that would be a bug. The C++ implementation is considered the reference implementation that all others should follow.
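To make the "unknown field, not an error" behavior concrete, here is a minimal sketch of a wire-format reader in plain Python. It is a toy, not the C++ implementation: `parse` and its `schema` dict (field number to expected wire type) are hypothetical names invented for illustration, and groups are not handled.

```python
# Toy protobuf wire-format reader (illustrative only, not the real library).
VARINT, I64, LEN, I32 = 0, 1, 2, 5  # wire type numbers from the encoding docs

def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def parse(buf, schema):
    """schema maps field number -> expected wire type (hypothetical format)."""
    known, unknown = {}, []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == VARINT:
            value, end = read_varint(buf, pos)
        elif wire_type == LEN:
            length, start = read_varint(buf, pos)
            value, end = buf[start:start + length], start + length
        elif wire_type == I64:
            value, end = buf[pos:pos + 8], pos + 8
        elif wire_type == I32:
            value, end = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if end > len(buf):
            raise ValueError("field runs past end of buffer")
        pos = end
        if schema.get(number) == wire_type:
            known[number] = value
        else:
            # Wrong wire type or undeclared field number: keep it, don't fail.
            unknown.append((number, wire_type, value))
    return known, unknown

# Message A: field 1 is an int set to 150, which encodes as 08 96 01.
encoded_a = bytes([0x08, 0x96, 0x01])
# Message B declares field 1 as a string (length-delimited).
known, unknown = parse(encoded_a, {1: LEN})
```

With these assumptions, `known` comes back empty and the mistyped field lands in `unknown` with its value intact, so the parse succeeds.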

However, shereadsthenews' assertion is not quite right either. Specifically, a string field and a sub-message field both use the same wire type; essentially, the message is encoded into a byte string. So if message A has field 1 type string, containing some bytes that aren't a protobuf, and message B has field 1 type sub-message, then you'll get a parse error.
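A rough sketch of that string vs. sub-message case, again in plain Python rather than the real library. `length_delimited` and `looks_like_message` are hypothetical helpers; the walker only checks that the bytes can be split into well-formed fields and rejects groups outright, which is a simplification.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def length_delimited(field_number, payload):
    # key = (field_number << 3) | 2, then length, then the raw bytes.
    # Sketch limitation: single-byte key and length (field < 16, len < 128).
    return bytes([(field_number << 3) | 2, len(payload)]) + payload

def looks_like_message(buf):
    """Return True if buf can be walked as a sequence of well-formed fields."""
    pos = 0
    try:
        while pos < len(buf):
            key, pos = read_varint(buf, pos)
            number, wire_type = key >> 3, key & 7
            if number == 0:
                raise ValueError("field number 0 is invalid")
            if wire_type == 0:
                _, pos = read_varint(buf, pos)
            elif wire_type == 1:
                pos += 8
            elif wire_type == 2:
                length, pos = read_varint(buf, pos)
                pos += length
            elif wire_type == 5:
                pos += 4
            else:
                raise ValueError("groups / unknown wire types not handled here")
            if pos > len(buf):
                raise ValueError("field runs past end of buffer")
    except ValueError:
        return False
    return True

string_payload = b"hello"             # fine as a string, but not a protobuf
nested_payload = bytes([0x08, 0x01])  # a tiny message: field 1 = 1

# Both are framed identically on the wire: key 0x0A, length, payload.
framed_string = length_delimited(1, string_payload)
framed_nested = length_delimited(1, nested_payload)
```

Here `looks_like_message(string_payload)` is False because the third byte of "hello" reads as a tag with wire type 4, while `nested_payload` and the empty payload both walk cleanly.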

But it is indeed quite common that one message type parses successfully as another unrelated type.

> In the C++ reference implementation, which I wrote, this is not true. The field 1 with the wrong wire type would be treated as an unknown field, not an error.

Yeah I was about to say, protobuf C++ implementation will definitely treat it as an unknown field. I just had it do that a few days ago. :)

Ok but these messages are isomorphic on the wire:

  message enc {
    int32 foo = 1;
    SomeMessage bar = 2;
  }

  message dec {
    bool should_explode = 1;
    string why = 2;
  }
You can successfully decode the latter from an encoding of the former.

Minor nit, but not necessarily. For basically all values of SomeMessage, dec should fail to parse due to improperly encoded UTF-8 data in field 2 (modulo some proto2 vs. proto3 and language binding implementation differences).

Change field 2 to a bytes field instead of a string field and then yes.
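As a small illustration of the string vs. bytes difference, the sketch below uses Python's own UTF-8 decoder as a stand-in for whatever validation a given binding performs (that substitution is an assumption, and the payloads are made up). Whether a serialized sub-message passes depends on its bytes: a payload that happens to be pure ASCII would still decode.

```python
def is_valid_utf8(payload):
    """Stand-in for a binding's string-field check (an assumption)."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# SomeMessage with field 1 = 150 serializes as 08 96 01.
# 0x96 is a UTF-8 continuation byte with no lead byte, so it is rejected.
with_high_byte = bytes([0x08, 0x96, 0x01])

# SomeMessage with field 1 = 1 serializes as 08 01, which is plain ASCII.
ascii_only = bytes([0x08, 0x01])

string_field_ok = is_valid_utf8(with_high_byte)   # would a 'string' field accept it?
bytes_field_ok = True                             # a 'bytes' field takes any payload
```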

I should mention that I consider this a feature not a bug. The isomorphism permits an endpoint to use ‘bytes submessage_i_dont_need_to_decode’ to cheaply handle nested message structures that need to be preserved but not inspected, such as in a proxy application.
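The proxy use case can be sketched like this, in plain Python with hand-rolled helpers. `proxy_forward` is a hypothetical function that only understands a message consisting of a single length-delimited field 2; the point is that it slices the payload out and re-emits it without ever parsing what is inside.

```python
def encode_varint(value):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos):
    """Decode a varint at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def len_field(number, payload):
    """Frame payload as a length-delimited field (wire type 2)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(payload)) + payload

def proxy_forward(wire):
    """Read field 2 as opaque 'bytes' and re-emit it unchanged."""
    key, pos = decode_varint(wire, 0)
    if key != (2 << 3) | 2:
        raise ValueError("sketch only handles a lone length-delimited field 2")
    length, pos = decode_varint(wire, pos)
    opaque = wire[pos:pos + length]  # never inspected further
    return len_field(2, opaque), opaque

# Sender side: SomeMessage { field 1 = 150 } nested as field 2.
inner = encode_varint((1 << 3) | 0) + encode_varint(150)  # 08 96 01
original = len_field(2, inner)                            # 12 03 08 96 01

forwarded, opaque = proxy_forward(original)
```

Under these assumptions `forwarded` is byte-for-byte identical to `original`, so the downstream receiver can still decode field 2 as the full sub-message.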
True, but UTF-8 enforcement was pretty much absent from all implementations until proto3, and the empty string would be a special case.

Bool will decode from int’s encoding??

Yes, they are both varint encoded on the wire. Refer to https://developers.google.com/protocol-buffers/docs/encoding...
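A minimal sketch of why that works, decoding the same three bytes by hand in Python. The mapping "nonzero varint reads as true" is an assumption about how a binding interprets the value, not something taken from the thread; the variable names are made up.

```python
def decode_varint(buf, pos=0):
    """Decode a base-128 varint at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

# "foo = 1" set to 150 by the sender: key 0x08, then varint 96 01.
wire = bytes([0x08, 0x96, 0x01])

key, pos = decode_varint(wire)
field_number, wire_type = key >> 3, key & 7  # field 1, wire type 0 (varint)
raw, pos = decode_varint(wire, pos)

as_int = raw         # what a reader declaring an int field sees
as_bool = raw != 0   # what a reader declaring a bool field would see (assumed)
```

Nothing on the wire says whether the sender meant an int or a bool; only the field number and wire type 0 are recorded, so either declaration reads the value without complaint.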
Sigh.
I don't think this is the case, or at least, I'd expect it to be a bug.

Protocol Buffers should generally be non-destructive of the underlying data. That means even if it encounters the wrong wire type for a field, it should simply retain that value in the unknown field set rather than discard it.