by jonbronson 2364 days ago
"The solution is as follows:

Make all fields in a message required. This makes messages product types."

Except it also breaks backwards compatibility, one of the most powerful and sought-after features of protobufs.


> Except it also breaks backwards compatibility, one of the most powerful and sought-after features of protobufs.

It doesn't have to. Just add row types to handle unknown content, i.e. if an intermediary knows only of fields foo and bar, it can process any data with those fields if given a type like "type SomeRecord = { foo : int, bar : string | r }", where 'r' represents the remainder of the record.
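
A rough sketch of that idea in Python. Python has no row polymorphism, so this only imitates the runtime behaviour: the intermediary names foo and bar and carries everything else along in a remainder that plays the role of 'r'. The names `decode`, `encode`, and `rest` are hypothetical, and plain dicts stand in for whatever the wire format really is.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SomeRecord:
    foo: int
    bar: str
    # Plays the role of the row variable `r`: fields this code does not know.
    rest: Dict[str, Any] = field(default_factory=dict)


def decode(raw: Dict[str, Any]) -> SomeRecord:
    """Split a raw record into the known fields and the remainder."""
    rest = {k: v for k, v in raw.items() if k not in ("foo", "bar")}
    return SomeRecord(foo=raw["foo"], bar=raw["bar"], rest=rest)


def encode(rec: SomeRecord) -> Dict[str, Any]:
    """Re-emit the record, passing unknown fields through untouched."""
    return {**rec.rest, "foo": rec.foo, "bar": rec.bar}


# An intermediary that only understands foo and bar still forwards baz.
incoming = {"foo": 1, "bar": "x", "baz": [1, 2, 3]}
rec = decode(incoming)
rec.foo += 1
outgoing = encode(rec)
```

A real row-typed language would check statically that the intermediary never touches anything inside the remainder; here that is only a convention.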

The article's criticisms are valid and there are typed solutions to most of the objections that have been raised against it.

I'm not sure that's simple enough to be a "just", but in any case the primary problem is the other direction. If I add `required baz: int` to my service's definition of a protobuf, all protobufs that have ever been generated before become invalid because they don't contain a value for baz.

That fact doesn't change if you eschew types. Backward-compatible schema evolution has rules.

Right, that's the point. The article's suggestion to "make all fields in a message required" fundamentally misunderstands the issues at hand, because no matter how appealing it is from a type theory perspective, following that suggestion would make it impossible to ever add a field in a backwards compatible manner.

> The article's suggestion to "make all fields in a message required" fundamentally misunderstands the issues at hand, because no matter how appealing it is from a type theory perspective, following that suggestion would make it impossible to ever add a field in a backwards compatible manner.

You absolutely could in multiple ways:

1. You make every accepted product type have a row type at your service interface if you expect schema evolution.

2. If you have to add a field unexpectedly, i.e. where you did not have a row type, then you must deprecate the old API. If this seems onerous to you, then your service infrastructure is probably insufficiently flexible.

Option 1 seems like it defeats the point. If you're going to declare a field with a more permissive type than currently allowed, aren't you just hacking weak types back into your strong type system?

Option 2... look. I've seen a lot of API deprecations, across multiple teams in multiple companies, and every one of them was very onerous in ways that had little to do with the service infrastructure. If you've done easy API deprecations, more power to you, but I don't think your experience is representative.

Protocol buffers already do that; serialized fields that are not recognized by an older message definition are parsed and can be accessed via the "unknown fields" API, exactly as "r" above. Intermediaries can pass these through trivially, or inspect them to see what they didn't understand.
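
To make the pass-through concrete, here is a hand-rolled sketch of a reader that keeps unrecognized fields as raw bytes. It is not the protobuf library's API: `read_varint` and `split_fields` are hypothetical helpers, and only two wire types (varint and length-delimited) are handled, following the documented key layout of `(field_number << 3) | wire_type`.

```python
from typing import Dict, Set, Tuple


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_fields(buf: bytes, known: Set[int]) -> Tuple[Dict[int, object], bytes]:
    """Return (known field values, raw bytes of every unrecognized field)."""
    values: Dict[int, object] = {}
    unknown = bytearray()
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x7
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited (strings, bytes, submessages)
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:
            values[field_number] = value
        else:
            unknown += buf[start:pos]  # keep the bytes verbatim for re-emission
    return values, bytes(unknown)


# Field 1 = varint 150, field 2 = "hi", field 3 = varint 7 (unknown to this reader).
message = bytes([0x08, 0x96, 0x01, 0x12, 0x02]) + b"hi" + bytes([0x18, 0x07])
values, unknown = split_fields(message, known={1, 2})
```

An intermediary built this way can append `unknown` to whatever it re-serializes, so field 3 survives the hop even though this reader has no name or type for it.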

The problem with making fields required is that older serialized protocol buffers parsed by newer message definitions may be missing newly added required fields, which will break things.
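
A toy illustration of that failure mode, with dicts standing in for messages. `parse`, `SCHEMA_V1`, and `SCHEMA_V2` are made-up names for this sketch, not anything from protobuf itself.

```python
from typing import Any, Dict, Tuple


def parse(raw: Dict[str, Any], required: Tuple[str, ...]) -> Dict[str, Any]:
    """Reject any message that lacks one of the schema's required fields."""
    missing = [name for name in required if name not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return raw


SCHEMA_V1 = ("foo", "bar")
SCHEMA_V2 = ("foo", "bar", "baz")  # v2 adds a required baz

stored_long_ago = {"foo": 1, "bar": "x"}  # written by a v1 producer

parse(stored_long_ago, SCHEMA_V1)  # accepted under the old definition

try:
    parse(stored_long_ago, SCHEMA_V2)
    old_data_still_valid = True
except ValueError:
    old_data_still_valid = False  # every pre-v2 message is now rejected
```

Nothing about the old data changed; only the reader's definition did, which is why the breakage shows up at parse time rather than at write time.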

Protobuf does not do this via a typed interface, but via runtime checking.

You can't statically typecheck deserialized data. You must validate that the deserialized value matches the schema, and you can only do so at runtime.

In other words, proto has a typed interface, but you must runtime check that a given bag of bytes conforms to that typed interface.

This is true for any I/O.
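
A small sketch of what "runtime check at the boundary, typed interface afterwards" can look like. JSON is used here purely as an assumed stand-in for the byte format, and `decode` is a hypothetical helper; the shape of the argument is the same for any encoding.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SomeRecord:
    foo: int
    bar: str


def decode(payload: bytes) -> SomeRecord:
    """Check the bytes against the schema once; callers then get a typed value."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    foo, bar = data.get("foo"), data.get("bar")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(foo, int) or isinstance(foo, bool):
        raise ValueError("foo must be an int")
    if not isinstance(bar, str):
        raise ValueError("bar must be a string")
    return SomeRecord(foo=foo, bar=bar)


record = decode(b'{"foo": 1, "bar": "x"}')

try:
    decode(b'{"foo": "oops", "bar": "x"}')
    rejected = False
except ValueError:
    rejected = True
```

Everything downstream of `decode` works with `SomeRecord` and never re-checks the fields; the only place a bad bag of bytes can be caught is that one call.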

> You must validate that the deserialized value matches the schema, and you can only do so at runtime

I assume you mean serialized data, not deserialized. And yes, deserializing includes type checking. The point is that this happens once, and a separate API for dynamic data shouldn't be needed.

What do you mean by a separate API for dynamic data?

The data under discussion isn't "dynamic", it's still static, it just isn't known to the schema in question at runtime (since it's only known to a different schema). That means you can't access it by name, since the field names aren't known.

The lesson is: when you start wrong, you stay wrong.