by ghostwriter 1659 days ago
https://reasonablypolymorphic.com/blog/protos-are-wrong/inde...

For those who still want / need binary protocols and schemas, look at FlatBuffers or Cap'n Proto instead. At least they are capable of representing domain structures properly.

2 comments

Sorry, I'm the author of Cap'n Proto and I think that article is full of shit.

My previous commentary: https://news.ycombinator.com/item?id=18190005

Thanks for Cap'n Proto. It's better than ProtoBuf, but I prefer FlatBuffers even more. I think the article clearly points out issues that the wider community, used to the conventional type systems of mainstream languages, is not fully aware of. And I disagree with your comments. Firstly, I don't like that you are labelling the author of the article as a "PL design theorist who doesn't have a clue" (my paraphrase):

> This article appears to be written by a programming language design theorist who, unfortunately, does not understand (or, perhaps, does not value) practical software engineering.

I'm not the author, but they mention their prior industrial experience with protobufs at Google, among other unnamed places.

I'm not a PL theorist either, and I see that you don't fully understand the problems of composability, compatibility, and versioning and are too eager to dismiss them based on your prior experience with inferior type systems. And here's why I think it is the case:

> This is especially true when it comes to protocols, because in a distributed system, you cannot update both sides of a protocol simultaneously. I have found that type theorists tend to promote "version negotiation" schemes where the two sides agree on one rigid protocol to follow, but this is extremely painful in practice: you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code. Inevitably, developers are pushed towards hacks in order to avoid protocol changes, which makes things worse.

You are conflating your experience with particular conventional tooling with a general availability of superior type systems and toolings out there. There's high demand for utilising their properties in protocol design today, yet most of the currently popular protocols hamper type systems for no good reason (no productivity gain, no performance gain, no resource utilisation gain).

Version negotiation is not the only option available to a protocol designer. It is possible to use implicit-for-client and explicit-for-developer strategies for schema migration. It is also possible to semi-automate the inference of those strategies. Example: [1]

> This seems to miss the point of optional fields. Optional fields are not primarily about nullability but about compatibility. Protobuf's single most important feature is the ability to add new fields over time while maintaining compatibility.

There are at least two ways to achieve compatibility, and optional fields that expand a domain type to the lowest common denominator of all possibilities are the wrong one. Schema evolution via unions, versioning, and migrations is the proper approach: it allows strict resolution of compatibility issues at whatever level of granularity (distinct code paths) you like.

> Real-world practice has also shown that quite often, fields that originally seemed to be "required" turn out to be optional over time, hence the "required considered harmful" manifesto. In practice, you want to declare all fields optional to give yourself maximum flexibility for change.

This is false. In practice I want schema versioning and deprecation policies, not an ever-growing expansion of the domain into a blob of all-optional data.

> It's that way because the "oneof" pattern long-predates the "oneof" language construct. A "oneof" is actually syntax sugar for a bunch of "optional" fields where exactly one is expected to be filled in.

This is not true either, and it doesn't matter which pattern predates which. Tagged unions are neither a language construct nor syntax sugar; they are a property of type algebra, where you have union and product compositions. Languages that implement type algebra don't do it just to add another fancy construct; they do it to benefit from the mathematical foundations of these concepts.
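To make the difference between the two encodings concrete, here is a minimal sketch in Python. The `Contact` example and all names in it are hypothetical, made up for illustration; they come from neither Protobuf nor the article.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Encoding A: "a bunch of optional fields where exactly one is expected".
# Nothing in the type stops both, or neither, from being set.
@dataclass(frozen=True)
class ContactLoose:
    email: Optional[str] = None
    phone: Optional[str] = None


# Encoding B: a tagged union. Each variant is its own type, so a value
# is always exactly one of them.
@dataclass(frozen=True)
class Email:
    address: str


@dataclass(frozen=True)
class Phone:
    number: str


Contact = Union[Email, Phone]


def describe(contact: Contact) -> str:
    # Dispatch on the variant; there is no "both set" or "none set" case.
    if isinstance(contact, Email):
        return f"email:{contact.address}"
    if isinstance(contact, Phone):
        return f"phone:{contact.number}"
    raise TypeError(f"unknown variant: {contact!r}")


def count_set_fields(contact: ContactLoose) -> int:
    # With encoding A, the "exactly one" rule has to be checked by hand.
    return sum(v is not None for v in (contact.email, contact.phone))
```

With encoding A, `ContactLoose()` and `ContactLoose("a@b.c", "555")` both construct fine and only a runtime check like `count_set_fields` catches them; with encoding B those states cannot be written down at all.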

> How do you make this change without breaking compatibility?

You version it, and migrate over time at your own pace without bothering your clients too often [1]

[1] https://github.com/typeable/schematic#migrations
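A rough sketch of what "version it and migrate" can look like, written in Python rather than with the linked library. This is not schematic's API; the `UserV1` / `UserV2` types, the `version` key, and the name-splitting rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class UserV1:
    name: str


@dataclass(frozen=True)
class UserV2:
    first_name: str
    last_name: str


def migrate_v1_to_v2(old: UserV1) -> UserV2:
    # The migration is an ordinary, testable function living in one place.
    first, _, last = old.name.partition(" ")
    return UserV2(first_name=first, last_name=last)


def decode(payload: Dict[str, Any]) -> UserV2:
    # Old clients keep sending version 1; the boundary upgrades it, so the
    # rest of the code only ever sees the latest, fully required shape.
    version = payload.get("version")
    if version == 1:
        return migrate_v1_to_v2(UserV1(name=payload["name"]))
    if version == 2:
        return UserV2(payload["first_name"], payload["last_name"])
    raise ValueError(f"unsupported schema version: {version!r}")
```

Here `decode({"version": 1, "name": "Ada Lovelace"})` yields the same `UserV2` value a version 2 client would have sent, and dropping support for version 1 later means deleting one branch and one function.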

> I see that you don't fully understand the problems of composability, compatibility, and versioning and are too eager to dismiss them based on your prior experience with inferior type systems.

> You are conflating your experience with particular conventional tooling with a general availability of superior type systems and toolings out there.

You literally quoted my project as one of your two examples of superior systems and now you're telling me I don't understand how superior systems work because I have no experience with them?

These are not mutually exclusive things, as the superiority of a system is a multi-dimensional metric. I quoted cap'n proto as an alternative to protobuf that I would definitely choose over any protobuf, because in my book it does at least a few things better, namely the bits related to immutability & zero-copying, and random access. But at the same time I do not like and do not agree with your field optionality stance, as I think it is based on a false premise that universal optionality is the only viable path towards compatibility. I will cite the original article regarding the matter to clarify this point:

> protobuffers achieve their promised time-traveling compatibility guarantees by silently doing the wrong thing by default. Of course, the cautious programmer can (and should) write code that performs sanity checks on received protobuffers. But if at every use-site you need to write defensive checks ensuring your data is sane, maybe that just means your deserialization step was too permissive. All you’ve managed to do is decentralize sanity-checking logic from a well-defined boundary and push the responsibility of doing it throughout your entire codebase.

This approach doesn't free you as a developer from maintaining multiple code paths, which you claim to avoid in your older comments ("you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code").

The code paths are still there; they are now intertwined with your business logic as conditional checks on field presence, at every call site that uses the schema.
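A small Python sketch of the two placements of that check. The `Order` type and its fields are hypothetical, chosen only to show where the presence check ends up.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


# Permissive decoding: every field is optional, so each use site re-checks.
@dataclass(frozen=True)
class OrderLoose:
    item_id: Optional[str] = None
    quantity: Optional[int] = None


def total_loose(order: OrderLoose, unit_price: int) -> int:
    if order.quantity is None:  # presence check mixed into business logic
        raise ValueError("quantity missing")
    return order.quantity * unit_price


# Strict decoding: required fields are checked once, at the boundary.
@dataclass(frozen=True)
class Order:
    item_id: str
    quantity: int


def parse_order(raw: Dict[str, Any]) -> Order:
    missing = [key for key in ("item_id", "quantity") if key not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return Order(item_id=raw["item_id"], quantity=raw["quantity"])


def total(order: Order, unit_price: int) -> int:
    # No presence checks here: an Order cannot exist without a quantity.
    return order.quantity * unit_price
```

Every additional function that reads `OrderLoose.quantity` needs its own `None` branch, while functions taking `Order` rely on the single check in `parse_order`.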

That's one of the reasons why I prefer flatbuffers over cap'n proto when I have a choice, and it is the reason why I think that you are not fully aware of the issues that stem from the choices of protobuf and that are clearly manifested in ecosystems that model network communications via advanced type systems.

In fact, this comment from your linked thread suggests a similar idea - advanced type systems can provide strict schema negotiation in a semi-automated way, at a fraction of the effort required to maintain schemas with all-optional fields - https://news.ycombinator.com/item?id=18201601

The "required fields considered harmful" opinion was a hard lesson learned through real experience -- the experience of repeated outages of large, complex systems like Google Search, GMail, etc. Certainly, prior to this experience, everybody assumed required fields were a good idea.

More abstractly, the hard lesson was: In a large distributed system, the site of use is the only reasonable place to do data validation. If you do it anywhere else, you will create a more brittle system that can't handle changes. The reason is pretty straightforward, but is more of a human reason than a mathematical one: when someone decides to modify a protocol for some new feature, they know they obviously have to modify the code that produces and consumes the protocol in order to implement the feature. But if they have to update a bunch of other places too, that's at best more work, and at worst easily forgotten. It's really important that any part of the system that is just a middleman will be agnostic to the data and pass it through unmodified -- even if the data is based on a newer version of the schema than the middleman is aware of.
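The middleman point can be illustrated with a minimal Python sketch. The message shape, field names, and backend names are hypothetical, and plain dicts stand in for encoded messages.

```python
from typing import Any, Dict, Tuple

# Fields the middleman was built against (an older schema version).
OLD_SCHEMA_FIELDS = {"region", "query"}


def strict_middleman(message: Dict[str, Any]) -> Dict[str, Any]:
    # Validates against its own copy of the schema, so any field added
    # later breaks it until the middleman is redeployed.
    unknown = set(message) - OLD_SCHEMA_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return message


def agnostic_middleman(message: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    # Reads only the one field it needs for routing and forwards the rest
    # untouched, including fields it has never heard of.
    backend = "eu" if message.get("region") == "eu" else "default"
    return backend, message
```

A producer that starts sending a new `safe_search` field keeps working through `agnostic_middleman`, while `strict_middleman` rejects the message even though neither function uses that field.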

So yes, you actually want the validation to be in your business logic. But you don't want it to complicate that business logic too much. Most of the time, optional fields (with default values) provide the right balance between making changes easy without making code ugly. Sometimes, a more drastic change -- like declaring a new version of the protocol and writing translation layers -- is a good idea, but this is an expensive step that you want to do rarely.

Now, obviously you don't agree with this. But your arguments sound like they are coming from a place of intuition, not experience. That's fine, intuition is critical to innovation. But you can't go around claiming your intuition is "superior" without proving it out in practice. Intuition is always based on a simplified model in your head, and the real world often doesn't work like you think it will. I assure you you don't know anything I don't; in decades of working on this stuff I've heard all the ideas. The only way to prove yourself right is to actually build systems your way and show success in the field. Of course, there will likely never be a definitive proof that one idea or the other is superior, only anecdotal experience. However, the fact that a large majority of successful distributed systems today are built on Protobuf or a similar model to Protobuf suggests that experience leans heavily in that model's favor.

> The "required fields considered harmful" opinion was a hard lesson learned through real experience -- the experience of repeated outages of large, complex systems like Google Search, GMail, etc. Certainly, prior to this experience, everybody assumed required fields were a good idea.

There is a common trait in the systems you mentioned: they generally allow for a permissive representation of domain data where many of the fields can be omitted or replaced by zero-values / defaults, because most of those fields, by their nature, concern things that are optional and tolerant of noise and accidental mistakes (percentile precision). How much A/B test data and user tracking stats do gmail / google search encode and process as protobuf?

If you compare it to a simulation engine's data stream or a collaborative BIM / CAD model, you will find that almost everything that travels over the network in these systems is required to be unambiguous and strictly consistent at the sending and receiving sites. The binary representations of physical relations in these models are not just scalar values that can tolerate a default value assigned by a protocol parser upon receiving a missing field. The scalar values appear at UI rendering / output formatting. But most of the time you deal with relations and equations, and you need to be able to differentiate between missing-by-intent and missing-by-mistake cases. Zero-values will not help either, because a zero value itself can be represented in multiple ways: depending on the model being evaluated and the context it's evaluated in, the values can legitimately come in different precisions, units, ratios (discrete vs dense) and so on, and those are not distinct fields; their combinations are often mutually exclusive. This is not the kind of validation you want to delegate to call sites implemented in different languages and maintained by different teams with different levels of technical capacity. The invariants and constraints have to be encoded into the protocol, and required fields are a low-level "must have" part of it.
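As a sketch of the missing-by-intent versus missing-by-mistake distinction, here is a hypothetical `clearance` field read two ways in Python. The field names, the `kind` tag, and the dict-based wire shape are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Dict, Union


@dataclass(frozen=True)
class Unset:
    # The sender states explicitly that there is no value, and why.
    reason: str


@dataclass(frozen=True)
class Length:
    value: float
    unit: str


def read_with_default(raw: Dict[str, Any]) -> float:
    # Default-on-missing: an absent field and a real 0.0 look identical.
    return raw.get("clearance_mm", 0.0)


def read_strict(raw: Dict[str, Any]) -> Union[Length, Unset]:
    # The field itself is required; "no value" must be said explicitly.
    if "clearance" not in raw:
        raise ValueError("clearance missing: treated as a sender mistake")
    field = raw["clearance"]
    if field["kind"] == "unset":
        return Unset(field["reason"])
    if field["kind"] == "length":
        return Length(float(field["value"]), field["unit"])
    raise ValueError(f"unknown clearance kind: {field['kind']!r}")
```

`read_with_default({})` and `read_with_default({"clearance_mm": 0.0})` return the same number, so the receiver cannot tell a dropped field from a genuine zero; `read_strict` rejects the first case and keeps the unit attached in the second.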

That’s a good idea. I bet the creator of cap’n proto would tell us what a bad idea this is. What does kentonv have to say?

> I bet the creator of cap’n proto would tell us what a bad idea this is

You can draw your own conclusion based on the provided arguments and some additional exploratory work. Someone else's opinion is good but optional and is not always as insightful as your own discoveries.

Heyo