Hacker News
by kentonv 1659 days ago
> I see that you don't fully understand the problems of composability, compatibility, and versioning and are too eager to dismiss them based on your prior experience with inferior type systems.

> You are conflating your experience with particular conventional tooling with a general availability of superior type systems and toolings out there.

You literally quoted my project as one of your two examples of superior systems and now you're telling me I don't understand how superior systems work because I have no experience with them?


These are not mutually exclusive, as the superiority of a system is a multi-dimensional metric. I quoted Cap'n Proto as an alternative to protobuf that I would definitely choose over protobuf, because in my book it does at least a few things better, namely the bits related to immutability, zero-copying, and random access. But at the same time I do not like and do not agree with your stance on field optionality, as I think it is based on the false premise that universal optionality is the only viable path towards compatibility. I will cite the original article on the matter to clarify this point:

> protobuffers achieve their promised time-traveling compatibility guarantees by silently doing the wrong thing by default. Of course, the cautious programmer can (and should) write code that performs sanity checks on received protobuffers. But if at every use-site you need to write defensive checks ensuring your data is sane, maybe that just means your deserialization step was too permissive. All you’ve managed to do is decentralize sanity-checking logic from a well-defined boundary and push the responsibility of doing it throughout your entire codebase.

This approach doesn't free you as a developer from having to maintain multiple code paths, which is what you claimed to avoid in your older comments ("you end up needing to maintain parallel code paths, leading to ugly and hard-to-test code").

The code paths are still there; they are now intertwined with your business logic as conditional checks on field presence, at every call site that uses the schema.
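A minimal sketch of what this looks like in practice. All names here (`OrderMsg`, `Order`, `parse_order`, and the two call sites) are hypothetical and not taken from any real schema; the all-optional dataclass only stands in for what a permissive decoder might hand you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrderMsg:
    # Hypothetical all-optional message, as a permissive decoder would produce it.
    order_id: Optional[str] = None
    quantity: Optional[int] = None


def total_price(msg: OrderMsg, unit_price: int) -> int:
    # Call site 1: has to decide for itself what a missing quantity means.
    if msg.quantity is None:
        raise ValueError("quantity missing")
    return msg.quantity * unit_price


def audit_line(msg: OrderMsg) -> str:
    # Call site 2: repeats the presence checks, with a different policy.
    if msg.order_id is None or msg.quantity is None:
        return "incomplete order"
    return f"{msg.order_id}: {msg.quantity}"


@dataclass(frozen=True)
class Order:
    # Strict type: once constructed, both fields are known to be present.
    order_id: str
    quantity: int


def parse_order(msg: OrderMsg) -> Order:
    # One check at the boundary; code that takes an Order needs no further checks.
    if msg.order_id is None or msg.quantity is None:
        raise ValueError("order_id and quantity are required")
    return Order(msg.order_id, msg.quantity)
```

The two call sites already disagree on what a missing field means (one raises, one degrades quietly), which is the kind of drift the boundary check is meant to prevent.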

That's one of the reasons why I prefer FlatBuffers over Cap'n Proto when I have a choice, and it is the reason why I think that you are not fully aware of the issues that stem from protobuf's choices and that are clearly manifested in ecosystems that model network communications via advanced type systems.

In fact, this comment from your linked thread suggests a similar idea - advanced type systems can provide strict schema negotiation in a semi-automated way, at a fraction of the effort required to maintain schemas with all-optional fields - https://news.ycombinator.com/item?id=18201601

The "required fields considered harmful" opinion was a hard lesson learned through real experience -- the experience of repeated outages of large, complex systems like Google Search, GMail, etc. Certainly, prior to this experience, everybody assumed required fields were a good idea.

More abstractly, the hard lesson was: In a large distributed system, the site of use is the only reasonable place to do data validation. If you do it anywhere else, you will create a more brittle system that can't handle changes. The reason is pretty straightforward, but is more of a human reason than a mathematical one: when someone decides to modify a protocol for some new feature, they know they obviously have to modify the code that produces and consumes the protocol in order to implement the feature. But if they have to update a bunch of other places too, that's at best more work, and at worst easily forgotten. It's really important that any part of the system that is just a middleman will be agnostic to the data and pass it through unmodified -- even if the data is based on a newer version of the schema than the middleman is aware of.
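A rough sketch of the middleman argument, using JSON as a stand-in for the wire format. The field names, `KNOWN_FIELDS`, and both functions are hypothetical and only illustrate the two behaviours being contrasted.

```python
import json
from typing import Tuple

# Hypothetical: the only fields the middleman's (older) schema version knows about.
KNOWN_FIELDS = {"route", "payload"}


def forward_passthrough(raw: bytes) -> Tuple[str, bytes]:
    # Reads only the field it needs for routing and forwards the original
    # bytes untouched, so fields added by a newer schema survive.
    msg = json.loads(raw)
    return msg["route"], raw


def forward_strict(raw: bytes) -> Tuple[str, bytes]:
    # Validates against its own schema version and rejects anything newer,
    # so every schema change requires redeploying the middleman first.
    msg = json.loads(raw)
    unknown = set(msg) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return msg["route"], raw
```

With a message that carries a newer field such as `locale`, `forward_passthrough` still routes it and the consumer still sees `locale`, while `forward_strict` fails until it is updated.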

So yes, you actually want the validation to be in your business logic. But you don't want it to complicate that business logic too much. Most of the time, optional fields (with default values) provide the right balance between making changes easy without making code ugly. Sometimes, a more drastic change -- like declaring a new version of the protocol and writing translation layers -- is a good idea, but this is an expensive step that you want to do rarely.
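As a small illustration of the optional-with-default case, assuming a hypothetical `page_size` field added in a later schema version and a hypothetical default of 20:

```python
DEFAULT_PAGE_SIZE = 20  # hypothetical default agreed on when the field was added


def page_size(msg: dict) -> int:
    # Older senders omit "page_size"; the consumer falls back to the default
    # instead of branching on which schema version produced the message.
    return msg.get("page_size", DEFAULT_PAGE_SIZE)
```

Messages from old and new senders go through the same code path; only the value differs.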

Now, obviously you don't agree with this. But your arguments sound like they are coming from a place of intuition, not experience. That's fine, intuition is critical to innovation. But you can't go around claiming your intuition is "superior" without proving it out in practice. Intuition is always based on a simplified model in your head, and the real world often doesn't work like you think it will. I assure you that you don't know anything I don't; in decades of working on this stuff I've heard all the ideas. The only way to prove yourself right is to actually build systems your way and show success in the field. Of course, there will likely never be a definitive proof that one idea or the other is superior, only anecdotal experience. However, the fact that a large majority of successful distributed systems today are built on Protobuf or a similar model to Protobuf suggests that experience leans heavily in that model's favor.

> The "required fields considered harmful" opinion was a hard lesson learned through real experience -- the experience of repeated outages of large, complex systems like Google Search, GMail, etc. Certainly, prior to this experience, everybody assumed required fields were a good idea.

There is a common trait in the systems you mentioned: they generally allow for a permissive representation of domain data, where many of the fields can be omitted or replaced by zero values / defaults, because most of them, by their nature, deal with things that are optional and tolerant of noise and accidental mistakes (percentile precision). How much A/B test data and user tracking stats do Gmail / Google Search encode and process as protobuf?

If you compare it to a simulation engine's data stream or a collaborative BIM / CAD model, you will find that almost everything that travels over the network in these systems is required to be unambiguous and strictly consistent at the sending and receiving sites. The binary representations of physical relations in these models are not just scalar values that can tolerate a default value assigned by a protocol parser upon receiving a missing field. The scalar values appear at UI rendering / output formatting. But most of the time you deal with relations and equations, and you need to be able to differentiate between missing-by-intent and missing-by-mistake cases. Zero values will not help either, because a zero value itself can be represented in multiple ways: depending on the model being evaluated and the context it's evaluated in, values can legitimately come in different precisions, units, ratios (discrete vs dense) and so on, and those are not distinct fields; their combinations are often mutually exclusive. This is not the kind of validation you want to delegate to call sites implemented in different languages and maintained by different teams with differing technical capacity for proper validation. The invariants and constraints have to be encoded into the protocol, and required fields are a low-level "must have" part of it.
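To make the missing-by-intent vs missing-by-mistake distinction concrete, here is a minimal sketch. The wire layout (`{"dimension": {"kind": ...}}`) and the type names are hypothetical and not taken from any real BIM / CAD protocol.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Millimetres:
    value: float


@dataclass(frozen=True)
class Ratio:
    numerator: int
    denominator: int


@dataclass(frozen=True)
class Unset:
    # The sender deliberately left the dimension unconstrained.
    pass


# Exactly one representation at a time; combinations are not expressible.
Dimension = Union[Millimetres, Ratio, Unset]


def decode_dimension(wire: dict) -> Dimension:
    # Required field: absence is treated as a sender bug, not as a default.
    if "dimension" not in wire:
        raise ValueError("dimension is required")
    d = wire["dimension"]
    kind = d.get("kind")
    if kind == "unset":
        return Unset()
    if kind == "mm":
        return Millimetres(float(d["value"]))
    if kind == "ratio":
        return Ratio(int(d["num"]), int(d["den"]))
    raise ValueError(f"unknown dimension kind: {kind!r}")
```

Here `Millimetres(0.0)`, `Ratio(0, 1)` and `Unset()` are three different values, whereas a parser-assigned default would collapse all of them, plus the sender-forgot case, into the same zero.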