by kentonv 1658 days ago
The "required fields considered harmful" opinion was a hard lesson learned through real experience -- the experience of repeated outages of large, complex systems like Google Search, GMail, etc. Certainly, prior to this experience, everybody assumed required fields were a good idea.

More abstractly, the hard lesson was: In a large distributed system, the site of use is the only reasonable place to do data validation. If you do it anywhere else, you will create a more brittle system that can't handle changes. The reason is pretty straightforward, but is more of a human reason than a mathematical one: when someone decides to modify a protocol for some new feature, they know they obviously have to modify the code that produces and consumes the protocol in order to implement the feature. But if they have to update a bunch of other places too, that's at best more work, and at worst easily forgotten. It's really important that any part of the system that is just a middleman will be agnostic to the data and pass it through unmodified -- even if the data is based on a newer version of the schema than the middleman is aware of.
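To make the middleman point concrete, here is a minimal sketch in Python, with JSON standing in for the wire format. The `route` function and the field names (`shard`, `body`, `priority`) are hypothetical, made up for illustration and not taken from any real system. The middleman reads the one field it needs and forwards the original bytes, so fields it has never heard of survive the hop.

```python
import json
from typing import Tuple

def route(raw: bytes) -> Tuple[str, bytes]:
    """Hypothetical middleman: picks a destination, forwards the payload as-is."""
    msg = json.loads(raw)
    # The only field the middleman cares about; the default keeps old senders working.
    shard = msg.get("shard", "default")
    # Forward the original bytes, not a re-serialized copy of the fields we know about.
    return shard, raw

# A newer producer added "priority"; the middleman was never updated for it.
incoming = json.dumps({"shard": "eu", "body": "hi", "priority": 7}).encode()
dest, forwarded = route(incoming)
```

If `route` instead rebuilt the message from a fixed list of known fields, `priority` would be silently dropped until someone remembered to redeploy the middleman.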

So yes, you actually want the validation to be in your business logic. But you don't want it to complicate that business logic too much. Most of the time, optional fields (with default values) provide the right balance between making changes easy without making code ugly. Sometimes, a more drastic change -- like declaring a new version of the protocol and writing translation layers -- is a good idea, but this is an expensive step that you want to do rarely.
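A rough sketch of what that looks like, again in Python. `SearchRequest`, `parse` and `handle_search` are hypothetical names for illustration only. The decoder never rejects a message for a missing or unknown field; the one handler that actually needs `query` is the place that checks for it.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class SearchRequest:
    # Every field is optional on the wire; defaults stand in for absent ones.
    query: Optional[str] = None
    page_size: int = 10
    language: str = "en"

def parse(wire_fields: dict) -> SearchRequest:
    """Hypothetical decoder: tolerates missing fields and ignores unknown ones."""
    known_names = {f.name for f in fields(SearchRequest)}
    known = {k: v for k, v in wire_fields.items() if k in known_names}
    return SearchRequest(**known)

def handle_search(req: SearchRequest) -> str:
    # Site of use: this handler needs a query, so this is where the check lives.
    if not req.query:
        raise ValueError("search needs a non-empty query")
    return f"{req.page_size} results for {req.query!r} ({req.language})"
```

A caller that sends only `{"query": "cats"}` gets the defaults for everything else, and a caller that sends an extra field from a newer schema is not broken by it.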

Now, obviously you don't agree with this. But your arguments sound like they are coming from a place of intuition, not experience. That's fine; intuition is critical to innovation. But you can't go around claiming your intuition is "superior" without proving it out in practice. Intuition is always based on a simplified model in your head, and the real world often doesn't work like you think it will. I assure you, you don't know anything I don't; in decades of working on this stuff, I've heard all the ideas. The only way to prove yourself right is to actually build systems your way and show success in the field. Of course, there will likely never be a definitive proof that one idea or the other is superior, only anecdotal experience. However, the fact that a large majority of successful distributed systems today are built on Protobuf or a similar model to Protobuf suggests that experience leans heavily in that model's favor.

1 comment

> The "required fields considered harmful" opinion was a hard lesson learned through real experience -- the experience of repeated outages of large, complex systems like Google Search, GMail, etc. Certainly, prior to this experience, everybody assumed required fields were a good idea.

There is a common trait in the systems you mentioned: they generally allow a permissive representation of domain data, where many of the fields can be omitted or replaced by zero-values / defaults, because most of them, by their nature, deal with things that are optional and tolerant of noise and accidental mistakes (percentile precision). How much of the A/B test data and user tracking stats do Gmail / Google Search encode and process as protobuf?

If you compare it to a simulation engine's data stream or a collaborative BIM / CAD model, you will find that almost everything that travels over the network in these systems is required to be unambiguous and strictly consistent at the sending and receiving sites. The binary representations of physical relations in these models are not just scalar values that can tolerate a default assigned by a protocol parser when a field is missing. The scalar values appear at UI rendering / output formatting. But most of the time you deal with relations and equations, and you need to be able to differentiate between missing-by-intent and missing-by-mistake cases. Zero-values will not help either, because a zero value itself can be represented in multiple ways depending on the model being evaluated and the context it's evaluated in. The values can legitimately come in different precisions, units, ratios (discrete vs dense) and so on, and those are not distinct fields; their combinations are often mutually exclusive. This is not the kind of validation you want to delegate to call sites implemented in different languages and maintained by different teams with different technical capacity for solving the challenge of proper validation. The invariants and constraints have to be encoded into the protocol, and required fields are a low-level "must have" part of it.
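A small sketch of the distinction I mean, in Python. The `clearance` field, the `Length` / `Unconstrained` types and `decode_clearance` are hypothetical, invented for illustration and not taken from any real BIM / CAD protocol. The field is required, "no value" has to be stated explicitly by the sender, and the mutually exclusive variants are one tagged field instead of several independent optional ones.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Length:
    value: float
    unit: str  # e.g. "mm" or "m"; a bare 0 without a unit is not accepted

@dataclass(frozen=True)
class Unconstrained:
    """The sender states on purpose that no value applies (missing-by-intent)."""

Clearance = Union[Length, Unconstrained]

class SchemaError(Exception):
    """Raised by the decoder, before any business logic sees the message."""

def decode_clearance(msg: dict) -> Clearance:
    # Required field: absence is treated as a sender bug (missing-by-mistake).
    if "clearance" not in msg:
        raise SchemaError("required field 'clearance' is missing")
    raw = msg["clearance"]
    kind = raw.get("kind") if isinstance(raw, dict) else None
    if kind == "unconstrained":
        return Unconstrained()
    if kind == "length":
        return Length(float(raw["value"]), str(raw["unit"]))
    raise SchemaError(f"unknown clearance kind: {raw!r}")
```

With a defaulting parser, `{}` and an explicit zero-length clearance would both decode to the same zero value; here the first is rejected at the protocol boundary and the second is a `Length(0.0, "mm")` that the receiver can trust.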