Hacker News new | ask | show | jobs
by mattboardman 718 days ago
My biggest complaint with gRPC is proto3 making all nested type fields optional while making primitives always present with default values. gRPC is contract based so it makes no sense to me that you can't require a field. This is especially painful from an ergonomics viewpoint. You have to null check every field with a non primitive type.
4 comments

If I remember correctly the initial version allowed required fields but it caused all sorts of problems when trying to migrate protos because a new required fields breaks all consumers almost by definition. So updating protos in isolation becomes tricky.

The problem went away with all optional fields so it was decided the headache wasn't worth it.

I used to work at a company that used proto2 as part of a homegrown RPC system that predated gRPC. Our coding standards strongly discouraged making anything but key fields required, for exactly this reason. They were just too much of a maintenance burden in the long run.

I suspect that not having nullable fields, though, is just a case of letting an implementation detail, keeping the message representation compatible with C structs in the core implementation, bleed into the high-level interface. That design decision is just dripping with "C++ programmers getting twitchy about performance concerns" vibes.

Zoox?
According to the blog post of one of the guys who worked on proto3: the complexity around versioning and required fields was exacerbated because Google also has “middle boxes” that will read the protos and forward them on. Having a contract change between 2 services is fine, required fields are probably fine, have random middle boxes really makes everything worse for no discernible benefit.
proto2 allowed both required fields and optional fields, and there were pros and cons to using both, but both were workable options.

Then proto3 went and implemented a hybrid that was the worst of both worlds. They made all fields optional, but eliminated the introspection that let a receiver know if a field had been populated by the sender. Instead they silently populated missing fields with a hardcoded default that could be a perfectly meaningful value for that field. This effectively made all fields required for the sender, but without the framework support to catch when fields were accidentally not populated. And it made it necessary to add custom signaling between the sender and receiver to indicate message versions or other mechanisms so the receiver could determine which fields the sender was expected to have actually populated.

You can now detect field presence in several cases: https://protobuf.dev/programming-guides/field_presence/#pres...

This is very solid with message types, but for basic types you can add `optional field` if needed as well (essentially making the value nullable)

proto2 required fields were basically unusable for some technical reason I forget, to the point where they're banned where I work, so you had to make everything optional, when in many cases it was unnecessary. Leading to a lot of null vs 0 mistakes.

Proto3 made primitives all non-optional, default 0. But messages were all still optional, so you could always wrap primitives that really needed to be optional. Then they added optionals for primitives too recently, implemented internally as wrapper messages, which imo was long overdue.

The technical reason is that they break the end-to-end principle. In computer networking in general, it is a desirable property for an opaque message to remain opaque, and agnostic to every server and system that it passes through. This is part of what makes TCP/IP so powerful: you can send an IP packet through Ethernet or Token Ring or Wi-Fi or carrier pigeon, and it doesn't matter, it's just a bunch of bytes that get passed through by the network topology.

Protobufs generally respect this property, but required fields break it. If you have a field that is marked 'required' and then a message omitting it passes through a server with that schema, the whole message will fail to parse. Even if the schema definition on both the sender and the recipient omits the field entirely.

Consider what happens when you add a new required field to a protobuf message, which might be embedded deep in a hierarchy, and then send it through a network of heterogenous binaries. You make the change in your source repository and expect that everything in your distributed system will instantly reflect that reality. However, binaries get pushed at different times. The sender may not have picked up the new schema, and so doesn't know to populate it. The recipient may not have picked up the new schema, and so doesn't expect to read it. Some message-bus middleware did pick up the new schema, and so the containing message which embeds your new field fails to parse, and the middleware binary crashes with an assertion failure, bringing down a lot of unrelated services too.

If you want to precisely capture when a field must be present (or will always be set on the server), the field_behavior annotation captures more of the nuance than proto2's required fields: https://github.com/googleapis/googleapis/blob/master/google/...

You could (e.g.) annotate all key fields as IDENTIFIERs. Client code can assume those will always be set in server responses, but are optional when making an RPC request to create that resource.

(This may just work in theory, though – I’m not sure which code generators have good support for field_behavior.)

That decision seems practical (especially at Google scale).

I think the main problem with it, is that you cannot distinguish if the field has the default value or just wasn't set (which is just error prone).

However, there are solutions to this, that add very little overhead to the code and to message size (see e.g. [1]).

[1]: https://protobuf.dev/programming-guides/dos-donts/

But you can distinguish between default and unset: all optional fields have has_ method associated with them: https://protobuf.dev/reference/cpp/cpp-generated/#fields
The choice to make 'unset' indistinguishable from 'default value' is such an absurdly boneheaded decision, and it boggles my mind that real software engineers allowed proto3 to go out that way.

I don't get what part of your link I'm supposed to be looking at as a solution to that issue? I wasn't aware of a good solution except to have careful application logic looking for sentinel values? (which is garbage)

Yes, proto3 as released was garbage, but they later made it possible to get most proto2 behaviors via configuration.

Re: your question, for proto3 an field that's declared as "optional" will allow distinguishing between set to default vs. not set, while non-"optional" fields don't.

Ah, ok! Yeah I think we've been working with an older version of protobuf for a while where that wasn't an option.
Yet still in 2024, supporting optional is off by default for some languages in protoc...
RPC/proto is for transport.

Required/validation is for application.

If that's true, why have types in proto at all? Shouldn't everything be an "any" type at the transport layer, and the application can validate the types are correct?
Different types can and do use different encodings
It was a little weird at first, but if you just read the 1.5 pages of why protobuf decided to work like this it made perfect sense to me. It will seem over complicated though to web developers who are used to being able to update all their clients at a whim at any moment.
proto3 added support for optional primitives sorta recently. I've always been happy without them personally, but it was enough of a shock for people used to proto2 that it made sense to add them back.
Just out of curiosity, what domain were you working in where "0.0" and "no opinion" were _always_ the same thing? The lack of optionals has infuriated me for years and I just can't form a mental model of how anybody ever found this acceptable to work with.
Like nearly every time, empty string and 0 for integers can be treated the same as "no value" if you think about it. Are you sending data or sending opinions? Usually to force a choice, you would make a enum or a one-of where the zero-value means the client has forgotten to set it and it can be modelled as a api error. Whether the value was actually on the wire or not is not really that important.
0 as a default for "no int" is tolerable, 0.0 as a default for "no float" is an absolute nightmare in any domain remotely related to math, machine learning, or data science.

We dealt with a bug that for weeks was silently corrupting the results of trials pitting the performance of various algos against each other. Because a valid response was "no reply/opt out", combined with a bug in processing the "opt out" enum, also combined with a bug in score aggregation, functions were treated like they replied "0.0" instead of "confidence = None".

It really should have defaulted NaN for missing floats.

What about floats made this a problem? We're treating 0.0 specially in some places.
I think your anecdote is rather weak with regards to the way protobuf works, but to entertain, why would a confidence of 0.0 be so different from None? 0.0 sounds very close to None for most numerical purposes if you ask me.

Wait, are you using Python?

message LatLon {

    double lat = 1; // Geodetic latitude, decimal degrees

    double lon = 2; // Geodetic longitude, decimal degrees
}

"Hmm, lat = 0, I guess they didn't fill out the message. I'll thrown an exception and handle it as an api error"

[Later, somewhere near the equator]

"?!?!??!"

------------------

"Ok, we learned our lesson from what happened to our customers in Sao Tome and Principe: 0.0 is a perfectly valid latitude to have. No more testing for 0, we'll just trust the value we parse.

[Later, in Norway, when a bug causes a client to skip filling out the LatLon message]

"Why is it flying to Africa?!?!"

------------------

Ok, after the backlash from our equatorial customers and the disaster in Norway, we've learned our lesson. We will now use a new message that lets us handle 0's, but checks that they really meant it:

message LatLonEnforced {

    optional double lat = 1;

    optional double lon = 2;
}

[At some third party dev's desk] "Oh, latitude is optional - I'll just supply longitude"

[...]

"It's throwing exceptions? But your schema says it's optional!"

------------------

Ok, it took some blood sweat and tears but we finally got this message definition licked:

message LatLonEnforced {

    optional double lat = 1; // REQUIRED. Geodetic latitude, decimal degrees

    optional double lon = 2; // REQUIRED. Geodetic longitude, decimal degrees
}

[Later, in an MR from a new hire] "Correct LatLon doc strings to reflect optional semantics"

Yes it was python but that has nothing to do with it. Same would happen in go, rust, R, or matlab.

Correct answers: 1.0, 0.0, 1.0

Confidence from algo: 1.0, 0.0, n/a

Confidence on the wire: 1.0, 0.0, 0.0

Score after bug: 66%

Score as it ought to be scored: 100%

It was enough to make several algorithms which were very selective in the data they would attempt to analyze (think jpg vs png images) went from "kinda crap" in the rankings to "really good"

Yeah, I think it's best to first rethink of null as just, not 0 and not any other number. What that means depends on the context.

Tangent: I've seen an antipattern of using enums to describe the "type" of some polymorphic object instead of just using a oneof with type-specific fields. Which gets even worse if you later decide something fits multiple types.

I love oneofs in the language with good support, but they are woeful in Golang and the "official" grpc-web message types.
Not really one domain in particular, just internal services in general. In many cases, the field is intended to be required in the first place. If not, surprisingly 0 isn't a real answer a lot of the time, so it means none or default: any kind of numeric ID that starts at 1 (like ticket numbers), TTLs, error codes, enums (0 is always UNKNOWN). Similarly with empty strings.

I have a hard time thinking of places I really need both 0 and none. The only example that comes to mind is building room numbers, in some room search message where we wanted null to mean wildcard. In those cases, it's not hard to wrap it in a message. Probably the best argument for optionals is when you have a lot of boolean request options where null means default, but even then I prefer instead naming them such that the defaults are all false, since that's clearer anyway.

It did take some getting used to and caused some downstream changes to how we design APIs. I think we're better for it because the whole null vs 0 thing can be tedious and error-prone, but it's very opinionated.

Any time you have a scalar/measurement number, basically any value with physical units, counts, percentages, anything which could be in the denominator of a ratio, those are all strong indicators of a "semantic zero" and you really want to tell the difference between None and 0. They are usually floats, but could be ints (maybe you have number_of_widgets_online, 0 means 0 units, None means "idk".)
What's the difference between none inches and 0 inches? Might need a concrete example. We deal with space a fair amount and haven't needed many optionals there.
They just gave a concrete example: The difference between "we don't know" and "we know that it's zero".

Here's another fun one: I've seen APIs where "0.0" was treated as "no value, so take the default value". The default value happened to be 0.2.

I had the same experience. It was a bit awkward for 6 months, but down the line we learned to design better apis, and dealing with nullable values are tedious at best. Its just easier knowing that a string or integer will _never_ cause a nullpointer.
Huh, yeah I see. I guess I work more on the robotics side where messages often contain physical or geometric quantities, and that colors my thinking a bit. So "distance to that thing = 0" is a very possible situation, and yet you also want to allow it to say "I didn't measure distance to that thing". And those are very distinct concepts you never want to conflate.
I can see that. Or the rare situations where I needed 0 vs null, if for some reason that situation was multiplied times 100, I'd start wanting an optional keyword.
Same, I rarely felt a need for this distinction.