Hacker News new | ask | show | jobs
by layer8 531 days ago
SQL NULL doesn’t behave like “unknown” in all contexts. That’s one issue of NULL, that it doesn’t really have consistent semantics.
1 comments

Furthermore if null only means unknown then we need a value for “known absent”, there’s a reason why null is so often used as that.
From a purely relational perspective, if some piece of data can be absent, it's a 1:N relation where N<=1, and ought to be encoded as such.

(Of course, this is rather awkward in practice, and when NULL is there, it's inevitably going to be used for that instead.)

It is encoded as such. That's why most columns are made nullable. It's crazy to say you need to use the full power of a 1:N relation with some child table when you know N cannot be greater than 1, when a nullable column already exactly encodes a 1:(0..1) relation. I'm not trying to shill for null here: one of null's great problems is exactly the fact that null can represent "unknown", "known absent", "not applicable", or even some sentinel-ish "other" escape condition, each with their own subtle differences in how they should be handled.

Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!

I don't know if it's "patently absurd" given that it's a well-known C.J. Date take on NULLs in SQL. You may disagree with him on this - I do - but the very fact that the father of relational algebra has this take should be sufficient evidence that it's not a trivial question to be dismissed without consideration.
But the "known absent" value is going to be different for different domains. For example, in EEOC databases the "known absent" value for a race would be "declined to answer". In a database of test scores, it might be "Didn't complete", but it could also be "was absent from class on exam day" so SQL can't specify what that is. On the other hand "this value is unknown" can use the same marker in all domains, and SQL chose NULL as that marker. To be completely strict about it, "have a value/don't have a value" is one piece of data if that's something you care about and "what is that value" is another one. So in an ideal system, you should have a column for "value is [present | known absent | unknown]" and a separate column for the actual value when "value is present"

Most of the time it's not that important and people can and do shortcut "null" to mean "not present" but then the issues with using null in equality statements is a result of taking the short cut, not necessarily with the logic around null.

Do you actually need that in a Boolean context? It would only be useful for evaluating self-referent claims like "this sentence is false".
Your questions might be relevant if null were limited to boolean contexts.

It’s not.

We could really use some kind of "polymorphic table" that can represent a sum type in a convenient way.