Hacker News new | ask | show | jobs
by zigzag312 1445 days ago
> Personally, I think that the existing PostgreSQL behavior (NULLS DISTINCT) is the "right" one, and the other option was mainly intended for compatibility with other SQL implementations.

Care to explain why you think NULLS DISTINCT is the "right" default behavior? What problems does it solve to warrant additional complexity by default?

9 comments

I think it's worth pointing out what the opposite means - to say that "this field might have a value, or it might be null - but only one tuple/row can set this field to be null" implies that you have given the null option some kind of real-world value.

That is to say, you are using null to encode some kind of meaning, when really this is not what null is supposed to be used for.

That's not to say I think we should be morally disapproving of people who do that - I use things for their unintended purpose all the time, and it bugs me when people get on a high horse. Use what makes sense to you - and I love this change for exactly that reason.

But the general theoretical approach is that if you want to care about the value in a field, you need to give it a value. Null is for the valueless, and if that isn't an allowed state, you should simply set the field as not-nullable. Theory is fine in theory. For a practical database that exists in the real world, this option is a good addition.

Fauna has an interesting approach to null that I've grown to like:

Null is a real value that can be compared intuitively like any other value (`Equals(null, null)` returns true). [1]

However, in indexes, any term that evaluates to null is simply not stored in the index. So we can create a unique index with multiple null values because they simply won't exist in the index, so won't violate the unique constraint.

If we do care about null values in a particular index, we can handle that by mapping the null value into another value that will get stored in indexes using a "binding", and then use that index to query for or constrain uniqueness by that mapped value [2].

This is not the most convenient thing in the world, but at the end of the day, it feels like an edge case, so I'm happy with having to jump through some hoops to handle it in exchange for making the much more common cases (null comparisons and unique indexes with nullable values) more intuitive and less error prone.

[1] https://fauna.com/blog/understanding-nothing-or-null-in-faun...

[2] https://docs.fauna.com/fauna/current/learn/cookbook/fql/sear...

I think a lot of the trouble stems from the fact that databases do not have sum types and there is no way to encode a "None" type without hacks. NULL is the only reasonable option in a lot of cases.
Yeah, in hindsight the world would be a slightly better place if SQL didn't include NULL, but instead "UNKNOWN" and "NONE", with "NONE = NONE" being true, and "UNKNOWN = UNKNOWN" being unknown.
Or have an Option (or Maybe) type. This is how modern programming languages solve the nil issue, I don't see a reason why a database couldn't take the same approach.
When would NONE show up and when would UNKNOWN show up?
NONE would mean “there is no value”, and UNKNOWN “there might be a value, but I don’t know what it is”.
I'm not the person you're responding to, but also think NULLS DISTINCT makes sense in many cases. NULLs often represent missing data. Imagine storing a customer's address and the street name is NULL. If several customers have a street=null, it doesn't mean that they have the same street. So from a data perspective, it makes sense to treat these unknown values as distinct.

For filtering and aggregation I still welcome the change, as it makes sense in many cases to treat them as not distinct.

Thanks. It seems that the issue arises from value equality vs optional value state equality.

To me, a more natural way to treat NULLs is to think of NULL not as a value, but as a state. Several customers with a street=null all have street property in an equal state. However, an equal state doesn't mean the value is also equal, as there is no value in that state. Option type in functional languages models this perfectly:

  'a option = 
    |None       // no value here
    |Some of 'a // value of type 'a
So, when checking, if customers have the same street we need to check the value. So, comparision is only valid if street has value defined. Adding null check in addition to equality check to ensure that the actual values are equal feels most natural to me (as this is how it's done in most languages).

Unique constraints should not be a problem. Unique constraints are about values. So if state is NULL there is no value and so constraint does not treat NULL states as equal values. Adding a keyword to change constraint behavior when needed, would be best IMO, as it would need to be rarely used.

> So, when checking, if customers have the same street we need to check the value. So, comparision is only valid if street has value defined. Adding null check in addition to equality check to ensure that the actual values are equal feels most natural to me (as this is how it's done in most languages).

Is it tho?

Languages with ubiquitous nullability will accept both being null or both being non-null and the same value. Languages with option types do the same (as long as the value is equatable obviously), this includes OCaml: https://v2.ocaml.org/api/Option.html#preds.

The only languages I can think of which seriously diverge would be C and C++ where this is UB.

Maybe if you way stretch it Java because it doesn't have operator overloading, but even there there's Object.equals which is null-safe... and returns `true` if both parameters are `null`.

I completely agree. I tried to explain basically the same thing as you, but it seems there is some ambiguity in my example I didn't intend.

Comparing two nullable/option values when both are in a "null state" will return true in most languages. Which, to me, is the most logical result as they have the same state. So, if you want to check if two people actually live on the same street, it's not enough to check p1.street == p2.street, but you need to also check that at least one of the streets is not null. Two nulls would return true, but if equality returns true and we know that one of the streets isn't null, we know we are comparing actual street values (not null states). Sadly, in SQL, NULL=NULL result is not true, but NULL.

It seems SQL NULL means both undefined and none which seems to cause confusion.
Not OP, but this constraint means "the data must either be empty or unique", which is an extremely common constraint. In contrast, I've never encountered "only 1 entry is allowed not to have the data, all others must be unique".
"only 1 entry is allowed not to have the data, all others must be unique" makes sense in larger schemas as the database grows. It replaces is_default_address, is_primary_child etc. fields and constraints on relational tables. In the perfect world all the data would be normalized not to have such columns and cases, but in real life it just grows that way. So for some cases NULLS NOT DISTINCT will be a welcome addition.
Not parent commenter, but to me it seems "obviously correct".

As explained in the article NULL means "unknown". Let's say I have two people, and a "tax identifier" column. Let's say two people can't have the same tax id. Obviously "don't have one" or "unknown" fit very well in NULL, and two people can both be missing a tax id, even if those who have it need it to be unique.

Except that's the ambiguity of NULL in the standard. It doesn't just mean "unknown". It can also mean "no value". And you can clearly compare two elements with no value.

Which is why nulls are so incredibly confusing.

Yeah that's true. All the database can say is "I don't have a value here". Not "there exists no value for this column".

But this doesn't mean that two NULLs are the same. Two people with NULL tax IDs are not the same (one turns out to be unknown but existing, the other is a toddler without a tax ID). Ten people with NULL address don't live in the same house. Half declined to give address, the other half are homeless. Either way we can't send mail to them.

If you use the database to have a UNIQUE constraint of "only one purchase per household" then it makes no sense to allow NULL addresses, but only allow the first NULL address purchaser to buy the thing.

Or "sorry, in this hotel we only allow one guest at a time without a car, and someone else already doesn't have a car".

Does that guest without a car actually have a car? I don't think that's something that the database can solve.

Should databases have two separate values for unknown or no value? That sounds like a world of hurt, with two types of null.

> Should databases have two separate values for unknown or no value? That sounds like a world of hurt, with two types of null.

It’s also a non-problem in practice, because if an application needs to distinguish between multiple types of nulls, it can very easily just use an extra column holding the information needed to disambiguate.

Similarly, the need for non-boolean columns is a non-problem in practice, because an application can just make an extra column for every bit it needs to store. If it needs strings or arbitrary-width numbers it can just join on other tables that store those bits.

(Existence of a workaround doesn't turn a problem into a non-problem)

I stand by it being a non-problem in practice because the workaround is trivial, and the supposed solution adds complexity everywhere.

There isn’t even a universally agreed methodology for how to handle nulls in the relational model in theory.

It’s a semantic problem that can only be solved with reference to the requirements of a particular use case, and when you do that using explicit extra columns, the problem evaporates.

Which happen in world of javascript. It has null and undefined. Null variable is a variable / property that's defined and set with null, while undefined is a variable / property that's not defined yet (or maybe not initialized, I forgot).

I hope at least SQL will give another comparison operation, such as the current equal sign (=) means a.prop is not null && b.prop is not null && a.prop equal b.prop (for not defined yet).

Let's say another sign like a.prop *= b.prop, meaning a.prop is null && b.prop is null || a.prop = b.prop (for our intention is both value are empty).

Such a comparison operator exists: IS (NOT) DISTINCT FROM.
> It doesn't just mean "unknown". It can also mean "no value".

Isn't the canonical representation of known-no-value an absence of a tuple? Like as opposed to saying "There exists an employee X who works in department NULL", you simply don't make any claim about employee X working in a department? After all, when enumerating the members of a set, you're also omitting the enumeration of non-members of a set, and the law of excluded middle applies.

Nulls will reappear in your queries due to joins even if none exists in your data model.
Ah yes, they're the curse of SQL. Still doesn't seem to be a reason to not squash them out as much as feasible within the constraints of SQL.
It's probably a good default because it's more consistent with how NULL equality is handled in SQL generally. From the article:

> In Postgres 14 and prior, unique constraints treated NULL values as not equal to other NULL values.

...

> This is consistent with the SQL Standard handling of NULL in general, where NULL is unknown. It is impossible to determine if one unknown is equal to another unknown. Because NULL values are of unknown equality to one another, they do not violate UNIQUE constraints.

Your last statement would be better written as: Because NULL values are of unknown equality, they may or may not violate UNIQUE constraints.

The safest implementation would assume that they do violate the constraint, rather than the current behavior assuming they don’t.

In practice I don’t think I’ve ever added a unique index on a nullable column where null might imply unknown. I have used it in cases where null meant “none”.

> Care to explain why you think NULLS DISTINCT is the "right" default behavior? What problems does it solve to warrant additional complexity by default?

It's the most consistent with the equality behavior of null values elsewhere.

I generally agree that, in most cases, you want the NULLS DISTINCT behavior. But thank you for providing such a developer-friendly feature that allows flexibility here!

NULL in SQL is not terribly consistent overall. Sometimes NULL is treated like "unknown" and sometimes more like "n/a". And the interaction with non-scalar types (like records) is pretty strange. Also, it's common for ORMs to map the app language's NULL (or None/Nil/whatever) to SQL NULL, which adds its own nuance. So I can see this being a useful feature.

Well, if NULL is most often meant to signify an unknown value. You can't compare one unknown value with another and say they are the same.

The standard is, I believe, ambiguous about NULL because it can also mean that it is an absence of data. In other words, it is "this is not yet determined" (aka a CAR table has a COLOR table, but the paint is applied as the final operation so NULL could be used as the car is not at this stage yet and the color will be chosen later). In this case, you can compare NULLs against other NULLs, because you can have two cars that are in an indeterminant state. In that case, it is useful to compare to NULL values as it means the same thing.

An example where NULLS DISTINCT might make sense in a constraint is when a table is used to store an analytical cube roll-up (eg output of GROUP BY ROLLUP), where NULL in a dimension column has special meaning (it indicates subtotals). Having multiple rows with the same NULL key could, in some of those cases, be an error leading to double-counting.
Once you've accepted that X=X doesn't return TRUE when X=NULL, NULLS DISTINCT is consistent with that. ANSI NULL three value boolean algebra is insane, but it's insane in a pretty consistent way.