Hacker News new | ask | show | jobs
by crazygringo 524 days ago
That doesn't address anything in the second half of the post though, starting with this pull quote:

> The fact that NULLs are distinct for UNIQUE columns but are indistinct for SELECT DISTINCT and UNION continues to be puzzling. It seems that NULLs should be either distinct everywhere or nowhere. And the SQL standards documents suggest that NULLs should be distinct everywhere. Yet as of this writing, no SQL engine tested treats NULLs as distinct in a SELECT DISTINCT statement or in a UNION.

7 comments

This is confusing when you know that NULLs are not comparable, but it makes some sense if you consider the result of distinct/union as the output of a GROUP BY. You can consider everything that's NULL to be part of the same group, all the values are unknown.

So NULLs are not comparable but they are part of the same set.

If nulls are distinct then group by should not group them together, this just ignores the problem. Why does group by treat them as equal?
It doesn't treat them as equal, it treats them as one group. It does this because the result is more useful.

It is not the case that nulls are always the same as one another. It is also not the case that nulls are always distinct from each other. Thus, the normal rule of grouping, that same values are combined and distinct values make different groups, does not apply. Another principle is needed to determine the result of grouping with nulls.

Logic which allows for an unknown value can't be shoehorned into always giving definite true or false answers.

Nulls are not necessarily distinct.

I believe this confusion is confusing the tool with the thing being measured. For simplicity, I will use the analogy of a record (stored as a row in the database) as an observation in a scientific experiment. If the tool was able to record a value, I enter a value like 579.13. If the tool was not able to record a value, the tool will enter NULL. I make a total of one hundred observations. Of one hundred rows, some have values and some are NULL.

Are NULLs distinct values? No, they are simply a failure in measurement; it is like asking if all errors are distinct or the same. Are NULLS part of the same dataset? Yes, because they are all observations for the same scientific experiment. What does it mean when "select distinct ... " returns several rows for known/measurable values and but only one row for NULL? If this is confusing, the scientist can update the rows and substitute "UNKNOWN/ERROR" for every NULL. When you do "select distinct ...", you will get the same thing. It will return several rows for known/measurable values and but only one row for "UNKNOWN/ERROR".

This is where the foundation of a relational database semantics in set theory shows through. You can’t model the behaviour with pure boolean logic.

In the SQL spec by default unique indexes consider nulls distinct because you’re adding tuple to the relation, and this is done by equality.

When doing a select distinct or group by you’re not doing a boolean grouping, you’re doing a set projection. NULLs are considered part of the set of “unknown values”, so NULLs are grouped together but they’re still not equal to each other.

The behaviour is mathematically well defined, but it doesn’t match boolean logic.

I’ve been dealing with databases for well over 20 years now and I can only think of a couple of times when this behaviour wasn’t wanted, so I think it’s the right choice.

That's because "different" and "distinct" don't mean the same thing.

Two unknown values are assumed to be different, but they are not distinct from each other.

For example, take two boxes, in each box is a die, the value of the box is the value shown on the die inside. You don't know the value since you don't see the die, it may even change as you manipulate the box, so it is unknown, NULL in SQL. Because of that, you assume the two boxes have different values. They are, however, indistinguable, so, not distinct. All you need to know is that you hold two boxes with dices in it, which one you hold doesn't matter, and that's what "SELECT DISTINCT" tells you: that you have two boxes and it doesn't matter which is which, even though they have different values.

>That's because "different" and "distinct" don't mean the same thing.

The literal definition distinct is:

>recognizably different in nature from something else of a similar type.

If you want to get down to it nothing is "equal" or the same.

Is a temperature measurement 25C the same as another of 25C? No these measurements are an approximation of the actual values which are actually not equal to each other they are distinct they have just been lumped into the same 25C group due to the resolution of measurement yet equality works just fine on that value in sql.

I have used SQL for a long time null handling is weird and inconsistent and a waste of time. For all the language bugs due to the existence of null at least I can count on null=null and not write garbage like value=param or (param is null and value is null)

> The literal definition distinct is

Irrelevant. What matters is the meaning in the context of SQL.

> weird and inconsistent and a waste of time. For all the language bugs due to the existence of null

There are necessary, semantic cases that need to be dealt with. How else would you do it?

Also, it's really weird to use "bugs" to refer to well defined and well documented behavior.

I wanted to briefly reinforce this point with the fact that SQL has multiple equality operators - there is both `=` and `IS NOT DISTINCT FROM`. The later operator will treat null values as equal in alignment with the `DISTINCT` and `DISTINCT ON` operators.

It is extremely easy using partial uniques and the proper operators to treat nulls as non-distinct values and the approach we have allows some very important functionality that aligns with standard statistical data treatment which was influential to how SQL logic was originally designed.

> Is a temperature measurement 25C the same as another of 25C?

Yes, the measurements are the same.

The actual temperatures probably are not, but measurements are not the same as the thing measured.

>Yes, the measurements are the same.

By the logic two unknown (null) measurements are the same regardless of the actual value which I agree with.

An unknown measurement isn't a measurement value its a statement of (lack of) knowledge about a measurement, that doesn't tell you what the measurement is. Knowledge about a measurement is as different from the measurement as the measurement itself is from the thing measured.

Whether two unknown measurements are the same is unknown.

Whether two measurements of 25C are the same is unknown, these are just values recorded in a database. 25 is a value, null is a value.

The values in the db are the same in both cases which is what I would like my db language to deal with and not make assumptions about what that value actually means.

I see no value in treating null special when in comes to equality in a sql db, in fact it is a hinderance that it does so in my experience.

Changing the emphasis.

> recognizably different in nature from something else of a similar type.

But anyways, the point wasn't to justify the choices of SQL but rather as a way to make intuitive sense of its logic. SQL is one of the oldest and most successful programming languages in existence, we are not going to change it, and it is not going to disappear anytime soon, so we have to go with it, like it or not. There have been some attempts at alternatives, both at changing the paradigm (NoSQL) and at cleaning up the language, which, to be fair, would be a good thing, but without much success. The relational paradigm just works, and SQL is usable enough to make the cost of switching not worth it.

Edit:

And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic. You are using it wrong basically. Sometimes, it is the right thing to do, but if that pattern starts appearing all over the place, it is usually a result of thinking "NULL is broken, I have to use this pattern to handle NULL properly". That's cargo culting, don't fix problems you don't understand by copy-pasting code you don't understand.

Note: this is not addressed to "you" in particular, there can be good reasons, no offense intended. But I think that in general, it is a code smell.

If it is not recognizably different than it is the same in that context correct?

Two measurements of 25C are not recognizably different therefore they are equal, correct, regardless if the actual temperatures are not the same?

Two measurements of unknown are not recognizably different therefore they are equal in the context of the database.

Having null!=null has never been intuitive to me especially since every other programming language treats them equal. I am not hoping this gets changed, I know SQL is to far along for that, I can still complain about it and agree its wierd.

>And writing things like "value=param or (param is null and value is null)" is usually the sign of a poor understanding of the NULL logic.

It's needed with parametrized sql when your db doesn't support "is not distinct from" which is itself a silly way to just write '=' or '==' like a normal programming language. The distinct predict exist for this very reason to have yet another way to express equality that includes nulls: https://modern-sql.com/caniuse/T151

MS SQL Server treats NULLs as indistinct for UNIQUE constraints, SELECT DISTINCT and for UNION.

Indeed, the sqlite page the pull quote is from says as much.

> MS SQL Server treats NULLs as indistinct for UNIQUE constraints

Postgres lets you control that behaviour when creating the constraint (or index)

Although only in relatively recent versions. I had to hack around this in TypeOrm, because their Postgres backend hasn't exposed the option yet.
Yes, ergonomics dictates some weird behavior for nulls.

Luckily, Postgres nowadays lets you declare the behavior of each null in unique constraints, like it should be. We can expect this to creep down to the other DBMSs with time.

Making nulls distinct on a "select distinct" or a "group by" would be completely useless.

If you're compromising on your high-minded and unorthodox purist concept for ergonomics, you may as well finish the job of ergonomics and just use normal nulls where X=X is true.
It's possible that this is due to the underlying implementation.

In a unique column normally you'll have an index, so NULL becomes a special value in an index, but in SELECT DISTINCT you probably won't have an index, which means a full scan is performed, then every row has to be compared with every other row.

If you're including possibly NULL columns in a distinct or group by and you want to treat them in a particular way, use the COALESCE() or NVL() or whatever similar function to give a real value to the NULL for that purpose.
Isn't "select distinct" wildly frowned upon anyway? It's the same as "group by", but with less options...
It’s not and it’s not, respectively.
Yeah nothing wrong with “select distinct” itself if it’s used correctly for its intended reasons.

But when I see select distinct at the start of a big and complex query, I do immediately suspect that the developer might have missed some join condition down the line and “got too many rows” back from the query. And since the rows look like duplicates due to the missing join predicate, for a junior (or careless) developer, the quick “solution” is to just apply distinct in the top level query to get rid of these pesky duplicates, declare success and move on.