Hacker News new | ask | show | jobs
by magicalhippo 522 days ago
The result of comparisons involving NULL values can result[1][2] in UNKNOWN, and in PostgreSQL for example you can test[3] for this using IS UNKNOWN.

That said, as someone self-taught in SQL, I agree NULL was not a good choice.

Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.

[1]: https://stackoverflow.com/a/79270181

[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...

[3]: https://www.postgresql.org/docs/current/functions-comparison...

3 comments

SQL was developed in the 1970s, there’s no way they’d waste all those bytes to spell out UNKNOWN and INDETERMINATE.
Have you ever seen SQL? Blowing lots of bytes on a query was not a concern. They could have made a much more compact syntax with little effort, but it wouldn't have looked kind of like English the way SQL does.

You could imagine a world where instead of:

SELECT ( email, name, outstanding_balance ) FROM accounts WHERE outstanding_balance > 0 AND last_payment > 60 ORDER BY name

the queries looked more like:

accounts: outstanding_balance > 0 & last_payment > 60 => email, ^name, outstanding_balance

There were plenty of contemporary languages that looked like the latter, but few survive to this day. SQL's relative verbosity was not seen as a problem then and is definitely not one today.

Besides, if the verbosity was a problem they could easily shorten it to UNK. That would have been perfectly normal.

Also self-taught SQLer and I don't have an issue with NULL.

I also don't use UNIQUE constraints, so maybe that has something to do with it.

I don't have an issue as such, I was a fairly experienced developer first time I had to dabble with SQL, but sometimes it can still surprise.

For example I learned the hard way that the DB we use at work does not index NULL values.

And once in a while if I'm tired or stressed I might forget about UNKNOWN and thus that "Col <> 42" does not return rows where Col is NULL.

Not that better naming would prevent such surprises, but I still think the current naming is less than optimal from a pedagogical perspective. At least I see this at times when teaching our support folks SQL (many have domain background and not a technical background).

UNKNOWN isn’t always correct though. Let’s say your data is input by users filling out a form and some of the fields are allowed to be left blank. NULL captures both the case where the user intentionally left the field blank but also the case where they accidentally skipped that field.

So NULL can capture multiple distinct concepts: unknown values (say, as a result of a calculation), not applicable (where the data has been deliberately left out), and missing data (possibly due to instrumentation or user error).

Is it really desirable to combine "unknown" and "no value" into the same identifier? They seems like very distinct concepts. Using your form example you might have a name field that includes a spot for a suffix (John Doe III) for example that many people would leave blank because they don't have a suffix. In that case you should encode it as a NULL, but if they fail to fill in their family name then that's UNKNOWN.

If you do it this way you can avoid some confusion as to what NULL represents.

John Doe NULL IS EQUAL John Doe NULL

John Doe NULL NOT EQUAL John Doe UNKNOWN

John Doe UNKNOWN NOT EQUAL John Doe UNKNOWN

Determining if any particular input is NULL or UNKNOWN is a tricky problem, but at least this gets the programmer thinking about it up front and structuring their code to behave in the sanest possible manner.