Hacker News new | ask | show | jobs
by demurgos 523 days ago
> select null = null; returns NULL, because each NULL is basically a placeholder representing any “unknown value”. Two unknown values are not necessarily the same value; we can’t say that they are equal, because we don’t know the value of either of them.

Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.

Note also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.

0: https://www.postgresql.org/docs/15/sql-createtable.html

5 comments

SQL NULL doesn’t behave like “unknown” in all contexts. That’s one issue of NULL, that it doesn’t really have consistent semantics.
Furthermore if null only means unknown then we need a value for “known absent”, there’s a reason why null is so often used as that.
From a purely relational perspective, if some piece of data can be absent, it's a 1:N relation where N<=1, and ought to be encoded as such.

(Of course, this is rather awkward in practice, and when NULL is there, it's inevitably going to be used for that instead.)

It is encoded as such. That's why most columns are made nullable. It's crazy to say you need to use the full power of a 1:N relation with some child table when you know N cannot be greater than 1, when a nullable column already exactly encodes a 1:(0..1) relation. I'm not trying to shill for null here: one of null's great problems is exactly the fact that null can represent "unknown", "known absent", "not applicable", or even some sentinel-ish "other" escape condition, each with their own subtle differences in how they should be handled.

Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!

I don't know if it's "patently absurd" given that it's a well-known C.J. Date take on NULLs in SQL. You may disagree with him on this - I do - but the very fact that the father of relational algebra has this take should be sufficient evidence that it's not a trivial question to be dismissed without consideration.
But the "known absent" value is going to be different for different domains. For example, in EEOC databases the "known absent" value for a race would be "declined to answer". In a database of test scores, it might be "Didn't complete", but it could also be "was absent from class on exam day" so SQL can't specify what that is. On the other hand "this value is unknown" can use the same marker in all domains, and SQL chose NULL as that marker. To be completely strict about it, "have a value/don't have a value" is one piece of data if that's something you care about and "what is that value" is another one. So in an ideal system, you should have a column for "value is [present | known absent | unknown]" and a separate column for the actual value when "value is present"

Most of the time it's not that important and people can and do shortcut "null" to mean "not present" but then the issues with using null in equality statements is a result of taking the short cut, not necessarily with the logic around null.

Do you actually need that in a Boolean context? It would only be useful for evaluating self-referent claims like "this sentence is false".
Your questions might be relevant if null were limited to boolean contexts.

It’s not.

We could really use some kind of "polymorphic table" that can represent a sum type in a convenient way.
The problem is that in practice in a database NULL is a placeholder for a missing value, not an unknown value.
Best way to think of NULL is "no value." Not "unknown value," as that implies that it is a value you just don't know what it is. Not "missing" value as that even the notion of a value being "missing" tells you something. NULL is no value. It's like a black hole, it consumes anything it touches. Any expression involving a NULL term becomes NULL.
And to further apply semantics to this just to be snide, the value is only "missing" if it could possibly ever be defined for that tuple. There are cases where you expect the value to be "missing", and it not being "missing" would be considered a data integrity issue. Fun.
Yes I should have rather written “an absence of data”. But still not “unknown”.
The result of comparisons involving NULL values can result[1][2] in UNKNOWN, and in PostgreSQL for example you can test[3] for this using IS UNKNOWN.

That said, as someone self-taught in SQL, I agree NULL was not a good choice.

Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.

[1]: https://stackoverflow.com/a/79270181

[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...

[3]: https://www.postgresql.org/docs/current/functions-comparison...

SQL was developed in the 1970s, there’s no way they’d waste all those bytes to spell out UNKNOWN and INDETERMINATE.
Have you ever seen SQL? Blowing lots of bytes on a query was not a concern. They could have made a much more compact syntax with little effort, but it wouldn't have looked kind of like English the way SQL does.

You could imagine a world where instead of:

SELECT ( email, name, outstanding_balance ) FROM accounts WHERE outstanding_balance > 0 AND last_payment > 60 ORDER BY name

the queries looked more like:

accounts: outstanding_balance > 0 & last_payment > 60 => email, ^name, outstanding_balance

There were plenty of contemporary languages that looked like the latter, but few survive to this day. SQL's relative verbosity was not seen as a problem then and is definitely not one today.

Besides, if the verbosity was a problem they could easily shorten it to UNK. That would have been perfectly normal.

Also self-taught SQLer and I don't have an issue with NULL.

I also don't use UNIQUE constraints, so maybe that has something to do with it.

I don't have an issue as such, I was a fairly experienced developer first time I had to dabble with SQL, but sometimes it can still surprise.

For example I learned the hard way that the DB we use at work does not index NULL values.

And once in a while if I'm tired or stressed I might forget about UNKNOWN and thus that "Col <> 42" does not return rows where Col is NULL.

Not that better naming would prevent such surprises, but I still think the current naming is less than optimal from a pedagogical perspective. At least I see this at times when teaching our support folks SQL (many have domain background and not a technical background).

UNKNOWN isn’t always correct though. Let’s say your data is input by users filling out a form and some of the fields are allowed to be left blank. NULL captures both the case where the user intentionally left the field blank but also the case where they accidentally skipped that field.

So NULL can capture multiple distinct concepts: unknown values (say, as a result of a calculation), not applicable (where the data has been deliberately left out), and missing data (possibly due to instrumentation or user error).

Is it really desirable to combine "unknown" and "no value" into the same identifier? They seems like very distinct concepts. Using your form example you might have a name field that includes a spot for a suffix (John Doe III) for example that many people would leave blank because they don't have a suffix. In that case you should encode it as a NULL, but if they fail to fill in their family name then that's UNKNOWN.

If you do it this way you can avoid some confusion as to what NULL represents.

John Doe NULL IS EQUAL John Doe NULL

John Doe NULL NOT EQUAL John Doe UNKNOWN

John Doe UNKNOWN NOT EQUAL John Doe UNKNOWN

Determining if any particular input is NULL or UNKNOWN is a tricky problem, but at least this gets the programmer thinking about it up front and structuring their code to behave in the sanest possible manner.

>also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.

Why would anyone want to use another database?

Simplicity. PG is often overkill for simple apps, where MySQL/Maria/et al is easier to maintain, or even SQLite for very simple apps where zero DB maintenance is preferable.
Why would you say MySQL/Maria/et al are easier to maintain for simple apps than PG?
The main pain point for smaller apps is that every major Postgres version requires an explicit migration of the underlying data representation. I get why it's there, but for simpler apps I would appreciate a flag to do it transparently.
I'm not sure what you mean. I have migrated versions without having to update any applications that connects to it?

Maybe it is a driver specific issue? I have used Python/Java, and haven't updated any of my code or dependencies because of a major Postgre update

It's not client side, it's server side. The binary format between Postgres major versions is generally not compatible so you need to do a pg_dump from the old database and do a pg_restore into the new one. pg_upgrade [1] can let you bypass this by migrating the binary data directly, but still requires having both the new and old postgres versions installed. There's also things you can do with replication, but since we're talking simplicity I don't think that really applies.

Personally I think upgrades are the one thing MySQL has on Postgres at this point.

[1] https://www.postgresql.org/docs/current/pgupgrade.html

It requires manual interventions because the upgrade process is basically dump + restore. MySQL and MariaDB upgrade between major versions automatically — you simply install the next version (or change the version tag in your container definition) and restart the server.

Usually it takes almost no time, altought might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0 when oracle completely rewrote the data format (making things like atomic ddl possible).

The GP is complaining about the need to run the upgrade cluster command, and reconfigure your server when you want to use a different version of postgres itself. MySql does it transparently if you just open the database in the new server.

Of course, lots of people also think it's a feature. But those aren't very vocal. Anyway, this is a main reason why people keep using old postgres versions, they have to think about upgrading, and they often don't.

You need to dump the database on the old PostgreSQL version and then load the dump on the new PostgreSQL version. Some other database servers can just use the old table data or migrate the data files on the fly transparently.
Do you even have to upgrade to the latest major version? What does that do, on average?
Usually its a skill issue
VACUUM
If your data's large and changing enough that you have to care about vacuuming, any reasonable database is going to require some tuning, tending and management.

I'd posit that only a tiny fraction of PostgreSQL uses have to know or care that vacuuming is a thing because the autovacuum default handle it for them.

Sure, it's never going to be plug and play, but it doesn't mean that all the issues will be equivalent. Vacuuming doesn't really have an equivalent in say, MySQL. It's something you don't have to worry about if you use the latter.

For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.

Also the reverse — MySQL et al support much more complex replication topologies out of the box, including multi master for the very rare use case when you need it.

It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.

* legacy applications or vendor lock-in

* use of a cloud provider that favours another database (like SQL server)

* some people claim mysql/maria is faster for them, simpler, or has a better support for replication

* use of sqlite for lightweight or portable apps

* domain specific databases that still use sql as their query language

* someone may want to use another database for fun or to learn something new

SQLite has its own use cases. And MySQL was all-around better than Postgres in the past, so it still has more usage in some areas. Nowadays I'll take Postgres over MySQL by default, but it's not a huge difference.
Replication
Introducing “unknown” feels like another kind of hell like undefined in JavaScript.
JavaScript's undefined is great. It's sort of similar to a maybe monad. Or IEEE 754 NaN. JS could have nicer mechanisms to handle undefined though.
Just to clarify, I'm not advocating to introduce a new `unknown` keyword. I'm saying that the existing `null` in SQL was not named properly and that the name `unknown` would have been more fitting. SQL's `null` already has the semantics of `unknown` as explained in the part of the article that I quoted.
SQL's use of "null" is probably one of the oldest instances of that concept in computing. It's exactly equivalent to unknown. That is its definition.
SQL NULL is not "exactly equivalent" to unknown. E.g. in an outer join, there's nothing unknown about the result that is missing a row from one side, yet SQL semantics is to fill it with nulls.

In practice, it behaves as "unknown" in some contexts, as "missing value" in other contexts, and sometimes it's just plain WTF like SUM() returning NULL rather than 0 if there are no rows.

Really? I know that SQL is old but I would have expected `null` to refer to pointers at first.

Going by Wikipedia, I see that SQL is from 1974 and C from 1972. Were there earlier uses/drafts where `null` is "unknown" instead of "unset"?

The term "null" comes from C. A. R. Hoare, who has proposed in November 1965 various additions to the programming language ALGOL 60, including pointers (called references by Hoare, "pointer" comes from IBM PL/I, in July 1966), and including a special value "null" for references a.k.a. pointers, for denoting unknown or not applicable values.

C. A. R. Hoare, 1965-11:

"In order to enable references to represent partial functional relationships, i.e. ones which do not necessarily yield a value, a special reference value null is introduced. This value fails to refer to a record, and any attempt to use it to refer to a record leads to an undefined result."

In his proposal, Hoare has followed the COBOL 60 terminology for some of the additions to ALGOL, i.e. "record" instead of the word "structure", introduced by IBM PL/I in 1964-12, and "record class" instead of "structure type", hence the "class" of SIMULA 67, from where the word "class" has spread into all OOP languages.

In Hoare's proposal, references a.k.a. pointers could point only to values belonging to user-defined types, i.e. records a.k.a. structures, not to primitive types.

I wouldn't necessarily define `null` as "unknown" -- it's just "no value" -- which is really the same thing and also somewhat equivalent to "unset". But null pointers aren't unset as pointers aren't initialized to null in C and you can explicitly set a pointer to null.

E.F. Codd added nulls to relational model in 1970 so that does pre-date C. The concept is even older than that I imagine.

In nth normal form, you can't have 'no value'. That would mean your model is wrong. In academic relational data books, null does mean "unknown". There is a value, we just don't know what it is (yet).

If there might actually not be such a value, you're supposed to change your schema to reflect that.

You'd also have to ask when NULL came into common use in C (to which I do not know the answer). AFAIK NULL was not considered to be standard until C89. As far as I'm aware, all C compilers implement it as a #define in a standard header somewhere; it's not actually part of the core language itself.

I wonder who first added that macro? Was it there from the beginning?

Just random thoughts...