| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by demurgos 569 days ago

> select null = null; returns NULL, because each NULL is basically a placeholder representing any “unknown value”. Two unknown values are not necessarily the same value; we can’t say that they are equal, because we don’t know the value of either of them.

Agreed with all of this, it would probably have been better if they were named `unknown` instead of reusing the `null` keyword.

Note also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.

0: https://www.postgresql.org/docs/15/sql-createtable.html

5 comments

layer8 569 days ago

SQL NULL doesn’t behave like “unknown” in all contexts. That’s one issue of NULL, that it doesn’t really have consistent semantics.

masklinn 569 days ago

Furthermore if null only means unknown then we need a value for “known absent”, there’s a reason why null is so often used as that.

int_19h 569 days ago

From a purely relational perspective, if some piece of data can be absent, it's a 1:N relation where N<=1, and ought to be encoded as such.

(Of course, this is rather awkward in practice, and when NULL is there, it's inevitably going to be used for that instead.)

feoren 569 days ago

It is encoded as such. That's why most columns are made nullable. It's crazy to say you need to use the full power of a 1:N relation with some child table when you know N cannot be greater than 1, when a nullable column already exactly encodes a 1:(0..1) relation. I'm not trying to shill for null here: one of null's great problems is exactly the fact that null can represent "unknown", "known absent", "not applicable", or even some sentinel-ish "other" escape condition, each with their own subtle differences in how they should be handled.

Null has tons of problems, of course. But it's patently absurd to claim that you "ought to be" making a different 1-column table with a unique foreign key or shared primary key for every column that might be absent, because of some vague appeal to the fact that you can write it as a 1:N relation with N<=1. You can just as easily claim that every non-nullable column is a 1:N relation where N==1 and "should be encoded as such". It is encoded as such! That's what a column is!

int_19h 565 days ago

I don't know if it's "patently absurd" given that it's a well-known C.J. Date take on NULLs in SQL. You may disagree with him on this - I do - but the very fact that the father of relational algebra has this take should be sufficient evidence that it's not a trivial question to be dismissed without consideration.

tpmoney 568 days ago

But the "known absent" value is going to be different for different domains. For example, in EEOC databases the "known absent" value for a race would be "declined to answer". In a database of test scores, it might be "Didn't complete", but it could also be "was absent from class on exam day" so SQL can't specify what that is. On the other hand "this value is unknown" can use the same marker in all domains, and SQL chose NULL as that marker. To be completely strict about it, "have a value/don't have a value" is one piece of data if that's something you care about and "what is that value" is another one. So in an ideal system, you should have a column for "value is [present | known absent | unknown]" and a separate column for the actual value when "value is present"

Most of the time it's not that important and people can and do shortcut "null" to mean "not present" but then the issues with using null in equality statements is a result of taking the short cut, not necessarily with the logic around null.

thaumasiotes 569 days ago

Do you actually need that in a Boolean context? It would only be useful for evaluating self-referent claims like "this sentence is false".

masklinn 569 days ago

Your questions might be relevant if null were limited to boolean contexts.

It’s not.

marcosdumay 568 days ago

We could really use some kind of "polymorphic table" that can represent a sum type in a convenient way.

cm2187 569 days ago

The problem is that in practice in a database NULL is a placeholder for a missing value, not an unknown value.

SoftTalker 568 days ago

Best way to think of NULL is "no value." Not "unknown value," as that implies that it is a value you just don't know what it is. Not "missing" value as that even the notion of a value being "missing" tells you something. NULL is no value. It's like a black hole, it consumes anything it touches. Any expression involving a NULL term becomes NULL.

bballer 568 days ago

And to further apply semantics to this just to be snide, the value is only "missing" if it could possibly ever be defined for that tuple. There are cases where you expect the value to be "missing", and it not being "missing" would be considered a data integrity issue. Fun.

cm2187 568 days ago

Yes I should have rather written “an absence of data”. But still not “unknown”.

magicalhippo 569 days ago

The result of comparisons involving NULL values can result[1][2] in UNKNOWN, and in PostgreSQL for example you can test[3] for this using IS UNKNOWN.

That said, as someone self-taught in SQL, I agree NULL was not a good choice.

Replacing NULL with UNKNOWN and the third boolean value as INDETERMINATE for example would have been better.

[1]: https://stackoverflow.com/a/79270181

[2]: https://learn.microsoft.com/en-us/sql/t-sql/language-element...

[3]: https://www.postgresql.org/docs/current/functions-comparison...

labster 569 days ago

SQL was developed in the 1970s, there’s no way they’d waste all those bytes to spell out UNKNOWN and INDETERMINATE.

jandrese 568 days ago

Have you ever seen SQL? Blowing lots of bytes on a query was not a concern. They could have made a much more compact syntax with little effort, but it wouldn't have looked kind of like English the way SQL does.

You could imagine a world where instead of:

SELECT ( email, name, outstanding_balance ) FROM accounts WHERE outstanding_balance > 0 AND last_payment > 60 ORDER BY name

the queries looked more like:

accounts: outstanding_balance > 0 & last_payment > 60 => email, ^name, outstanding_balance

There were plenty of contemporary languages that looked like the latter, but few survive to this day. SQL's relative verbosity was not seen as a problem then and is definitely not one today.

Besides, if the verbosity was a problem they could easily shorten it to UNK. That would have been perfectly normal.

otteromkram 569 days ago

Also self-taught SQLer and I don't have an issue with NULL.

I also don't use UNIQUE constraints, so maybe that has something to do with it.

magicalhippo 569 days ago

I don't have an issue as such, I was a fairly experienced developer first time I had to dabble with SQL, but sometimes it can still surprise.

For example I learned the hard way that the DB we use at work does not index NULL values.

And once in a while if I'm tired or stressed I might forget about UNKNOWN and thus that "Col <> 42" does not return rows where Col is NULL.

Not that better naming would prevent such surprises, but I still think the current naming is less than optimal from a pedagogical perspective. At least I see this at times when teaching our support folks SQL (many have domain background and not a technical background).

chongli 569 days ago

UNKNOWN isn’t always correct though. Let’s say your data is input by users filling out a form and some of the fields are allowed to be left blank. NULL captures both the case where the user intentionally left the field blank but also the case where they accidentally skipped that field.

So NULL can capture multiple distinct concepts: unknown values (say, as a result of a calculation), not applicable (where the data has been deliberately left out), and missing data (possibly due to instrumentation or user error).

jandrese 568 days ago

Is it really desirable to combine "unknown" and "no value" into the same identifier? They seems like very distinct concepts. Using your form example you might have a name field that includes a spot for a suffix (John Doe III) for example that many people would leave blank because they don't have a suffix. In that case you should encode it as a NULL, but if they fail to fill in their family name then that's UNKNOWN.

If you do it this way you can avoid some confusion as to what NULL represents.

John Doe NULL IS EQUAL John Doe NULL

John Doe NULL NOT EQUAL John Doe UNKNOWN

John Doe UNKNOWN NOT EQUAL John Doe UNKNOWN

Determining if any particular input is NULL or UNKNOWN is a tricky problem, but at least this gets the programmer thinking about it up front and structuring their code to behave in the sanest possible manner.

NoMoreNicksLeft 569 days ago

>also that since Postgresql 15, you can use `NULLS NOT DISTINCT` when creating a unique index [0]. I'm less familiar with other databases.

Why would anyone want to use another database?

stronglikedan 569 days ago

Simplicity. PG is often overkill for simple apps, where MySQL/Maria/et al is easier to maintain, or even SQLite for very simple apps where zero DB maintenance is preferable.

lcnPylGDnU4H9OF 569 days ago

Why would you say MySQL/Maria/et al are easier to maintain for simple apps than PG?

demurgos 569 days ago

The main pain point for smaller apps is that every major Postgres version requires an explicit migration of the underlying data representation. I get why it's there, but for simpler apps I would appreciate a flag to do it transparently.

throwaway10235 569 days ago

I'm not sure what you mean. I have migrated versions without having to update any applications that connects to it?

Maybe it is a driver specific issue? I have used Python/Java, and haven't updated any of my code or dependencies because of a major Postgre update

Volundr 569 days ago

It's not client side, it's server side. The binary format between Postgres major versions is generally not compatible so you need to do a pg_dump from the old database and do a pg_restore into the new one. pg_upgrade [1] can let you bypass this by migrating the binary data directly, but still requires having both the new and old postgres versions installed. There's also things you can do with replication, but since we're talking simplicity I don't think that really applies.

Personally I think upgrades are the one thing MySQL has on Postgres at this point.

[1] https://www.postgresql.org/docs/current/pgupgrade.html

homebrewer 569 days ago

It requires manual interventions because the upgrade process is basically dump + restore. MySQL and MariaDB upgrade between major versions automatically — you simply install the next version (or change the version tag in your container definition) and restart the server.

Usually it takes almost no time, altought might be just as slow as PG when major changes to the data format are introduced. The only example I can remember is 8.0 when oracle completely rewrote the data format (making things like atomic ddl possible).

marcosdumay 568 days ago

The GP is complaining about the need to run the upgrade cluster command, and reconfigure your server when you want to use a different version of postgres itself. MySql does it transparently if you just open the database in the new server.

Of course, lots of people also think it's a feature. But those aren't very vocal. Anyway, this is a main reason why people keep using old postgres versions, they have to think about upgrading, and they often don't.

panzi 569 days ago

You need to dump the database on the old PostgreSQL version and then load the dump on the new PostgreSQL version. Some other database servers can just use the old table data or migrate the data files on the fly transparently.

wruza 569 days ago

Do you even have to upgrade to the latest major version? What does that do, on average?

phplovesong 569 days ago

Usually its a skill issue

badlibrarian 569 days ago

VACUUM

kstrauser 569 days ago

If your data's large and changing enough that you have to care about vacuuming, any reasonable database is going to require some tuning, tending and management.

I'd posit that only a tiny fraction of PostgreSQL uses have to know or care that vacuuming is a thing because the autovacuum default handle it for them.

mardifoufs 569 days ago

Sure, it's never going to be plug and play, but it doesn't mean that all the issues will be equivalent. Vacuuming doesn't really have an equivalent in say, MySQL. It's something you don't have to worry about if you use the latter.

For example, HA and clustering will always be challenging to deploy/maintain, but you will still have a harder time doing that with postgres than with MySQL. Postgres also has a lot of benefits obviously, though.

homebrewer 569 days ago

Also the reverse — MySQL et al support much more complex replication topologies out of the box, including multi master for the very rare use case when you need it.

It's also much easier to tune, most database instances require setting innodb_buffer_pool_size, and that's basically it. Newer versions can even set it automatically if you're fine with consuming all memory on that machine, thus requiring no tuning at all.

poincaredisk 569 days ago

* legacy applications or vendor lock-in

* use of a cloud provider that favours another database (like SQL server)

* some people claim mysql/maria is faster for them, simpler, or has a better support for replication

* use of sqlite for lightweight or portable apps

* domain specific databases that still use sql as their query language

* someone may want to use another database for fun or to learn something new

hot_gril 569 days ago

SQLite has its own use cases. And MySQL was all-around better than Postgres in the past, so it still has more usage in some areas. Nowadays I'll take Postgres over MySQL by default, but it's not a huge difference.

password4321 569 days ago

Replication

duncan-donuts 569 days ago

Introducing “unknown” feels like another kind of hell like undefined in JavaScript.

jampekka 569 days ago

JavaScript's undefined is great. It's sort of similar to a maybe monad. Or IEEE 754 NaN. JS could have nicer mechanisms to handle undefined though.

demurgos 569 days ago

Just to clarify, I'm not advocating to introduce a new `unknown` keyword. I'm saying that the existing `null` in SQL was not named properly and that the name `unknown` would have been more fitting. SQL's `null` already has the semantics of `unknown` as explained in the part of the article that I quoted.

wvenable 569 days ago

SQL's use of "null" is probably one of the oldest instances of that concept in computing. It's exactly equivalent to unknown. That is its definition.

int_19h 569 days ago

SQL NULL is not "exactly equivalent" to unknown. E.g. in an outer join, there's nothing unknown about the result that is missing a row from one side, yet SQL semantics is to fill it with nulls.

In practice, it behaves as "unknown" in some contexts, as "missing value" in other contexts, and sometimes it's just plain WTF like SUM() returning NULL rather than 0 if there are no rows.

demurgos 569 days ago

Really? I know that SQL is old but I would have expected `null` to refer to pointers at first.

Going by Wikipedia, I see that SQL is from 1974 and C from 1972. Were there earlier uses/drafts where `null` is "unknown" instead of "unset"?

adrian_b 569 days ago

The term "null" comes from C. A. R. Hoare, who has proposed in November 1965 various additions to the programming language ALGOL 60, including pointers (called references by Hoare, "pointer" comes from IBM PL/I, in July 1966), and including a special value "null" for references a.k.a. pointers, for denoting unknown or not applicable values.

C. A. R. Hoare, 1965-11:

"In order to enable references to represent partial functional relationships, i.e. ones which do not necessarily yield a value, a special reference value null is introduced. This value fails to refer to a record, and any attempt to use it to refer to a record leads to an undefined result."

In his proposal, Hoare has followed the COBOL 60 terminology for some of the additions to ALGOL, i.e. "record" instead of the word "structure", introduced by IBM PL/I in 1964-12, and "record class" instead of "structure type", hence the "class" of SIMULA 67, from where the word "class" has spread into all OOP languages.

In Hoare's proposal, references a.k.a. pointers could point only to values belonging to user-defined types, i.e. records a.k.a. structures, not to primitive types.

wvenable 569 days ago

I wouldn't necessarily define `null` as "unknown" -- it's just "no value" -- which is really the same thing and also somewhat equivalent to "unset". But null pointers aren't unset as pointers aren't initialized to null in C and you can explicitly set a pointer to null.

E.F. Codd added nulls to relational model in 1970 so that does pre-date C. The concept is even older than that I imagine.

recursive 569 days ago

In nth normal form, you can't have 'no value'. That would mean your model is wrong. In academic relational data books, null does mean "unknown". There is a value, we just don't know what it is (yet).

If there might actually not be such a value, you're supposed to change your schema to reflect that.

floating-io 569 days ago

You'd also have to ask when NULL came into common use in C (to which I do not know the answer). AFAIK NULL was not considered to be standard until C89. As far as I'm aware, all C compilers implement it as a #define in a standard header somewhere; it's not actually part of the core language itself.

I wonder who first added that macro? Was it there from the beginning?

Just random thoughts...