Hacker News new | ask | show | jobs
by antender 1888 days ago
Also, from the point of query optimisation this is a really bad idea. Usually you DO actually care about size of fields in SQL databases, because something like BOOLEAN is usually stored as single byte (or bit in a bitfield) vs 4 bytes or even 8 in case of timestamp. This not only multiplies on disk usage by at least 4 times, but also makes ALL indexes using this field way bigger. Also boolean indexes can be compressed (or stored as bitmaps), while timestamp indexes contain lots of unique values, so they can't be. This is also the reason why serial IDs are way better than UUIDs for internal IDs.
4 comments

> This is also the reason why serial IDs are way better than UUIDs for internal IDs.

There are three core problems with that:

  a) Serial IDs are a nightmare for database merges, clustering or anything like that
  b) Serial IDs won't scale
  c) Serial IDs require management, whilst UUIDs can be produced anywhere (in DB, in frontend etc)
There is the KSUID[1] if people want a time-sortable thing that is near-enough to a UUID.

[1] https://github.com/segmentio/ksuid

You can do both. Use the UUIDs where you need the scalability and serial where you don’t.
> Serial IDs won't scale

You mean scaling into several machines? Yes, they do scale. Nothing requires that the values are always increasing and have no holes, so you can slice and cluster them at will (and many DBMS do exactly that).

I'm confused? Serial means "always increasing and having no holes" -- it's one thing after the other. What you're arguing for sounds like just IDs, not serial IDs.
Well, by your definition what people use as serial IDs isn't serial.

Databases have the concept of sequences, that are closer to your definition, but make no promises about holes (they normally don't generate holes by themselves, but there is no way to guarantee you won't lose numbers upon usage). It is common to use sequences to feed serial IDs, but not all DBMS do that and it's not a requirement in any way.

IMO if you’re going this route, may as well just implement a snowflake clone and get the best of all worlds.
Re: Snowflake clone I would humbly invite you to read the note in the KSUID Github, namely:

  *To fit into a 64-bit number space, Snowflake IDs and its derivatives require coordination to avoid collisions, which significantly increases the deployment complexity and operational burden.*
Therefore KSUID remains the best option.
I don't necessarily agree with your conclusion, but I will admit you can't put KSUID and Snowflake in a room and declare either of them the winner.

I would argue that doubling your key-space in the case of KSUIDs may have just as much of an impact as coordinating node ids in the case of snowflake (and in fact the snowflake technique only runs into coordination problems when you're at pretty extreme scale).

I think this is still workable with an expression/functional index, where the indexed expression is "happened_at IS NOT NULL".
Seems like a whole lot of extra trouble to be honest. What happens when you need to create a composite index?
You can do indexes that are composites of plain field values and/or expressions! At least in Postgres.

As for the trouble, I agree, it's a bit more involved.

Storing both boolean and timestamp with the correct constraints seems like a better solution to me.
This is the typical solution for MySQL. There are indexable virtual generated columns but don't seem to have been used/around enough to be battle tested.

Using just the nullable timestamp only lets you index on one of the boolean states: an index on (ts, col2) works for WHERE ts IS NULL ORDER BY col2, but WHERE ts IS NOT NULL ORDER BY col2 doesn't work well (without using generated boolean, etc).

Theoretically, this should compress quite well. In practice, I don't know how much compression db engines apply.