Hacker News new | ask | show | jobs
by TeeWEE 711 days ago
Big serial is sequential and it’s very easy to guess the next number. So you got the problem of sequential key attack…

If you use only uuid in your outwards facing api then you still have the problem of slow queries. Since you need them to find the object (as mentioned below)

UUIDv7 has a random part, can be created distributedly, and indexes well.

It’s the best choice for modern application that support distributed data creation.

2 comments

Sequential numbers can not be used publicly.

Also, security can be built around not allowing querying records which are not yours.

I'm all for a little security through obscurity including UUIDs but it shouldn't be the sole thing. Easier to generate a UUID for the sequential and let the database do what it does best (relate many serials among each other).

The other part is being able to use what's built into the database out of the box without a lot more configuration.

Selfishly, I always appreciate learning more about Postgres though :)

You never expose the bigserial, you generate a ID (like UUID) for external use/identification and simply have an index over that column for fast selects.
If you have an index on the uuid anyways having a separate big serial field for PK doesn’t help that much.
As mentioned elsewhere, it ensures the ability to perform resumable and consistent batching queries across the data set without missing records.

Ordering over an insertion timestamp is not enough if two records may have the same timestamp: You may miss a record (or visit a record twice) across multiple queries.

This is solved sorting by timestamp first then by random PK UUID. Don't think a little simpler batch queries justify leaking time and quantity information or complexity of handling two types of IDs.
You wouldn't expose the numeric IDs publically, and ideally you'd use your database's automatic ID selection to avoid any complexity.

The UUID sorting works in the common case, but if you happen to end your batch near the current time, you still run the risk of losing a few records if the insert frequency is sufficiently high. Admittedly this is only a problem when you are batching through all the way to current insertions.

I agree with not baking more intelligence into a piece of data than needed, especially an index.
Having an index over the uuid is equivalent to it being a PK, so why would you bother having both?
Because it's much better for range queries and joins. When you inevitably need to take a snapshot of the table or migrate the schema somehow you'll be wishing you had something else other than a UUID as the PK.
This. Highly recommend using a numeric primary key + UUID. Using UUID relations internally can have some strategic advantages, but when UUIDv4 is used as the only primary key, you completely lose the ability to reliably iterate all records across multiple independent queries.

Also, the external thing isn't just for exposing it out to your own apps via APIs, but way more importantly for providing an unmistakable ID to store within external related systems. For example, in your Stripe metadata.

Doing this ensures that ID either exists in your own database or does not, regardless of database rollbacks, database inconsistencies etc. In those situations a numeric ID is a big question mark: Does this record correspond with the external system or was there a reuse of that ID?

I've been burnt taking over poorly managed systems that saved numeric IDs externally, and in trying to heal and migrate that data, ran into tons of problems because of ill-considered rollbacks of the database. At least after I leave the systems I build won't be subtly broken by such bad practices in the future.

Ha? Please elaborate.
When running a batched migration it is important to batch using a strictly monotonic field so that new rows wont get inserted in already processed range
It's not even necessarily it being strictly monotonic. That part does help though as you don't need to skip rows.

For me the bigger thing is the randomness. A uid being random for a given row means the opposite is true; any given index entry points to a completely random heap entry.

When backfilling this leads to massive write amplification. Consider a table with rows taking up 40 bytes, so roughly 200 entries per page. If I backfill 1k rows sorted by the id then under normal circumstances I'd expect to update 6-7 pages which is ~50kiB of heap writes.

Whereas if I do that sort of backfill with a uid then I'd expect to encounter each page on a separate row. That means 1k rows backfilled is going to be around 8MB of writes to the heap.

Strictly monotonic fields are quite expensive and the bigserial PK alone won't give you that.
Okay, but in a live DB, typically you won't have only inserts while migrating, won't you?
would creation/lastmod timestamps cover this requirement?
Amen (or similar)