Hacker News new | ask | show | jobs
by eric4smith 1300 days ago
I had a database table where I used UUID as primary key. Big mistake. Haunts us to this day.

Not sortable. Takes a lot of space. Table relationships are annoying. Etc.

What we do instead is have a secondary UUID key and keep Bigint as primary keys. Then use the UUID column in the external context instead.

UUIDs are fine for 99.99999% of the time in your own domain.

Don’t expect universal uniqueness across all domains.

11 comments

The reason to use UUIDs as primary keys is to allow creating records including primary keys outside of the database before posting them, especially in distributed systems.

UUIDs are sortable, but don’t give you creation-order sorting (of course, its abusing bigint PKs to rely on them for that, too.) If you want creation-order sorting, storing a creation timestamp and sorting on that works, and I’ve never had a db that had a business requirement for creation order sorting and didn’t also have one for actual creation time.

I've become of the mindset that most records should have creation and update timestamps - even if you do not currently plan to use them. They might come in handy later, and the code to implement them in most languages/frameworks is trivial anyway.
These days ULID (https://github.com/ulid/spec, or any of the variants), or most recently the new UUID versions (https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...) give you creation-order sorting.
What's the benefit of having the primary key before the row is stored in the database?

Maybe concurrent inserts to multiple data stores so you don't need to wait for an initial ID from the database. You'd have to trust the client to give you a “good” UUID as well as the normal distributed problems of one of the RPCs failing. I know both Twitter (called Snowflake iirc) and Google have unique ID services for this use case.

> What’s the benefit of having the primary key before the row is stored in the database?

If I have a set of linked records in a relational schema representing a complex object, I can create them all with one round trip rather than multiple.

Also, large numbers of clients can insert in that way without contention, whereas if you use a sequence generator for PKs, it becomes a resource around which there is contention when creating rows.

> You’d have to trust the client to give you a “good” UUID

Sure, this lets you scale, e.g., backend service instances (which are db clients) without (as much as otherwise) contention in the db layer, its not usually something you would do with external, untrusted clients.

> I know both Twitter (called Snowflake iirc) and Google have unique ID services for this use case.

Snowflake was one of the inspirations for the newer (draft) UUID versions [0] (though, unlike them, it had a design constraint of fitting into 64 instead of 128 bits.)

[0] https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...

> I can create them all with one round trip rather than multiple.

I see. I'd lean towards using common table expressions. The first insert statement returns the primary key and other inserts can depend on the key. I do understand it's not a panacea and composing queries can be problematic.

> a sequence generator for PKs, it becomes a resource around which there is contention

As I understand, Postgres sequences don't block which leads to a different problem [1].

[1]: https://news.ycombinator.com/item?id=27843084

> Not sortable

That's not correct; its trivial to come up with an ordering, and I don't know of a database in practice that doesn't permit sorting on a UUID.

> Table relationships are annoying

… in SQL,

  user_id REFERENCES users
It's exactly the same, regardless of the type of the column…?

> Takes a lot of space

Yes … but also no. It's 16 B vs. a serial's 4 B, I grant, but compared to a varchar, it's immaterial. (And particular in comparison to the number of times I see people use a varchar for an enum…) Certainly there could be a case where a row is wide b/c of UUIDs, but in practice, rows are wide either because of the data, or because of poor design.

AFAIK UUIDs as pks in databases is extremely standard. I would suggest the biggest downside is debugging where writing id = 5 is much easier than id = 'xxxx-xxxx-xxxx-xxxx'.
On the other hand you won’t get wrong result if you wrote wrong table name. Was hit once. Deleted row from the wrong table.
yeah, I seen bugs in multi-db systems where a dataset gets accidentally composed from data from each due to aligning ids. Was a very strange afternoon.
Just curious to learn: when do you need sortable primary keys?
It makes pagination much more efficient and possible to completely use an index (re: data is pre-sorted) if you have some sort of increasing id, vs sorting in-memory on request. Even having some sort of indexed "createdAt" field, and sorting on that, isn't super reliable for pagination when the dataset is changing. You usually want a stable sort to make pagination nice, so you add an ID to it, but then this gets very inefficient w.r.t index use if the PK is a UUID since it is not monotonic/sortable.
UUIDs are sortable, and you can do efficient pagination on them. (The results wouldn't be in any meaningful order, but that's orthogonal.)
Yeah sorry I had a brain fart and was thinking of stable pagination w/ compound sort+filter. It doesn't matter if id is UUID, sigh.

Where date > y or (date eq y and id > x) order by date, id

This can't be completely fulfilled by an index. It'll have to sort/merge somevresults in memory, very CPU expensive.

More examples here https://www.mixmax.com/engineering/api-paging-built-the-righ...

In sqlserver (and others?) you can have one clustered index. Typically this is the primary key (doesn’t have to be though).

The clustered index is actually the physical order of the data stored on disk, so any new guid causes tremendous amounts of churn.

Not sure how modern this concern is, but I would guess a LOT of SASS companies have to consider this.

Quick edit: sequential GUIDs are obviously a thing and I believe alleviate a lot of these concerns. I do not believe that was an option in sql server 2008 r2 (a version that lived a LONG time).

A clustered index is certainly not actually the physical order of the data stored on the disk, at best it tends to approximate it by placing data associated with similar keys in similar blocks. You would have to actually physically reorganize your entire clustered table or index to get it in physical order, and often you lose performance by doing so.

The performance impact of randomly distributed keys on any ordered index on that key is certainly real though, so I agree with you there.

Almost every database algorithm is based on ordering and binary search.
… and for which UUIDs are sortable.

OP called them "not sortable", which is a poor phrasing, as they're sortable. What he's likely getting at is that they're not ordered by generation time, which matters to some folks.

When you need to get a sorted list or perhaps a range?

When do you need primary keys?

that doesn't really make sense though. how would you know the range to sort by? presumably you'd use the criterion directly to sort.

a better reason and justification to not use UUIDs is that they take up needless space if you're space constrained. they also have worse performance

Maybe you are doing some sort of cursor for pagination or whatever. You can use one ID and fetch the next 50 records. Then use the last ID in those 50 to fetch the next 50, etc.
but you can still do pagination with uuids, your rows can be sorted on any arbitrary criteria like created date or updated date.
But that would be slow, even with an index, as compared to using a clustered primary key. Especially when your data doesn't fit in ram.
Dates might not be unique but I can see other possibilities that would be fine.
When is an instance you'd need to sort by ID? How does it make table relationships annoying?
I’m not the parent but for me:

1. Sort by ID if you want to see things by insert order and are using serial/autoincrement for the PK.

2. Writing (or reading, or talking about) SQL queries in which the IDs are present.

I solved the first problem by using ULID at one point but in future I’d probably not use a UUID until I had a specific need for it. And as others have mentioned, if that need is about the outside world I’d probably go with an extra column.

> 1. Sort by ID if you want to see things by insert order and are using serial/autoincrement for the PK.

Most RDBMSes (possibly all of the big ones?) do this by default if you opt for no ORDER BY clause, however that IS then depending on undefined behavior which is bad.

> Takes a lot of space

Wait, did you store them as strings instead of as a native UUID type (or any sort of 128 bit integer type)?

Storing UUID's in their binary format quickly becomes a PITA for debugging or doing manual queries. Some DB's have functions that can convert to human-readable strings which make this less painful, but it's never going to be as simple as using integer id's.

As with all things, there are tradeoffs that must be considered before building the system.

Which DB doesn't support UUID as a type so I can make sure to never use it?
More than you'd think... including the venerable MySQL.
This is a good approach. I wonder if using a UUID-aware data type (like PostgreSQL's UUID type) would improve performance without making the second column necessary?
But to use both, you have to store both, to use it as a lookup, you have to [unique] index both; you can no longer easily partition (and guarantee uniqueness)

And an auto incrementing bigint doesn't guarantee order.

And if you need universal uniqueness across all domains, just pick a 256-bit random number using a CSPRNG like /dev/urandom.
I'm surprised to hear it's not sortable. Why can't they just memcmp it?
You can and databases do.

What people want is for it to be sortable by generation time, i.e., they want it to share the "larger values were generated later" property of timestamps (but without collisions) or SERIALs (but without the locking on the serial / independent generation).

Got it. Thanks for correction.
Everytime I used uuids before I always ended up with terrible index performance