Hacker News new | ask | show | jobs
by drenei 1618 days ago
Bad for performance as primary keys.

But, still provide strong value as a unique identifier which is what makes them popular.

I’ve used integers as primary keys, with UUIDs as alternate keys for external-to-the-data-store queries.

7 comments

"Bad for performance as primary keys" -> "Bad for performance as primary keys in MySQL". This isn't an issue in PostgreSQL and perhaps the lesson here is that as you scale, you need to understand more about the internals of the DB system you've chosen. This isn't limited to RDBMS as it's pretty easy to show trade-offs in choosing a NoSQL as well.
You are right, the internals are important. When I tested this, a long time ago [1] [2], I found that randomly distributed keys on PostgreSQL were indeed faster than e.g. on MySQL. Which surprised me! I still don't quite understand why. But, even with PostgreSQL, sequential (or nearly sequential) are much faster.

[1] https://markmail.org/thread/3jzqjy6cavxcrpbq [2] https://markmail.org/download.xqy?id=3jzqjy6cavxcrpbq&number...

> Which surprised me! I still don't quite understand why.

Because MySQL (specifically innodb) will cluster by the PK by default, while Posgres will not.

Meaning in MySQL, by default, the table will be stored in PK order, so when you insert an UUID at a random location any record which sorts after it has to be moved to make room, for nothing since UUID ordering is not really relevant (at least for the current standard UUIDs).

In pg, clustering is not a property (at least for now) it's an explicit point operation, which is not performed by default. So when you insert a row whether UUID or INTEGER-keyed, it's just tacked on wherever there's free space. The UUID still has to be inserted at the right place in the index' btree which can lead to a lot of cascading changes, and it's more expensive to compute, store, and load than a simple integer (especially 32b), but it doesn't induce an arbitrarily high amount of shuffling in the storage layer of the table itself.

While the problem on the article is less of an issue in Postgres (the indexing cache locality is still there), they are still slower than the serial ones.

I don't know if you save enough problems by using them as alternative keys in Postgres for it to be faster, my guess is that just using them as primary key would be faster than a serial primary key and an UUID alternative one. Still, UUIDs are much more useful as client-facing data than as a pure database value, so I would also do that (and standardize my PK format), probably paying a performance penalty.

"Slower" is a matter of context. Like with most things, there are trade offs to allow for more concurrency. UUIDs are intended to avoid serialization with ID generation, which makes sense in a federated system. Yes, they take up more space and aren't all neatly ordered, but the trade off works out well for the right kind of systems (FlakeIDs are a better trade off for more contexts). The problem this article is describing is you're getting the worst of both worlds, because you're still serializing in the MySQL database...
They are still generally found to be slower to use than sequential primary keys in postgresql.
There are remedies for that while still keeping the advantages of UUIDs. https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...
What if you convert from MySql to postgres? Will the uuids still benefit from postgres optimizations?
Rather: bad for performance (as primary keys) when read/write in sequential order.

Great for (primary keys or anything) in a distributed/sharded database (e.g. CockroachDB), when data access is mostly by keys.

this is what i do, any public facing id - url, or js/html that would show an id would be a uuid and then the backend queries are done with primary keys / integers.
We do the same.
I do something similar. It works well.
Done the same :-)
Yeah. The main problem is people using them as primary keys in naïve systems like relational databases. You can't just expect a relational database to magically become a distributed system just by using UUIDs. There is a bit more work to do than that.
People use UUIDs for more reasons than just making things distributed. You can generate them client side if that's advantageous, you prevent leaking information about how many records there are in the system and prevent guessing of other potential PKs and potential unauthorized access, and there's some optimization strategies that benefit from not relying on a serial PK.
If you're obfuscating, then you should be using a DHT to map surrogate keys to internal keys that are more convenient for use in a MySQL index... or just stop using MySQL.
Well they shouldn't. UUIDs are for distributed systems. Generating keys client side? That's a distributed system.
I've had occasional uses for it in systems that aren't distributed at all. It's a handy property of UUIDs, if UUIDs are useful for a particular use case, I'm going to use them, what they're 'supposed' to be for is irrelevant.
> You can't just expect a relational database to magically become a distributed system just by using UUIDs.

Nobody thinks this, do they?

You’d be surprised. Unpleasantly, unfortunately.

Not sure if the term gets used much now a days, but ‘cargo cult’ programming is definitely a thing still. Never stopped.

Somebody thought RabbitMQ did load balancing because it had three redundant nodes. Pretty much ruined the company. Definitely caused all of us to lose our equity.
I’d love to hear that story.
Is it too much to ask people to explain what's wrong with this well-written and on-topic comment instead of downvoting it?
Calling relational databases (HN’s preferred storage system) naive probably sounds like trolling to most people. There are also plenty of distributed relational databases. People downvote comments that sounds like trolling or flamebait.

I use UUIDs but I don’t know why they would magically make my Postgres a distributed system. I like them because the client can generate them offline.

What? Really? Naïve in this context means a general purpose solution that doesn't "know" about your use case. Have people never heard of a naïve algorithm or solution?

Distributed relational databases aren't naïve in this context. MySQL is.

> Have people never heard of a naïve algorithm or solution?

I have a fairly recent PhD in algorithms and I haven't heard naïve used this way. When I hear naïve, it usually just means "does the immediately obvious thing".

For what you're trying to say, the term I'm familiar with is "oblivious", e.g. "oblivious routing" or "oblivious local search", occasionally with modifiers such as "cache-oblivious".

I've never heard use of the word "oblivious", but my education is 15 years old at this point.
Naive is an emotionally loaded word outside of academic communities.
"naïve systems" doesn't make sense.

Nobody is assuming simply adding a UUID transforms a system into a distributed one.

The parent comment is about exposing UUIDs probably in order to not expose the sort order of the database.