| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by drenei 1665 days ago

Bad for performance as primary keys.

But, still provide strong value as a unique identifier which is what makes them popular.

I’ve used integers as primary keys, with UUIDs as alternate keys for external-to-the-data-store queries.

7 comments

smoyer 1665 days ago

"Bad for performance as primary keys" -> "Bad for performance as primary keys in MySQL". This isn't an issue in PostgreSQL and perhaps the lesson here is that as you scale, you need to understand more about the internals of the DB system you've chosen. This isn't limited to RDBMS as it's pretty easy to show trade-offs in choosing a NoSQL as well.

thomasmg 1665 days ago

You are right, the internals are important. When I tested this, a long time ago [1] [2], I found that randomly distributed keys on PostgreSQL were indeed faster than e.g. on MySQL. Which surprised me! I still don't quite understand why. But, even with PostgreSQL, sequential (or nearly sequential) are much faster.

[1] https://markmail.org/thread/3jzqjy6cavxcrpbq [2] https://markmail.org/download.xqy?id=3jzqjy6cavxcrpbq&number...

masklinn 1665 days ago

> Which surprised me! I still don't quite understand why.

Because MySQL (specifically innodb) will cluster by the PK by default, while Posgres will not.

Meaning in MySQL, by default, the table will be stored in PK order, so when you insert an UUID at a random location any record which sorts after it has to be moved to make room, for nothing since UUID ordering is not really relevant (at least for the current standard UUIDs).

In pg, clustering is not a property (at least for now) it's an explicit point operation, which is not performed by default. So when you insert a row whether UUID or INTEGER-keyed, it's just tacked on wherever there's free space. The UUID still has to be inserted at the right place in the index' btree which can lead to a lot of cascading changes, and it's more expensive to compute, store, and load than a simple integer (especially 32b), but it doesn't induce an arbitrarily high amount of shuffling in the storage layer of the table itself.

marcosdumay 1665 days ago

While the problem on the article is less of an issue in Postgres (the indexing cache locality is still there), they are still slower than the serial ones.

I don't know if you save enough problems by using them as alternative keys in Postgres for it to be faster, my guess is that just using them as primary key would be faster than a serial primary key and an UUID alternative one. Still, UUIDs are much more useful as client-facing data than as a pure database value, so I would also do that (and standardize my PK format), probably paying a performance penalty.

cbsmith 1665 days ago

"Slower" is a matter of context. Like with most things, there are trade offs to allow for more concurrency. UUIDs are intended to avoid serialization with ID generation, which makes sense in a federated system. Yes, they take up more space and aren't all neatly ordered, but the trade off works out well for the right kind of systems (FlakeIDs are a better trade off for more contexts). The problem this article is describing is you're getting the worst of both worlds, because you're still serializing in the MySQL database...

ris 1665 days ago

They are still generally found to be slower to use than sequential primary keys in postgresql.

ttfkam 1664 days ago

There are remedies for that while still keeping the advantages of UUIDs. https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...

datavirtue 1665 days ago

What if you convert from MySql to postgres? Will the uuids still benefit from postgres optimizations?

isoos 1665 days ago

Rather: bad for performance (as primary keys) when read/write in sequential order.

Great for (primary keys or anything) in a distributed/sharded database (e.g. CockroachDB), when data access is mostly by keys.

ei8ths 1665 days ago

this is what i do, any public facing id - url, or js/html that would show an id would be a uuid and then the backend queries are done with primary keys / integers.

aaronharnly 1665 days ago

We do the same.

jasfi 1665 days ago

I do something similar. It works well.

KingOfCoders 1665 days ago

Done the same :-)

globular-toast 1665 days ago

Yeah. The main problem is people using them as primary keys in naïve systems like relational databases. You can't just expect a relational database to magically become a distributed system just by using UUIDs. There is a bit more work to do than that.

ryanbrunner 1665 days ago

People use UUIDs for more reasons than just making things distributed. You can generate them client side if that's advantageous, you prevent leaking information about how many records there are in the system and prevent guessing of other potential PKs and potential unauthorized access, and there's some optimization strategies that benefit from not relying on a serial PK.

cbsmith 1665 days ago

If you're obfuscating, then you should be using a DHT to map surrogate keys to internal keys that are more convenient for use in a MySQL index... or just stop using MySQL.

globular-toast 1665 days ago

Well they shouldn't. UUIDs are for distributed systems. Generating keys client side? That's a distributed system.

ryanbrunner 1665 days ago

I've had occasional uses for it in systems that aren't distributed at all. It's a handy property of UUIDs, if UUIDs are useful for a particular use case, I'm going to use them, what they're 'supposed' to be for is irrelevant.

JimDabell 1665 days ago

> You can't just expect a relational database to magically become a distributed system just by using UUIDs.

Nobody thinks this, do they?

lazide 1665 days ago

You’d be surprised. Unpleasantly, unfortunately.

Not sure if the term gets used much now a days, but ‘cargo cult’ programming is definitely a thing still. Never stopped.

datavirtue 1665 days ago

Somebody thought RabbitMQ did load balancing because it had three redundant nodes. Pretty much ruined the company. Definitely caused all of us to lose our equity.

withinboredom 1665 days ago

I’d love to hear that story.

globular-toast 1665 days ago

Is it too much to ask people to explain what's wrong with this well-written and on-topic comment instead of downvoting it?

jitl 1665 days ago

Calling relational databases (HN’s preferred storage system) naive probably sounds like trolling to most people. There are also plenty of distributed relational databases. People downvote comments that sounds like trolling or flamebait.

I use UUIDs but I don’t know why they would magically make my Postgres a distributed system. I like them because the client can generate them offline.

globular-toast 1665 days ago

What? Really? Naïve in this context means a general purpose solution that doesn't "know" about your use case. Have people never heard of a naïve algorithm or solution?

Distributed relational databases aren't naïve in this context. MySQL is.

CaptainNegative 1665 days ago

> Have people never heard of a naïve algorithm or solution?

I have a fairly recent PhD in algorithms and I haven't heard naïve used this way. When I hear naïve, it usually just means "does the immediately obvious thing".

For what you're trying to say, the term I'm familiar with is "oblivious", e.g. "oblivious routing" or "oblivious local search", occasionally with modifiers such as "cache-oblivious".

globular-toast 1665 days ago

I've never heard use of the word "oblivious", but my education is 15 years old at this point.

9wzYQbTYsAIc 1665 days ago

Naive is an emotionally loaded word outside of academic communities.

heurisko 1665 days ago

"naïve systems" doesn't make sense.

Nobody is assuming simply adding a UUID transforms a system into a distributed one.

The parent comment is about exposing UUIDs probably in order to not expose the sort order of the database.