| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by skrebbel 3725 days ago
	I'm daft. What's wrong with using uuids as primary keys everywhere? I'd appreciate it if you could elaborate :)

6 comments

mvitorino 3725 days ago

The problem is that most uuids are generated in a way that, when sorted alphabetically as strings, would have a random order.

Example: t0: 7458e3a9-716b-4352-b2e4-b5b67d0c089b t1: 4d8d753c-1777-439d-8725-b093b1bd8430

Using this as a PK in any relational database will mean the rows are stored in the clustered index order, which causes extreme fragmentation because the db engine constantly has to find "holes" in the data pages to insert more recent records instead of adding those at the bottom of the table as would happen with any continuously increasing key.

ww520 3725 days ago

This is a non-issue. Most db engines just append new data rows at the end of the data table and assign an internal row ID for it. The keys (UUID in this case) are stored in the separate index pages using B+ tree, which searches random key (UUID) or sequentially incremented key equally well.

UUID key is a problem only if your main query is a range query on the PK of a clustered indexed table. If your main query is a range query, you should pick something that can be sorted in the range anyway.

mvitorino 3725 days ago

A lot of times when a UUID is used, it's not generated in the db, such as when you need to generate guaranteed unique IDs client side. Also 4 bytes vs 16 is not "equally well" in terms of performance. Note that the the diagram linked at the root of this thread is encoding uuids as strings which means 32-36 bytes instead of 16.

ddorian43 3725 days ago

Actually only postgresql does it this way (by storing data in the heap and not in the primary index). Mysql(innodb),mssql,oracle uses the store the row in the primary-key.

Edit: I'm ~wrong, see below.

ww520 3725 days ago

Only Innodb does it by default. The others use heap table by default.

MSSQL allows clustered indexed table as an option to order the physical storage of rows. Oracle has index-organized table as an option.

Edit: They don't use clustered indexed table by default because record insertion is very expensive since clustered index forces the table to store the records contiguously in the index's order. Also Innodb is not truely clustered indexed. It only stores records contiguously for one page at the B+ Tree leaf level. Records in different pages are scattered all over even if the index values are sequential.

Jweb_Guru 3725 days ago

It's pretty variable, but Postgres definitely is not the only one that does it that way (or even close).

mvitorino 3725 days ago

Not the case with MS SQL Server for example (and there are probably several others). Physical row ordering (at least up to 2008 edition as far as I can recall) coincides with clustered index.

wiredfool 3725 days ago

I don't believe that the Primary Key affects the physical ordering of records in PostgreSQL.

Primary Key is implemented as a unique index coupled with a not null constraint. It's essentially syntactic sugar.

mortehu 3725 days ago

To elaborate, even if you do call CLUSTER[1] to enforce a physical ordering, the primary key is not the default (there is no default).

1. http://www.postgresql.org/docs/9.5/static/sql-cluster.html

gnaritas 3725 days ago

That would be true except for that fact that database designers know this, and offer a way to generate sequential GUID's; so it's actually not a problem for that reason. The only real downside is performance, an int key performs better.

mvitorino 3725 days ago

A number of scenarios that require UUIDs as keys, involve generation client side where such guarantees may not be available. In my view, UUIDs should not be used unless there is a strong reason for it (such as making keys not guessable). Most people will implement them incorrectly/inefficiently.

jsmthrowaway 3725 days ago

> an int key performs better.

A UUID is a 128-bit integer. That people do not store, generate, or interact with them that way is the bug.

I'd kill for a 128-bit architecture so a UUID compare would be a single instruction, and it's too bad consensus is that we don't really need it.

mortehu 3725 days ago

> I'd kill for a 128-bit architecture so a UUID compare would be a single instruction

Well, there's PCMPESTRI on Intel.

http://www.felixcloutier.com/x86/PCMPESTRI.html

jsmthrowaway 3725 days ago

Yeah, I just said this in another comment, but SSE is my weak point; I stopped writing assembly and dealing with this stuff before SSE was even introduced.

That looks basically like what you need, though, yeah. Cool.

gnaritas 3725 days ago

An int is either 4 or 8 bytes, it's still going to perform better no matter how you treat the GUID.

edit: bytes

jessedhillon 3725 days ago

> An int is either 4 or 8 bits

This perfectly summarizes the bitwise-innumeracy of the argument against UUIDs. Even without 128 bit word sizes, most UUIDs are going to be non-matches in their lower bits, assuming a random distribution. There's no computational efficiency gained in the vastly wide critical path here.

gnaritas 3725 days ago

> most UUIDs are going to be non-matches in their lower bits, assuming a random distribution

But you can't assume a random distribution because anyone using UUID's for database keys isn't using random UUID's, they're using sequential ones.

jandrese 3725 days ago

I hope you mean bytes. On a 128bit architecture that would be a single instruction compare, just like a 32 or 64 bit compare. Performance would be roughly the same (minus maybe a cache miss because you blew it out with your fat ints).

dhd415 3725 days ago

Every sequential GUID generation feature provides guarantees only per session. In other words, when you bring your database down to upgrade or patch it and then bring it back up, your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.

gnaritas 3725 days ago

> your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.

I don't believe that's true. You're right that they start the session over, but it'll still be a fresh sequence that doesn't overlap anything generated in the past.

enobrev 3725 days ago

So you're saying it's specifically less efficient for a write-heavy tables, correct?

I generally prefer auto-incremented integers, but UUIDs are very useful for client-generated records (like for an app/website that's built for offline usage).

acscott 3725 days ago

It's also bigger than necessary when a sequence will do. Large PKs mean larger indexes. For very large databases or very high volume every little inefficiency adds up.

As I always say, if we don't split hairs now, we'll be splitting heads later.

Retra 3725 days ago

I don't understand this... How does the sort order in the index have anything to do with the arrangement of the table? If you build an index over the UUIDs, it's just going to refer to rows in the table by their internal offsets.

dspillett 3725 days ago

>How does the sort order in the index have anything to do with the arrangement of the table?

If you have a clustered index based upon that column (and in systems that support this it is common for the primary key to be the clustered index for the table) then the physical layout of the data will follow this column.

See http://use-the-index-luke.com/blog/2014-01/unreasonable-defa... as a good run down of the whys and wherefores of clustered indexes.

slagfart 3725 days ago

Having a sort order is immensely preferable when you're joining multiple sequential rows at once. The optimiser can create a pseudo-partition for your join, and constrain its reads to that. Especially on spinning rust (or I/O constrained systems), you're also reading multiple sequential pages in a stream off disk, which gives a big speed boost compared to random reads.

Without that ordering, the index is 'dumb', and has to find all your rows one by one in its structure. Admittedly it's not as much of an issue for SSDs, but it does still impact performance.

nickpeterson 3725 days ago

First there is the question of whether one should have surrogate keys at all, some oppose surrogate keys because: They make tables more difficult to reason about. They encourage structures that involve large amounts of joins.

Others embrace surrogate keys because: They make joins between tables easier (usually one field rather than a compound key where one might accidentally not include all the join predicates). They save space since an integer is almost always smaller than a compound key.

So if you are in the camp that favors surrogate keys, then consider these points:

Some quick downsides for UUIDs: If your table has a clustered index, UUIDs are generally not created in a regular order, and so you'll be constantly inserting into a part of your index rather than the end. Some databases allow sequential UUID generation to help mitigate this. UUIDs are usually 16 bytes, but an int is usually 1-8 bytes. Also keep in mind you pay this price multiple times since the primary key (generally) becomes the foreign key in other tables.

Some upsides for UUIDs: If you have a process that has no master, being able to tie disparate things together with a shared UUID is very useful sometimes.

Often developers use UUIDs/GUIDs because: Their app wants to perform a 'Create' operation and they want to generate the key in advance.

This can usually be mitigated by creating a routine that generates a range of reserved surrogate integer keys, and gives a starting value to the application layer. Then the application layer can use those values with knowledge they won't already exist. For instance, the app layer say, "I'm going to make 100 customers", and the response comes back "10232". Meaning that 10132-10232 is reserved for that process to use for customer primary keys. This allows larger (chunkier) requests which can be interrelated without constantly making requests to the Database layer. One could extent this to be a true cache for entity reserved numbers that gets dolled out as needed by the app layer.

ww520 3725 days ago

UUID as PK is perfectly fine; however, as with any design decision it really depends on your needs and weighing alternatives.

Some popular ways to do PK: natural key, sequentially generated key, and UUID. Personally I would prefer natural key if I can find an immutable natural key for the table. However, natural keys are hard to find. The food example in this case doesn't have natural key. Also if the natural key requires a compound key, it's just not worth the pain.

Sequential auto generated key is good when you need to hand it off to users, like order id or ticket number. It's short and simple and it's auto generated. The downsize is when you migrate databases, you need to seed the new database carefully or it would start from beginning again. Also in record creation, you need to do extra read to get back the newly generated key.

UUID is mostly worry free. It can be generated anywhere and doesn't need to be in the database. For a setup with distributed databases, I would use UUID just to have global uniqueness. For an offline app, I would use UUID to create the data records locally and later sync them with the main database. UUID is good when it's used internally and not exposed to the users.

jon-wood 3725 days ago

Sequential IDs can also leak business information when exposed to customers such as how many orders your taking - an interested party can place an order at 10am and another at 11am. Once you've done that you can compare the IDs and know how many orders were taken that hour.

irl_zebra 3725 days ago

Why should UUIDs not be exposed to users? Because of the their unwieldy look, or because of security concerns?

I ask because I have an app where previously I was accessing a particular data point via pk, and the user saw the pk in the url bar. But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk. I switched to uuid and now it's a little ugly in the url bar, but no data is exposed.

ww520 3725 days ago

I meant the key handed off to the user for human interaction. As a user I would prefer order number 7356 than 7458e3a9-716b-4352-b2e4-b5b67d0c089b.

If there's no human interaction with the key, UUID is perfectly fine.

mhd 3725 days ago

I've seen scenarios where UUIDs were used intentionally for this, to prevent users who get a full export of table data to reference the UUID. If it's just a sequence of small integers they might them to more permanent and intrinsically tied to the data, whereas UUIDs seem more like computing artifacts.

Yeah, I know, this could easily fixed by not friggin' showing surrogate keys to those users in the first place, but, well, data integration is ugly business.

irl_zebra 3725 days ago

Gotcha, just wanted to make sure I wasn't missing something. Makes total sense, thanks!

radagast 3725 days ago

> But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk.

Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity. The real solution is implementing real authentication & authorization - not making the key harder to guess.

naasking 3725 days ago

> Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity.

A common sentiment, but not true if your id is cryptographically random. It amounts to capability security which is the right approach to security if used comprehensively.

Scirra_Tom 3725 days ago

Came across an interesting case against using UUIDs the other day, in the UK it's a legal requirement that VAT invoices are sequential: http://www.hmrc.gov.uk/manuals/vatrecmanual/vatrec5020.htm

Using random UUIDs as an invoice number wouldn't cut it, you'd have to have a sequential ID in some form. You could of course have both but that's adding unnecessary complexity.

Another example, if you are designing a website and lets say give each blog post a UUID, a URL of:

website.com/blog/my-post-34

Looks better than

website.com/blog/my-post-d173affb-6bb8-4435-bef1-4e29409dff4c

Unique sequential Id's also give some clues about the relative age of rows (eg ID 100 is older than ID 50). This doesn't come up often but can sometimes be useful.

Regardless, I always find using a unique int ID for pretty much every record makes your job 10x easier and I fail to see why you would ever want to do anything different. If you want to 'hide' the ID in special cases (eg order numbers) better to generate a unique random int ID.

xrstf 3725 days ago

Sequential order IDs are also mandatory in Germany. At least I remember that we had to make sure in one project that the order IDs had no gaps and were sequential.

clavalle 3725 days ago

There is no way to enforce data integrity between related tables by virtue of the data model for one thing.

gcr 3725 days ago

I don't get this argument. Foreign key constraints are part of the schema.

What should the primary key be if not a UUID?

clavalle 3725 days ago

A natural key. It is not always possible, but when it is it should be used.

cortesoft 3725 days ago

an integer id?

billybugmaster 3725 days ago

isn't that what the fk are for? food_uuid for example. I concede I wouldn't call it uuid but I don't see a problem keeping the data straight.

jtmarmon 3725 days ago

?? foreign keys?

kawera 3725 days ago

Some good answers here: http://stackoverflow.com/questions/45399/advantages-and-disa...