Using this as a PK in any relational database will mean the rows are stored in the clustered index order, which causes extreme fragmentation because the db engine constantly has to find "holes" in the data pages to insert more recent records instead of adding those at the bottom of the table as would happen with any continuously increasing key.
This is a non-issue. Most db engines just append new data rows at the end of the data table and assign an internal row ID for it. The keys (UUID in this case) are stored in the separate index pages using B+ tree, which searches random key (UUID) or sequentially incremented key equally well.
UUID key is a problem only if your main query is a range query on the PK of a clustered indexed table. If your main query is a range query, you should pick something that can be sorted in the range anyway.
A lot of times when a UUID is used, it's not generated in the db, such as when you need to generate guaranteed unique IDs client side. Also 4 bytes vs 16 is not "equally well" in terms of performance. Note that the the diagram linked at the root of this thread is encoding uuids as strings which means 32-36 bytes instead of 16.
Actually only postgresql does it this way (by storing data in the heap and not in the primary index). Mysql(innodb),mssql,oracle uses the store the row in the primary-key.
Only Innodb does it by default. The others use heap table by default.
MSSQL allows clustered indexed table as an option to order the physical storage of rows. Oracle has index-organized table as an option.
Edit: They don't use clustered indexed table by default because record insertion is very expensive since clustered index forces the table to store the records contiguously in the index's order. Also Innodb is not truely clustered indexed. It only stores records contiguously for one page at the B+ Tree leaf level. Records in different pages are scattered all over even if the index values are sequential.
Not the case with MS SQL Server for example (and there are probably several others). Physical row ordering (at least up to 2008 edition as far as I can recall) coincides with clustered index.
That would be true except for that fact that database designers know this, and offer a way to generate sequential GUID's; so it's actually not a problem for that reason. The only real downside is performance, an int key performs better.
A number of scenarios that require UUIDs as keys, involve generation client side where such guarantees may not be available. In my view, UUIDs should not be used unless there is a strong reason for it (such as making keys not guessable). Most people will implement them incorrectly/inefficiently.
Yeah, I just said this in another comment, but SSE is my weak point; I stopped writing assembly and dealing with this stuff before SSE was even introduced.
That looks basically like what you need, though, yeah. Cool.
This perfectly summarizes the bitwise-innumeracy of the argument against UUIDs. Even without 128 bit word sizes, most UUIDs are going to be non-matches in their lower bits, assuming a random distribution. There's no computational efficiency gained in the vastly wide critical path here.
I hope you mean bytes. On a 128bit architecture that would be a single instruction compare, just like a 32 or 64 bit compare. Performance would be roughly the same (minus maybe a cache miss because you blew it out with your fat ints).
Every sequential GUID generation feature provides guarantees only per session. In other words, when you bring your database down to upgrade or patch it and then bring it back up, your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.
> your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.
I don't believe that's true. You're right that they start the session over, but it'll still be a fresh sequence that doesn't overlap anything generated in the past.
So you're saying it's specifically less efficient for a write-heavy tables, correct?
I generally prefer auto-incremented integers, but UUIDs are very useful for client-generated records (like for an app/website that's built for offline usage).
It's also bigger than necessary when a sequence will do. Large PKs mean larger indexes. For very large databases or very high volume every little inefficiency adds up.
As I always say, if we don't split hairs now, we'll be splitting heads later.
I don't understand this... How does the sort order in the index have anything to do with the arrangement of the table? If you build an index over the UUIDs, it's just going to refer to rows in the table by their internal offsets.
>How does the sort order in the index have anything to do with the arrangement of the table?
If you have a clustered index based upon that column (and in systems that support this it is common for the primary key to be the clustered index for the table) then the physical layout of the data will follow this column.
Having a sort order is immensely preferable when you're joining multiple sequential rows at once. The optimiser can create a pseudo-partition for your join, and constrain its reads to that. Especially on spinning rust (or I/O constrained systems), you're also reading multiple sequential pages in a stream off disk, which gives a big speed boost compared to random reads.
Without that ordering, the index is 'dumb', and has to find all your rows one by one in its structure. Admittedly it's not as much of an issue for SSDs, but it does still impact performance.
First there is the question of whether one should have surrogate keys at all, some oppose surrogate keys because:
They make tables more difficult to reason about.
They encourage structures that involve large amounts of joins.
Others embrace surrogate keys because:
They make joins between tables easier (usually one field rather than a compound key where one might accidentally not include all the join predicates).
They save space since an integer is almost always smaller than a compound key.
So if you are in the camp that favors surrogate keys, then consider these points:
Some quick downsides for UUIDs:
If your table has a clustered index, UUIDs are generally not created in a regular order, and so you'll be constantly inserting into a part of your index rather than the end. Some databases allow sequential UUID generation to help mitigate this.
UUIDs are usually 16 bytes, but an int is usually 1-8 bytes. Also keep in mind you pay this price multiple times since the primary key (generally) becomes the foreign key in other tables.
Some upsides for UUIDs:
If you have a process that has no master, being able to tie disparate things together with a shared UUID is very useful sometimes.
Often developers use UUIDs/GUIDs because:
Their app wants to perform a 'Create' operation and they want to generate the key in advance.
This can usually be mitigated by creating a routine that generates a range of reserved surrogate integer keys, and gives a starting value to the application layer. Then the application layer can use those values with knowledge they won't already exist. For instance, the app layer say, "I'm going to make 100 customers", and the response comes back "10232". Meaning that 10132-10232 is reserved for that process to use for customer primary keys. This allows larger (chunkier) requests which can be interrelated without constantly making requests to the Database layer. One could extent this to be a true cache for entity reserved numbers that gets dolled out as needed by the app layer.
UUID as PK is perfectly fine; however, as with any design decision it really depends on your needs and weighing alternatives.
Some popular ways to do PK: natural key, sequentially generated key, and UUID. Personally I would prefer natural key if I can find an immutable natural key for the table. However, natural keys are hard to find. The food example in this case doesn't have natural key. Also if the natural key requires a compound key, it's just not worth the pain.
Sequential auto generated key is good when you need to hand it off to users, like order id or ticket number. It's short and simple and it's auto generated. The downsize is when you migrate databases, you need to seed the new database carefully or it would start from beginning again. Also in record creation, you need to do extra read to get back the newly generated key.
UUID is mostly worry free. It can be generated anywhere and doesn't need to be in the database. For a setup with distributed databases, I would use UUID just to have global uniqueness. For an offline app, I would use UUID to create the data records locally and later sync them with the main database. UUID is good when it's used internally and not exposed to the users.
Sequential IDs can also leak business information when exposed to customers such as how many orders your taking - an interested party can place an order at 10am and another at 11am. Once you've done that you can compare the IDs and know how many orders were taken that hour.
Why should UUIDs not be exposed to users? Because of the their unwieldy look, or because of security concerns?
I ask because I have an app where previously I was accessing a particular data point via pk, and the user saw the pk in the url bar. But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk. I switched to uuid and now it's a little ugly in the url bar, but no data is exposed.
I've seen scenarios where UUIDs were used intentionally for this, to prevent users who get a full export of table data to reference the UUID. If it's just a sequence of small integers they might them to more permanent and intrinsically tied to the data, whereas UUIDs seem more like computing artifacts.
Yeah, I know, this could easily fixed by not friggin' showing surrogate keys to those users in the first place, but, well, data integration is ugly business.
> But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk.
Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity. The real solution is implementing real authentication & authorization - not making the key harder to guess.
> Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity.
A common sentiment, but not true if your id is cryptographically random. It amounts to capability security which is the right approach to security if used comprehensively.
Using random UUIDs as an invoice number wouldn't cut it, you'd have to have a sequential ID in some form. You could of course have both but that's adding unnecessary complexity.
Another example, if you are designing a website and lets say give each blog post a UUID, a URL of:
Unique sequential Id's also give some clues about the relative age of rows (eg ID 100 is older than ID 50). This doesn't come up often but can sometimes be useful.
Regardless, I always find using a unique int ID for pretty much every record makes your job 10x easier and I fail to see why you would ever want to do anything different. If you want to 'hide' the ID in special cases (eg order numbers) better to generate a unique random int ID.
Sequential order IDs are also mandatory in Germany. At least I remember that we had to make sure in one project that the order IDs had no gaps and were sequential.
Example: t0: 7458e3a9-716b-4352-b2e4-b5b67d0c089b t1: 4d8d753c-1777-439d-8725-b093b1bd8430
Using this as a PK in any relational database will mean the rows are stored in the clustered index order, which causes extreme fragmentation because the db engine constantly has to find "holes" in the data pages to insert more recent records instead of adding those at the bottom of the table as would happen with any continuously increasing key.