Hacker News new | ask | show | jobs
by klodolph 3725 days ago
I just clicked on the second "featured" pattern and found this hot mess:

http://dbpatterns.com/documents/5091f74289cbad03bc958bc0/

It has the "let's put a UUID on every row" disease common to designers who have never really learned anything other than object oriented design. Price is a string (I guess so you can put "market price" on the fish?), and there's a currency symbol on every "delivery". The whole thing just makes no sense. 18 stars. Unbelievable.

6 comments

I'm daft. What's wrong with using uuids as primary keys everywhere? I'd appreciate it if you could elaborate :)
The problem is that most uuids are generated in a way that, when sorted alphabetically as strings, would have a random order.

Example: t0: 7458e3a9-716b-4352-b2e4-b5b67d0c089b t1: 4d8d753c-1777-439d-8725-b093b1bd8430

Using this as a PK in any relational database will mean the rows are stored in the clustered index order, which causes extreme fragmentation because the db engine constantly has to find "holes" in the data pages to insert more recent records instead of adding those at the bottom of the table as would happen with any continuously increasing key.

This is a non-issue. Most db engines just append new data rows at the end of the data table and assign an internal row ID for it. The keys (UUID in this case) are stored in the separate index pages using B+ tree, which searches random key (UUID) or sequentially incremented key equally well.

UUID key is a problem only if your main query is a range query on the PK of a clustered indexed table. If your main query is a range query, you should pick something that can be sorted in the range anyway.

A lot of times when a UUID is used, it's not generated in the db, such as when you need to generate guaranteed unique IDs client side. Also 4 bytes vs 16 is not "equally well" in terms of performance. Note that the the diagram linked at the root of this thread is encoding uuids as strings which means 32-36 bytes instead of 16.
Actually only postgresql does it this way (by storing data in the heap and not in the primary index). Mysql(innodb),mssql,oracle uses the store the row in the primary-key.

Edit: I'm ~wrong, see below.

Only Innodb does it by default. The others use heap table by default.

MSSQL allows clustered indexed table as an option to order the physical storage of rows. Oracle has index-organized table as an option.

Edit: They don't use clustered indexed table by default because record insertion is very expensive since clustered index forces the table to store the records contiguously in the index's order. Also Innodb is not truely clustered indexed. It only stores records contiguously for one page at the B+ Tree leaf level. Records in different pages are scattered all over even if the index values are sequential.

It's pretty variable, but Postgres definitely is not the only one that does it that way (or even close).
Not the case with MS SQL Server for example (and there are probably several others). Physical row ordering (at least up to 2008 edition as far as I can recall) coincides with clustered index.
I don't believe that the Primary Key affects the physical ordering of records in PostgreSQL.

Primary Key is implemented as a unique index coupled with a not null constraint. It's essentially syntactic sugar.

To elaborate, even if you do call CLUSTER[1] to enforce a physical ordering, the primary key is not the default (there is no default).

1. http://www.postgresql.org/docs/9.5/static/sql-cluster.html

That would be true except for that fact that database designers know this, and offer a way to generate sequential GUID's; so it's actually not a problem for that reason. The only real downside is performance, an int key performs better.
A number of scenarios that require UUIDs as keys, involve generation client side where such guarantees may not be available. In my view, UUIDs should not be used unless there is a strong reason for it (such as making keys not guessable). Most people will implement them incorrectly/inefficiently.
> an int key performs better.

A UUID is a 128-bit integer. That people do not store, generate, or interact with them that way is the bug.

I'd kill for a 128-bit architecture so a UUID compare would be a single instruction, and it's too bad consensus is that we don't really need it.

> I'd kill for a 128-bit architecture so a UUID compare would be a single instruction

Well, there's PCMPESTRI on Intel.

http://www.felixcloutier.com/x86/PCMPESTRI.html

Yeah, I just said this in another comment, but SSE is my weak point; I stopped writing assembly and dealing with this stuff before SSE was even introduced.

That looks basically like what you need, though, yeah. Cool.

An int is either 4 or 8 bytes, it's still going to perform better no matter how you treat the GUID.

edit: bytes

> An int is either 4 or 8 bits

This perfectly summarizes the bitwise-innumeracy of the argument against UUIDs. Even without 128 bit word sizes, most UUIDs are going to be non-matches in their lower bits, assuming a random distribution. There's no computational efficiency gained in the vastly wide critical path here.

I hope you mean bytes. On a 128bit architecture that would be a single instruction compare, just like a 32 or 64 bit compare. Performance would be roughly the same (minus maybe a cache miss because you blew it out with your fat ints).
Every sequential GUID generation feature provides guarantees only per session. In other words, when you bring your database down to upgrade or patch it and then bring it back up, your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.
> your GUID sequence starts over again at some arbitrary value which is likely to overlap with your existing GUIDs.

I don't believe that's true. You're right that they start the session over, but it'll still be a fresh sequence that doesn't overlap anything generated in the past.

So you're saying it's specifically less efficient for a write-heavy tables, correct?

I generally prefer auto-incremented integers, but UUIDs are very useful for client-generated records (like for an app/website that's built for offline usage).

It's also bigger than necessary when a sequence will do. Large PKs mean larger indexes. For very large databases or very high volume every little inefficiency adds up.

As I always say, if we don't split hairs now, we'll be splitting heads later.

I don't understand this... How does the sort order in the index have anything to do with the arrangement of the table? If you build an index over the UUIDs, it's just going to refer to rows in the table by their internal offsets.
>How does the sort order in the index have anything to do with the arrangement of the table?

If you have a clustered index based upon that column (and in systems that support this it is common for the primary key to be the clustered index for the table) then the physical layout of the data will follow this column.

See http://use-the-index-luke.com/blog/2014-01/unreasonable-defa... as a good run down of the whys and wherefores of clustered indexes.

Having a sort order is immensely preferable when you're joining multiple sequential rows at once. The optimiser can create a pseudo-partition for your join, and constrain its reads to that. Especially on spinning rust (or I/O constrained systems), you're also reading multiple sequential pages in a stream off disk, which gives a big speed boost compared to random reads.

Without that ordering, the index is 'dumb', and has to find all your rows one by one in its structure. Admittedly it's not as much of an issue for SSDs, but it does still impact performance.

First there is the question of whether one should have surrogate keys at all, some oppose surrogate keys because: They make tables more difficult to reason about. They encourage structures that involve large amounts of joins.

Others embrace surrogate keys because: They make joins between tables easier (usually one field rather than a compound key where one might accidentally not include all the join predicates). They save space since an integer is almost always smaller than a compound key.

So if you are in the camp that favors surrogate keys, then consider these points:

Some quick downsides for UUIDs: If your table has a clustered index, UUIDs are generally not created in a regular order, and so you'll be constantly inserting into a part of your index rather than the end. Some databases allow sequential UUID generation to help mitigate this. UUIDs are usually 16 bytes, but an int is usually 1-8 bytes. Also keep in mind you pay this price multiple times since the primary key (generally) becomes the foreign key in other tables.

Some upsides for UUIDs: If you have a process that has no master, being able to tie disparate things together with a shared UUID is very useful sometimes.

Often developers use UUIDs/GUIDs because: Their app wants to perform a 'Create' operation and they want to generate the key in advance.

This can usually be mitigated by creating a routine that generates a range of reserved surrogate integer keys, and gives a starting value to the application layer. Then the application layer can use those values with knowledge they won't already exist. For instance, the app layer say, "I'm going to make 100 customers", and the response comes back "10232". Meaning that 10132-10232 is reserved for that process to use for customer primary keys. This allows larger (chunkier) requests which can be interrelated without constantly making requests to the Database layer. One could extent this to be a true cache for entity reserved numbers that gets dolled out as needed by the app layer.

UUID as PK is perfectly fine; however, as with any design decision it really depends on your needs and weighing alternatives.

Some popular ways to do PK: natural key, sequentially generated key, and UUID. Personally I would prefer natural key if I can find an immutable natural key for the table. However, natural keys are hard to find. The food example in this case doesn't have natural key. Also if the natural key requires a compound key, it's just not worth the pain.

Sequential auto generated key is good when you need to hand it off to users, like order id or ticket number. It's short and simple and it's auto generated. The downsize is when you migrate databases, you need to seed the new database carefully or it would start from beginning again. Also in record creation, you need to do extra read to get back the newly generated key.

UUID is mostly worry free. It can be generated anywhere and doesn't need to be in the database. For a setup with distributed databases, I would use UUID just to have global uniqueness. For an offline app, I would use UUID to create the data records locally and later sync them with the main database. UUID is good when it's used internally and not exposed to the users.

Sequential IDs can also leak business information when exposed to customers such as how many orders your taking - an interested party can place an order at 10am and another at 11am. Once you've done that you can compare the IDs and know how many orders were taken that hour.
Why should UUIDs not be exposed to users? Because of the their unwieldy look, or because of security concerns?

I ask because I have an app where previously I was accessing a particular data point via pk, and the user saw the pk in the url bar. But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk. I switched to uuid and now it's a little ugly in the url bar, but no data is exposed.

I meant the key handed off to the user for human interaction. As a user I would prefer order number 7356 than 7458e3a9-716b-4352-b2e4-b5b67d0c089b.

If there's no human interaction with the key, UUID is perfectly fine.

I've seen scenarios where UUIDs were used intentionally for this, to prevent users who get a full export of table data to reference the UUID. If it's just a sequence of small integers they might them to more permanent and intrinsically tied to the data, whereas UUIDs seem more like computing artifacts.

Yeah, I know, this could easily fixed by not friggin' showing surrogate keys to those users in the first place, but, well, data integration is ugly business.

Gotcha, just wanted to make sure I wasn't missing something. Makes total sense, thanks!
> But it could expose user/site data as any user could access that pk, or just guess at the next sequentially generated pk.

Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity. The real solution is implementing real authentication & authorization - not making the key harder to guess.

> Using a UUID instead of a sequentially rolling integer ID isn't solving your problem, you're just doing security through obscurity.

A common sentiment, but not true if your id is cryptographically random. It amounts to capability security which is the right approach to security if used comprehensively.

Came across an interesting case against using UUIDs the other day, in the UK it's a legal requirement that VAT invoices are sequential: http://www.hmrc.gov.uk/manuals/vatrecmanual/vatrec5020.htm

Using random UUIDs as an invoice number wouldn't cut it, you'd have to have a sequential ID in some form. You could of course have both but that's adding unnecessary complexity.

Another example, if you are designing a website and lets say give each blog post a UUID, a URL of:

website.com/blog/my-post-34

Looks better than

website.com/blog/my-post-d173affb-6bb8-4435-bef1-4e29409dff4c

Unique sequential Id's also give some clues about the relative age of rows (eg ID 100 is older than ID 50). This doesn't come up often but can sometimes be useful.

Regardless, I always find using a unique int ID for pretty much every record makes your job 10x easier and I fail to see why you would ever want to do anything different. If you want to 'hide' the ID in special cases (eg order numbers) better to generate a unique random int ID.

Sequential order IDs are also mandatory in Germany. At least I remember that we had to make sure in one project that the order IDs had no gaps and were sequential.
There is no way to enforce data integrity between related tables by virtue of the data model for one thing.
I don't get this argument. Foreign key constraints are part of the schema.

What should the primary key be if not a UUID?

A natural key. It is not always possible, but when it is it should be used.
an integer id?
isn't that what the fk are for? food_uuid for example. I concede I wouldn't call it uuid but I don't see a problem keeping the data straight.
?? foreign keys?
But where do you even put the price on a food item to start with?

The only price is on the 'delivery' table, which seems to imply that the price is only set when someone orders something. I guess this is a delivery-only place? But because the 'delivery' table can only have one food_uuid, you'd have to put in for a bunch of deliveries to order more than one thing.

Maybe the 'price' on the menu is in that 'Additional Info' table?

    SELECT A.NAME, A.DESCRIPTION, B.VALUE FROM FOOD A LEFT OUTER JOIN "Additional Info" B ON (A.UUID = B.UUID AND B.TYPE = 'Price');
There's so much wrong here...
I don't disagree, but there is a case to be made for using UUID over sequential integer ids, and that is for horizontal scaling. When inserting records with UUID primary keys, you don't have to have your db cluster synchronized around which sequential ID should be next. There are other ways around this problem, of course, but using UUIDs sidesteps the issue entirely.
Because a restaurant menu needs horizontal scaling...
Any data might need horizontal scaling regardless of the type of content. It all comes down to how much data and how often it is accessed. So, sure, if you're building the website for a single restaurant, don't use UUIDs, because no matter how popular, it will never get that much traffic.
Yeah, i found that that too, spent 10 mins poking at it:

http://dbpatterns.com/documents/5707794a1514b4252236dcbe/

better, but not finished...

And only one comment that says "thhx" haha

Like any open advice community like this, its important there are checks in place to not perpetuate bad ideas...

Did it cross your mind that there is a whole host of reasons why it may have 18 stars that have nothing to do with the quality of the pattern?