Hacker News new | ask | show | jobs
by dnewcome 5277 days ago
I've got to push back on the anti-guid stance for surrogate keys. They may be big but it is nice to have opaque IDs so that no one makes flawed assumptions about ordering or relative insertion time/trying to guess valid IDs, etc. Opaque is good...
3 comments

The size of guids isn't the only challenge. (The size is a little trickier than it looks, too - remember that the guid will be included in every index so the database can find its way back to the clustered index.)

The other problem with guids is that if you cluster on a guid, then whenever you do inserts, you're inserting the data into random spots in the table. You end up fragmenting the bejeezus out of the table, doing page splits like crazy. If you defrag the table/indexes, you'll be right back where you started within a few days of doing inserts.

That penalty isn't immediately obvious in small environments, but by the time you're big enough to have performance problems and you call in a consultant, it's going to be an ugly discussion with management. "Hey, wish I could help you, but..." Don't get me wrong, that's not always the biggest bottleneck, but it makes for some awkward discussions.

GUIDs are HUGE and have a significant impact on performance. Go read Kimberly Tripp's blog post referenced in the article. She does the math for you.

In almost all the cases that I see GUIDs, they were totally unnecessary for the design. Even the developer who designed them could not give a reason why they needed to be GUIDs. A row unique across the entire universe? Really?

I'm not saying there are no cases...just that in most cases they negatively impact performance with little business or logic gain for that price. All design decisions come down to cost, benefit and risk.

How do modern distributed OLTP systems deal with generating unique sequence numbers? Back in the day, this was a big problem, and I always thought that GUIDs would someday be a solution (though at the time, any string was too big/slow to be a primary key). Having one key-issuing server was a SPF; sharding the key ranges by server made it difficult to add new servers.
You can have the main key issuing server issue blocks to distributed key issuing servers. The keys are not necessarily increasing with respect to row creation time nor are they gapless but it's good enough for many purposes. It's also still a SPoF but not a single point of contention bad enough to affect performance significantly.
Interesting.. how do you deal with ID collisions after rollover? That was our big problem, even with 8-byte IDs - the ID allocator had to know the Global Truth. Dealing with collisions as "exception, try the next one" was too expensive, though with faster hardware that may be moot.
Yes, rollover is an interesting problem. With the app and scale that we're operating at that will not happen within the next few thousand centuries though. I'm actually describing how I might go about doing it, and what I /think/ Oracle RAC does.

I'm not actually certain as to the implementation.

This was my sentiment exactly. Any idea why he's so keen on sequential keys? They're mentioned as desirable several times in the article.
The issue, particularly with innodb on MySQL is that the row data is stored with the primary key. By using GUID's and not sequential ids, you end up having to rearrange the entire dataset to insert the row data at the appropiate place, to keep the index in order, instead of appending it, killing your write performance.