Choosing the best database primary key – INT and UUIDs alternatives

Y	Hacker News new \| ask \| show \| jobs

	Choosing the best database primary key – INT and UUIDs alternatives (adileo.github.io)
	4 points by adileo 1328 days ago

2 comments

adileo 1328 days ago

While working on a big data & distributed system, I started questioning myself on which is the best database primary key to be used in terms of performance, security and, functionality. Despite some articles on Stackoverflow and a couple of blog post around the web highlighting the pros of some identifiers against others, I didn't found a complete overall picture. So I decided to create this comparison table.

My next step is to understand the best way to not expose the primary key to the world, maybe an additional column with a public random id? Maybe encoding/decoding the PK with some cryptographycally secure algorithm in the REST API layer? Or better a key-value mapping cache layer between the public Id and the internal Primary Key?

Every insight is welcomed.

link

gregjor 1328 days ago

I think the author confuses primary keys with surrogate keys. The article compares various surrogate keys in technical terms, but doesn't address "Choosing the best database primary key." A natural key usually makes the best primary in terms of database design, usability, and longevity. If you can't identify a natural key that indicates failure to normalize or even the wrong model of the data.

Optimizing for storage, "performance," or "exposing the primary key to the world" (why?) may matter in some cases, but at the expense of not having an actual relational model, just a pile of data wedged into a relational database management system.

link

adileo 1328 days ago

Author here, you're correct, the title is misleading and assumes a match between surrogate and primary keys in a simplistic way, especially for when you can't make use of a natural key. (most of the cases in the domain I work in)

If you can't identify a natural key that indicates failure to normalize or even the wrong model of the data.

I'm not sure about this point. As a matter of fact a lot of entities in the real world do not have a natural key, or if they have it, it could be subject to changes. How do you identify a user or an anonymous post on a forum?

Maybe for an user you could just use the email. Then, what if the user want to change their email? Or what if one day you want to allow users to sign-up with just their social login?

Surrogate keys are always a good level of indirection to keep a consistent identity, and especially useful as foreign key and hence also as a primary key. Of course with some exceptions (eg. intermediate tables, where the PK is spanning across multiple columns)

But I'm curious to know also other scenarios since my view is skewed by the business domain I work in.

link

gregjor 1328 days ago

Primary key seems like a synonym for surrogate key for a few reasons. I think ORMs almost always assume surrogate key-based schemas, for example. It seems that database design has defaulted to surrogate keys over time. My complaints about the article came from my interpretation of primary key equals a surrogate key, I apologize for reading more into it than intended.

Primary keys must satisfy a couple of constraints: uniqueness and not NULL. A primary key can change, and all modern RDBMSs can cascade primary key changes to foreign keys. But using a natural key like an email address does open up the possibility of duplicates, whereas surrogate keys never have that problem. With a schema that uses (say) email as a primary key you have decide between preventing adding a new user because of a duplicate key, or allowing the duplicate email address in the database and dealing with it some other way.

I try to use natural keys but as you point out that can get impractical and seem quixotic when surrogate keys solve real-world problems. I don't default to surrogate keys (and I don't use ORMs) but I often fall back on surrogate keys. And I've had to refactor databases away from natural keys to surrogate keys because the original key caused more problems than it solved.

As for making primary keys publicly visible, that comes up in web applications that include IDs in URLs. That information -- that I have user ID 2345 -- lets hackers try scanning the space of IDs to see what they get. UUIDs or some other non-sequential value prevents that to some degree, at the cost of dealing with long keys creating large indexes and more cache thrashing in the RDBMS. Which works best depends on the application and what other measures you can take to prevent hackers digging through your database with HTTP requests.

link

adileo 1327 days ago

Very interesting insight, as you say, probably ORMs instilled this way of thinking, while the primary key is not just constrained to that function I perfectly agree with that.

* at the cost of dealing with long keys creating large indexes and more cache thrashing in the RDBMS. *

This is actually one of the point I'm currently dealing with. Especially finding the right trade-off to manage records in the order of hundred millions/billions, the right decision could save GB of space both on disk and RAM and seconds in computation during each JOIN operation for example.

That's why I drafted out a comparison table, sometimes I see people start using UUIDs right away just because they are offered as out of the box in some ORMs or just because it souds "cool", without knowing that there are better or simplier alternatives out there.

link

gregjor 1327 days ago

I've run into the same problems. You compiled good useful information, I will bookmark. If you had named it "Choosing the best surrogate key format" I wouldn't have complained at all!

I have a customer with a fairly large database (university students). URLs on the site often have primary keys in them. They use autoincrement integers for the primary keys. Most operations require the user to log in first, so the system can check access to rows based on user permissions, so even if someone tries to change the key in the URL they won't get anything. For the handful of public-facing URLs that have a key in them we set up tokens (UUIDs) that map through an intermediate table to the real key. That way we get the benefits of simple incrementing integer keys for authorized users and internal operations, and hide those keys when presented to the outside world. Very much what you describe. We could not identify good natural keys for the main tables. Some tables have natural keys, i.e. the country table uses ISO country codes. The ordering/pagination issue you mention in your article falls outside the realm of proper relational database design but comes up a lot in real applications.

link