Hacker News new | ask | show | jobs
by muxxa 3666 days ago
Am I the only one who thinks that primary keys should be derived from the actual data? That way it's impossible for two processes to accidentally create the same conceptual piece of data (which is still possible with uuids). It also makes it much easier to recover from situations when you have to quickly promote a slave to a master role without first verifying that the slave is up to date. The main bonus though is that the database is much more comprehensible, e.g. foreign keys are legible without having to join back to the primary table. The relational model for data is pretty cool and breaks down when e.g. UNION doesn't work if 2 rows differ only in an arbitrary integer primary key.
11 comments

The problem is there's very few examples of real-world data that actually matches a primary key - i.e. is guaranteed to be unique and never changes. I've been burned by this so many times. In reality, everything needs to at least have some capability to change.
Unique and never changes can be two things concatenated, not one thing.

If that concatenation is too long you can ram anything thru a hash to get a constant length smoothly distributed key. Its also fun to use "weak" hashes for this because it trolls wanna be security types who don't understand the application.

Given the above, some important things concatenated and hashed works. Can always add an application ID or importer ID or process ID or a timestamp of high enough resolution.

Obviously needs are different if you're trying to create a bank user database vs deduplicating sampled engineering data.

So something that's unique is NOW() (assuming low enough sample rate LOL) and something that never changes is a laser serial number, concatenate those and ram thru a hash to make it small and fit.

Concatenating helps you with uniqueness, but not with whether things change (if anything that's part of a composite key changes, it can no longer reliably be part of a key). And it's the changing part that's not realistic with natural keys.

Once you get into application ids / timestamps / whatever you're no longer really using a natural key in my mind, you're just making your own algorithm for a surrogate key.

> So something that's unique is NOW() (assuming low enough sample rate LOL) and something that never changes is a laser serial number, concatenate those and ram thru a hash to make it small and fit.

That sounds very similar to UUID version 1 except that you replaced MAC address with some other identifier and added a hashing step.

> Unique and never changes can be two things concatenated,

No, for a good primary key, the "unique" parts can be due to concatenation, but the whole key (and thus all the elements) needs to be unchanging. [0]

[0] Modern databases can actually deal with changeable PKs, though with a potentially serious performance hit, but in many use cases where data travels outside the database for some kind of interaction where the results need to get reentered into the database, this is still a problem, since you can't (in any general way) cascade updates to things (which may not even be online systems, e.g., paper records feeding human processes) outside of the DB.

This is a lot of thinking and work to replace an auto-incremented integer, which is simple and works fine in most cases, though.
> So something that's unique is NOW() (assuming low enough sample rate LOL)

We can all invent algorithms for nice unique values when we include the caveat "apart from the edge cases". It's those edge cases that bugger everything up.

"Hash of the data defining the entity identified" is how version 3 and 5 UUIDs work (plus a namespace). If you really want a weak hash use version 3 - it uses MD5.
The problem is usually with the assumptions about natural fields being unique, or not change over time. One day you realize they can actually change, or they are not unique. Changing the data model at that time to support a different primary key would be very difficult.
> Am I the only one who thinks that primary keys should be derived from the actual data?

Where a valid natural key exists, it absolutely should be used. Surrogate keys should only be used where there isn't a natural attribute (or composite of such attributes) that corresponds to the unique identity of a tracked entity.

However, that's very common in the real world.

Okay so you run a school and want to make a students database. What do you use a a PK?

1) Social security #? Fails when you have international student.

2) Last Name, First Name, Middle Name? Fails when you have a repeat name.

3) Last Name, First Name, Middle Name, Home Town, Start Year? I guess this works for most of the time. Now you need to join to this table from the classes table. So now you need all 5 keys duplicated in the classes table to do the join.

Simple Integer primary keys "suck" in that you are adding bogus data to your database that has no value.. but man, they sure solve a LOT of problems. You need to think fairly hard before you get rid of them. In some cases you totally can. But making your default datamodel include an Integer PK solves a LOT of problems.

I have actually done this for a classroom management app I made mostly for fun when I taught high school. At first I wanted to use the district issued ID number as the primary key for each student. Except some students came and started classes and needed attendance records before the district got around to issuing them their official numbers. So I fell back to auto-increment int as the primary key, with their "district_id" nullable, default null, but unique constrained as just another field equivalent to their first_name.
Also, names can change!
The problem is that a lot of people end up screwing that up and picking something that doesn't work well as a primary key. Also, just using an integer is often a great way to get started without thinking about things so much. That works for a lot of web apps.
Version 3 and 5 UUIDs are derived from the data. They can be a good fit some places, but you have to be sure the only (pre-hash) collisions you'll have are the collisions you want.
Then you'll end up with a lot of inconvenient compound keys.
That's called a natural key and I only use it for logins in user tables. I haven't found anything else that doesn't break down.
Isn't the timestamp pretty much the most likely thing to be unique in the data, generally?
Including timestamp may be appropriate for log data, where otherwise identical records differing by time should be considered different entities. Otherwise it loses the property asked for upthread that our primary key detects/eliminates duplication.
A natural primary key is best, yes, if you can find it.
That's what unique keys are for usually.