| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mkleczek 189 days ago

> The UUID would be an example of an external key (for e.g. preventing crawling keys being easy). This article mentions a few reasons why you may later decide there are better external keys.

So we are talking about "external" keys (ie. visible outside the database). We are back to square one: externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).

It does not matter if they are random or not.

> How are you going to trace all those records if the requester has changed their name, phone number and email since they signed up if you don't have a surrogate key?

And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.

If you keep information about the time information was captured, you can at least ask me "what was your phone number last time we've interacted and when was it?"

> I think that spirals into way more complexity than you're thinking.

This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.

DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.

Event sourcing is a somewhat convoluted way to attack this problem as well.

> Those queries are incredibly simple with surrogate keys: "SELECT * FROM phone_number_changes WHERE user_id = blah".

Sure, but those queries are useless if you just don't know user_id.

3 comments

dpark 189 days ago

> externally visible surrogate keys are problematic because they are detached from real world information they are supposed to identify and hence don't really identify anything (see my example about GDPR).

All IDs are detached from the real world. That’s the core premise of an ID. It’s a bit of information that is unique to someone or something, but it is not that person or thing.

Your phone number is a random number that the phone company points to your phone. Your house has a street name and number that someone decided to assign to it. Your email is an arbitrary label that is used to route mail to some server. Your social security number is some arbitrary id the government assigned you. Even your name is an arbitrary label that your parents assigned to you.

Fundamentally your notion that there is some “real world” identifier is not true. No identifiers are real. They are all abstractions and the question is not whether the “real” identifier is better than a “fake” one, but whether an existing identifier is better than one you create for your system.

I would argue that in most cases, creating your own ID is going to save you headaches in the long term. If you bake SSN or Email or Phone Number throughout your system, you will make it a pain for yourself when inevitably someone needs to change their ID and you have cascading updates needed throughout your entire system.

link

halffullbrain 189 days ago

In my country, citizens have an "ID" (a UUID, which most people don't know the value of!) and a social security number which they know - which has all the problems described above). While the social security number may indeed change (doubly assigned numbers, gender reassignment, etc.), the ID needn't change, since it's the same physical person.

Public sector it-systems may use the ID and rely on it not changing.

Private sector it-systems can't look up people by their ID, but only use the social security number for comparisons and lookups, e.g. for wiping records in GDPR "right to be forgotten"-situations. Social security numbers are sortof-useful for that purpose because they are printed on passports, driver's licenses and the like. And they are a problem w.r.t. identity theft, and shouldn't ever be used as an authenticator (we have better methods for that). The person ID isn't useful for identity theft, since it's only used between authorized contexts (disregarding Byzantine scenarios with rogue public-sector actors!). You can't social engineer your way to personal data using that ID unless (safe a few movie-plot scenarios).

So what is internal in this case? The person id is indeed internal to the public sector's it-systems, and useful for tracking information between agencies. They're not useful for Bob or Alice. (They ARE useful for Eve, or other malicious inside actors, but that's a different story, which realistically does require a much higher level of digital maturity across the entire society)

link

everforward 188 days ago

> It does not matter if they are random or not.

Again, sometimes it does, the article lists a few of them. Making it harder to scrape, unifying across databases that share a keyspace, etc.

> And how does surrogate key help? I don't know the surrogate key that identifies my records in your database. Even if you use them internally it is an implementation detail.

That surrogate key is linked to literally every other record in the database I have for you. There are near infinite ways for me to convert something you know to that surrogate key. Give me a transaction ID, give me a phone number/email and the rough date you signed up, hell give me your IP address and I can probably work back to a user ID from auth logs.

The point isn't that you know the surrogate key, it's that _everything_ is linked to that surrogate key so if you can give me literally any info you know I can work back to the internal ID.

> This complexity is there whether you want it or not and you're not going to eliminate it with surrogate keys. It has to be explicitly taken care of.

Okay, then lets do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.

Mine is just something like:

SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;

Could be a single query, but you get the idea.

> DBMSes provide means to tackle this essential complexity: bi-temporal extensions, views, materialized views etc.

This kind of proves my point. If you need bi-temporal extensions and materialized views to tell a user what their email address is from a transaction ID, I cannot imagine the absolute mountain of SQL it takes to do something more complicated like calculating revenue per user.

link

mkleczek 187 days ago

I am not sure you are arguing against my claims or not :)

I am not arguing against surrogate keys in general. They are obviously very useful _internally_ to introduce a level of indirection. But if they are used _internally_ then it doesn't really matter if they are UUIDs or sequence numbers or whatever - it is just an implementation detail.

What I claim is that surrogate keys are problematic as _externally visible_ identifiers.

> Okay, then lets do an exercise here. A user gives you a transaction ID, and you have to tell them the date they signed up and the date you first billed them. I think yours is going to be way more complicated.

> Mine is just something like:

> SELECT user_id FROM transactions WHERE transaction_id=X; SELECT transaction_date FROM transactions WHERE user_id=Y ORDER BY transaction_date ASC LIMIT 1; SELECT signup_date FROM users WHERE user_id=Y;

I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?

link

dpark 187 days ago

> I think you are missing the actual problem I am talking about: where does the user take the transaction ID from? Do you expect the users to remember all transaction IDs your system ever generated for them? How would they know which transaction ID to ask about? Are they expected to keep some metadata that would allow them to identify transaction IDs? But if there is metadata that enables identification of transaction IDs then why not use it instead of transaction ID in the first place?

Your notion that you can avoid sharing internal ids is technically true, but that didn’t mean it’s a good idea. You’re trying force a philosophical viewpoint and disregarding practical concerns, many of which people have already pointed out.

But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers. These are indexes back into some system. Because the alternative is that your customer calls you up and says “so I bought this thing last week, maybe on Tuesday?” And it’s most likely possible to eventually find the transaction this way, but it’s a pain and usually requires human investigation to find the right transaction. It’s wasteful for you and the customer to do business this way if you don’t have to.

link

mkleczek 186 days ago

> Your notion that you can avoid sharing internal ids is technically true, but that didn’t mean it’s a good idea. You’re trying force a philosophical viewpoint and disregarding practical concerns, many of which people have already pointed out.

What some call "philosophical viewpoint" I call "essential complexity" :)

> But to answer your question, yes, your customer will probably have some notion of a transaction id. This is why everyone gives you invoice numbers or order numbers.

We are in agreement here: externally visible identifiers are needed for many reasons (mostly technical). The discussion is not about that though but about what information should be included in these identifiers.

> This is why everyone gives you invoice numbers or order numbers.

And there are good reasons why invoice or order numbers are not randomly generated strings but contain information about the invoices and orders they identify.

My claim is that externally visible identifiers should possess a few characteristics:

* should be based on the data they identify (not detached from it)

* should be easy to remember (and that means they should be as short as possible, they should be easy to construct by a human from the data itself - so they cannot be hashes of data)

* should be versioned (ie. they should contain information somehow identifying the actual algorithm used to construct them)

* should be easy to index by database engines (that is highly db implementation dependent unfortunately)

* can be meaningfully sortable (that is not strictly a requirement but nice to have)

Coming up with an identifier having these characteristics is not trivial but is going to pay off in the long run (ie. is essential complexity).

link

dpark 186 days ago

Much of this is not essential complexity, but accidental complexity.

* Based on the data they identify - This is a minefield of accidental complexity. Data changes and needs to be redacted for GDPR and other data laws. What do you do when someone demands you delete all personally identifiable data but you’ve burned it into invoice ids that you need to retain for other legal reasons? This is also begging for collisions and very much at odds with making IDs short.

* easy to remember - This is a nice to have. Short is convenient for sharing on the phone. Memorable didn’t matter much. I don’t remember any invoice number I’ve ever received.

* versioned - Versioning is only interesting because you’re trying to derive from real data. Again, accidental complexity.

* easy to index - Sure.

* sortable - Nice to have at best.

link

mkleczek 186 days ago

> * Based on the data they identify

> * easy to remember

(which means human readable and related to the actual information which makes them easier to remember)

These actually are the most important features.

Example: transaction references not related to the actual subject of the transaction (ie. what is being paid for) is enabler for MITM scam schemes.

> Short is convenient

Nah. Short is crucial for identifiers to be effective for computers to handle (memory and CPU efficiency). Otherwise we wouldn't need any identifiers and would just pass raw data around.

> * versioned - Versioning is only interesting because you’re trying to derive from real data.

Nah. Even UUID formats contain version information.

> * easy to index - Sure.

> * sortable - Nice to have at best.

These are directly related (and in the context of UUIDv4 vs UUIDv7 discussion sortable is not enough - we also want them to be "close" to each other when generating so that they can be indexed efficiently)

link