Hacker News new | ask | show | jobs
by Someone 4426 days ago
"But this is about 50 damn percent chance of collision on every new UUID generated at that point."

That's incorrect. If you go back to the birthday paradox, what you say would mean that, if you have 23 people in a room, and a 24th walks in, there is a 50% chance that that new person shares their birthday with one of those 23. That clearly is incorrect; that probability is at most 23/365 < 0.10.

Also, I don't see how your formula proves it. You will go through 64 of the 128 bits of key space in 3 years, but that means you only got through 2^-64 part of the key space, so each next UUID has a chance of about 2^-64 of a collision.

It's the sheer number of lottery draws with ever increasing probability of a loss that introduces the birthday paradox, not the last lottery.

1 comments

Indeed, thanks for clarifying my overly dramatic point with a more proper definition of the Birthday paradox.

Yet, even once corrected, 50% chance for collision in a set of 2^64 UUID (which can occur way earlier than 2^64) is way too much for me to go to sleep at night, knowing there are more efficient, smaller (often twice smaller, ex. 64bit PK instead of 128bit PK), guaranteed collision-free ways for producing PKs.

And there's another problem with UUID collision rates. Many versions of the UUID rely on PRNG, and PRNG quality varies wildly from system to system.

A defect in the PRNG (it has happened) can start producing UUID with orders of magnitude higher collisions than the model offers. It's a problem that's better not to have.

While I know it's out of favour, you don't have to create type 4 (random) UUIDs. Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'.

This may bring issues in untrusted contexts, but it doesn't involve global (or indeed any) synchronisation outside of each machine, and you may define your 'machine' for UUID purposes to be any size you like -- one per core may work, or one per application, or per application thread even.

> "Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'."

That's a pretty big if there, especially considering how unreliable timers are, and the fact they can jump back and forth.

- Wall clock timers will jump forward or back and repeat time after NTP adjustment, or anyone else who adjusts the clock.

- Internal "elapsed time" timers, based on say the CPU clock may jump around or repeat when the source core changes on multicore machines, or you have processes sourcing their CPU clock from the core they're bound on.

- A node may be moved from machine to machine (and their clocks are not necessarily perfectly synched).

This is a hard lesson that distributed system designers learn over and over again: don't rely on random and don't rely on timers for identity. Good old incrementing counters have none of those issues and are absolutely trivial to implement. They never fail, never drift, never repeat, by definition. As I said a few posts above, the typical structure of a "unique id" based on a counter has three counters:

    1. nodename (or machine name if you will)

    2. namespace (you can have one per thread to avoid contention issues)

    3. local (monotonically increases within the namespace).
Let's say your node id is 256, and it runs on 4 threads, one per core + 1 thread for the scheduler. You need five namespaces:

256.0-4.* (where * is increasing monotonically 0, 1, 2, 3, 4...)

And once you have to reboot the node, you obtain five new namespaces:

256.5-9.*

And the * counters start over from zero on each. As for how big should each segment be, that depends on what you want to do with the nodes I suppose. But the 128-bit UUID size should never be some kind of guide regarding size; maybe you need three shorts, maybe you need three longs, maybe short.medium.long etc., it'll depend on the use case.