| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by benjiweber 4426 days ago

Wikipedia claims birthday paradox means

"only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%."

http://en.wikipedia.org/wiki/Universally_unique_identifier#R...

1 comments

mantrax5 4426 days ago

That's not as much as you might think in computer systems.

Now, if you want to mark your product SKU with an UUID you have an outstanding chance it'll be unique amongst other UUID SKUs in the scope of every shop selling your SKUs.

But people have been captured by the idea UUID is truly "universally unique", so you can use it for everything, at any volume.

Not at all. Let's say that we're implementing a distributed Actor system (or a similar message passing architecture) at the scale of Google. We'll be using UUID to tag every message to guarantee its identity and various messaging properties (like deliver just once etc.), because UUID is unique. While actor systems can reach millions of messages per server per second, here we'll use a humble 100k messages per second.

- They have over 2 million servers (2,000,000)

- Each of which generates 100k messages per second.

You reach a point of 50% collision rate (for every new UUID) in 3 years:

    log2 (2,000,000 * 100,000 * 3600 * 24 * 365 * 3)= ~64

2^64 is roughly the number of 128-bit UUIDs you need to reach that collision rate due to the Birthday paradox.

Now you might say, I have a lot of variables pegged to worst case scenario. Sure I have. But this is about 50 damn percent chance of collision on every new UUID generated at that point.

Things will become bad long before 50% collision chance if you use UUID.

link

Someone 4426 days ago

"But this is about 50 damn percent chance of collision on every new UUID generated at that point."

That's incorrect. If you go back to the birthday paradox, what you say would mean that, if you have 23 people in a room, and a 24th walks in, there is a 50% chance that that new person shares their birthday with one of those 23. That clearly is incorrect; that probability is at most 23/365 < 0.10.

Also, I don't see how your formula proves it. You will go through 64 of the 128 bits of key space in 3 years, but that means you only got through 2^-64 part of the key space, so each next UUID has a chance of about 2^-64 of a collision.

It's the sheer number of lottery draws with ever increasing probability of a loss that introduces the birthday paradox, not the last lottery.

link

mantrax5 4426 days ago

Indeed, thanks for clarifying my overly dramatic point with a more proper definition of the Birthday paradox.

Yet, even once corrected, 50% chance for collision in a set of 2^64 UUID (which can occur way earlier than 2^64) is way too much for me to go to sleep at night, knowing there are more efficient, smaller (often twice smaller, ex. 64bit PK instead of 128bit PK), guaranteed collision-free ways for producing PKs.

And there's another problem with UUID collision rates. Many versions of the UUID rely on PRNG, and PRNG quality varies wildly from system to system.

A defect in the PRNG (it has happened) can start producing UUID with orders of magnitude higher collisions than the model offers. It's a problem that's better not to have.

link

andrewaylett 4426 days ago

While I know it's out of favour, you don't have to create type 4 (random) UUIDs. Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'.

This may bring issues in untrusted contexts, but it doesn't involve global (or indeed any) synchronisation outside of each machine, and you may define your 'machine' for UUID purposes to be any size you like -- one per core may work, or one per application, or per application thread even.

link

mantrax5 4426 days ago

> "Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'."

That's a pretty big if there, especially considering how unreliable timers are, and the fact they can jump back and forth.

- Wall clock timers will jump forward or back and repeat time after NTP adjustment, or anyone else who adjusts the clock.

- Internal "elapsed time" timers, based on say the CPU clock may jump around or repeat when the source core changes on multicore machines, or you have processes sourcing their CPU clock from the core they're bound on.

- A node may be moved from machine to machine (and their clocks are not necessarily perfectly synched).

This is a hard lesson that distributed system designers learn over and over again: don't rely on random and don't rely on timers for identity. Good old incrementing counters have none of those issues and are absolutely trivial to implement. They never fail, never drift, never repeat, by definition. As I said a few posts above, the typical structure of a "unique id" based on a counter has three counters:

    1. nodename (or machine name if you will)

    2. namespace (you can have one per thread to avoid contention issues)

    3. local (monotonically increases within the namespace).

Let's say your node id is 256, and it runs on 4 threads, one per core + 1 thread for the scheduler. You need five namespaces:

256.0-4.* (where * is increasing monotonically 0, 1, 2, 3, 4...)

And once you have to reboot the node, you obtain five new namespaces:

256.5-9.*

And the * counters start over from zero on each. As for how big should each segment be, that depends on what you want to do with the nodes I suppose. But the 128-bit UUID size should never be some kind of guide regarding size; maybe you need three shorts, maybe you need three longs, maybe short.medium.long etc., it'll depend on the use case.

link

cnvogel 4426 days ago

But that probability includes the possibility for the first generated message to collide with the last one, and to store that many messages (100k/sec, 2mill servers, 8 bytes/UUID) is 4*10^17 (400 Petabyte) allone.

I guess most of the messages created in such a system will be emphemeral...

link