Hacker News new | ask | show | jobs
by troyk 4426 days ago
I keep seeing the uuid idea for pkeys, but have yet to see them used by any big site. Last I looked, twitter, facebook are using bigints all over the place. Also, instagram went bigint and blogged about it here http://instagram-engineering.tumblr.com/post/10853187575/sha...

Also, I tried the uuid extension once... It is not well supported, had to make a small change to a c header file to get it to compile on ubuntu, for dev on os x I think I gave up.

2 comments

The "api userID" for facebook/twitter/etc. is numeric because it's always been numeric (and facebook explicitly doesn't guarantee anything about their lengths IIRC); that doesn't imply anything about what their real pkey is, just that it was numeric back in 2003. It's pretty easy to transform a UUID back and forth into a number.

I've used UUID-as-pkey and had it work well.

UUID aren't sortable by order of creation (on a per server basis which is useful in distributed algorithms), and while you'll hear a lot of talk about meteors, lightnings and winning the lottery, you do risk collisions (yes, when it comes to thousands of rps per server, millions of servers, and extended periods of time, you reach ridiculous collision rates even on a 128-bit UUID, thanks to the Birthday paradox).

UUID has no guaranteed properties you can rely on in high volumes.

The only desirable property it has is that it's an easy way out in tutorials.

The better solution is simply longer to explain (have a unique name for each server, an incremental generation number every time the server is restarted, and an incremental number on top of that for the runtime of a given generation - way Erlang does it).

Wikipedia claims birthday paradox means

"only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%."

http://en.wikipedia.org/wiki/Universally_unique_identifier#R...

That's not as much as you might think in computer systems.

Now, if you want to mark your product SKU with an UUID you have an outstanding chance it'll be unique amongst other UUID SKUs in the scope of every shop selling your SKUs.

But people have been captured by the idea UUID is truly "universally unique", so you can use it for everything, at any volume.

Not at all. Let's say that we're implementing a distributed Actor system (or a similar message passing architecture) at the scale of Google. We'll be using UUID to tag every message to guarantee its identity and various messaging properties (like deliver just once etc.), because UUID is unique. While actor systems can reach millions of messages per server per second, here we'll use a humble 100k messages per second.

- They have over 2 million servers (2,000,000)

- Each of which generates 100k messages per second.

You reach a point of 50% collision rate (for every new UUID) in 3 years:

    log2 (2,000,000 * 100,000 * 3600 * 24 * 365 * 3)= ~64
2^64 is roughly the number of 128-bit UUIDs you need to reach that collision rate due to the Birthday paradox.

Now you might say, I have a lot of variables pegged to worst case scenario. Sure I have. But this is about 50 damn percent chance of collision on every new UUID generated at that point.

Things will become bad long before 50% collision chance if you use UUID.

"But this is about 50 damn percent chance of collision on every new UUID generated at that point."

That's incorrect. If you go back to the birthday paradox, what you say would mean that, if you have 23 people in a room, and a 24th walks in, there is a 50% chance that that new person shares their birthday with one of those 23. That clearly is incorrect; that probability is at most 23/365 < 0.10.

Also, I don't see how your formula proves it. You will go through 64 of the 128 bits of key space in 3 years, but that means you only got through 2^-64 part of the key space, so each next UUID has a chance of about 2^-64 of a collision.

It's the sheer number of lottery draws with ever increasing probability of a loss that introduces the birthday paradox, not the last lottery.

Indeed, thanks for clarifying my overly dramatic point with a more proper definition of the Birthday paradox.

Yet, even once corrected, 50% chance for collision in a set of 2^64 UUID (which can occur way earlier than 2^64) is way too much for me to go to sleep at night, knowing there are more efficient, smaller (often twice smaller, ex. 64bit PK instead of 128bit PK), guaranteed collision-free ways for producing PKs.

And there's another problem with UUID collision rates. Many versions of the UUID rely on PRNG, and PRNG quality varies wildly from system to system.

A defect in the PRNG (it has happened) can start producing UUID with orders of magnitude higher collisions than the model offers. It's a problem that's better not to have.

While I know it's out of favour, you don't have to create type 4 (random) UUIDs. Including a machine identifier and timestamp in the generation of your UUID guarantees uniqueness, so long as each machine is careful not to generate two UUIDs 'simultaneously'.

This may bring issues in untrusted contexts, but it doesn't involve global (or indeed any) synchronisation outside of each machine, and you may define your 'machine' for UUID purposes to be any size you like -- one per core may work, or one per application, or per application thread even.

But that probability includes the possibility for the first generated message to collide with the last one, and to store that many messages (100k/sec, 2mill servers, 8 bytes/UUID) is 4*10^17 (400 Petabyte) allone.

I guess most of the messages created in such a system will be emphemeral...