Hacker News new | ask | show | jobs
by jandrewrogers 1877 days ago
The collision risk depends on the application and use case. In sufficiently large real systems, a probabilistic pseudo-UUID with only 96 bits of entropy has a collision probability that is very small but not so small that you can treat it as effectively zero if avoiding collisions is critical. I’ve seen multiple systems that generate unique identifiers at rates > 10^9 per second. The 128-bit size has a lot of advantages, but probabilistic identifiers are expressly disallowed for some applications and contracts because of the collision tail risk. Making the identifier larger than 128-bits has other significant costs so that generally is not an option either.

That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

3 comments

There's also the risk of bad randomness sources and/or bugs.

One popular UUID library got a bug report stating: "We are generating about 1M UUID4 a day, and we are getting several hundred collisions a day". And so they were; turned out to be a bug/weird interaction between the OpenSSL library they were using for randomness and forking. (Details here, although it was all fixed years ago of course: https://github.com/ramsey/uuid/issues/80)

On paper, you should never, ever, ever see a collision when generating a mere million v4 UUIDs a day, much less hundreds of collisions. But that doesn't mean it can't happen!

This is also an interesting bit of analysis; comes from a company that processed a lot of UUIDs generated in browsers, checked, and discovered about 5 collisions per million UUIDs. Again, not what you'd naively expect! (Turned out to be mostly driven by misbehaving crawlers.) https://medium.com/teads-engineering/generating-uuids-at-sca...

> That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

Are you sure? Fast cryptographic hashes are around one cycle per byte on a recent processor, and slower methods aren't that much slower.

I'm curious, what systems were those that generate >10e9 UUIDs/s?
Many types of telemetry and sensor data models; vast networks of machines can generate a stupendous quantity of events. UUIDs are not good design choice in these cases but it is a hard requirement for some users for compatibility with legacy systems which require a 128-bit unique identifier for everything.

Another issue with probabilistic UUIDs, possibly more important than collisions, is that they don’t compress very well when stored.

POV: You're AWS and logging every API call.

Just as an example where you could easily reach high ballparks, though I doubt it's that many.

Maybe some some distributed computing applications on server farms otherwise? Certainly not anything that you'd store in a regular database.