Hacker News new | ask | show | jobs
by YetAnotherNick 1876 days ago
Umm.. 12 bytes of randomness has 7*10^28 unique uuid per second. We generally take order of square root of this due to birthday paradox which means that if you generate less than something like 10^10 UUID per second you couldn't get collision.
3 comments

Well, yes. As I said, it'll be "totally fine". :)

The problem is, the submission said "collision-free". This isn't "collision-free", it's "collisions are so unlikely you don't need to worry even under extremely conservative assumptions, assuming you have a decent source of randomness". And that's good enough for me, absolutely.

But...if that's good enough, then really, any of the common UUID and UUID-like schemes will be good enough. I'm taking the submission title to mean either "guaranteed collision-free" or, failing that, as "more collision resistant than some alternatives", or maybe "guaranteed collision-free in some specific use cases". But this doesn't seem to be any of those; it's just a timestamp and enough randomness that you'll be safe under reasonable assumptions.

In short, I'm not saying "don't use this because you'll get collisions", I'm saying "even though as a practical matter you won't see any collisions, I don't think the collision free label is appropriate".

Let's do some math. Assume you generate a million UUID per second. The probability of collision in a second is c=1-e^((-10^6)^2/(2*7*10^28))=-7*10^-18. The probability it will collide once in next 10 years is 1 in a billion(seconds in 10 year * c).
> you couldn't get collision.

You mean probably. With which you mean there's a low chance of this happening. For that one second. But there are many more seconds after that. Plus you have to assume your PRNG never has any weird hiccups - such as when someone does VM shenanigans.

This isn't "guaranteed" to be collision free at all.

I'm not playing dice when I want implement reliable systems.

The collision risk depends on the application and use case. In sufficiently large real systems, a probabilistic pseudo-UUID with only 96 bits of entropy has a collision probability that is very small but not so small that you can treat it as effectively zero if avoiding collisions is critical. I’ve seen multiple systems that generate unique identifiers at rates > 10^9 per second. The 128-bit size has a lot of advantages, but probabilistic identifiers are expressly disallowed for some applications and contracts because of the collision tail risk. Making the identifier larger than 128-bits has other significant costs so that generally is not an option either.

That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

There's also the risk of bad randomness sources and/or bugs.

One popular UUID library got a bug report stating: "We are generating about 1M UUID4 a day, and we are getting several hundred collisions a day". And so they were; turned out to be a bug/weird interaction between the OpenSSL library they were using for randomness and forking. (Details here, although it was all fixed years ago of course: https://github.com/ramsey/uuid/issues/80)

On paper, you should never, ever, ever see a collision when generating a mere million v4 UUIDs a day, much less hundreds of collisions. But that doesn't mean it can't happen!

This is also an interesting bit of analysis; comes from a company that processed a lot of UUIDs generated in browsers, checked, and discovered about 5 collisions per million UUIDs. Again, not what you'd naively expect! (Turned out to be mostly driven by misbehaving crawlers.) https://medium.com/teads-engineering/generating-uuids-at-sca...

> That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

Are you sure? Fast cryptographic hashes are around one cycle per byte on a recent processor, and slower methods aren't that much slower.

I'm curious, what systems were those that generate >10e9 UUIDs/s?
Many types of telemetry and sensor data models; vast networks of machines can generate a stupendous quantity of events. UUIDs are not good design choice in these cases but it is a hard requirement for some users for compatibility with legacy systems which require a 128-bit unique identifier for everything.

Another issue with probabilistic UUIDs, possibly more important than collisions, is that they don’t compress very well when stored.

POV: You're AWS and logging every API call.

Just as an example where you could easily reach high ballparks, though I doubt it's that many.

Maybe some some distributed computing applications on server farms otherwise? Certainly not anything that you'd store in a regular database.