Hacker News new | ask | show | jobs
by Lazare 1876 days ago
First, for something like this, the details matter a lot. How many bits of randomness is this, how many bits used for the timestamp, what's the format, why does it yield the advantages claimed, and most of all, how does it provide collision resistance? Is it using a MAC address (like UUID v1/v6) or a custom namespace (like UUID v5), or what?

In this case...looking at the code.... It exposes two formats.

Format 1: 4 bytes to store the number of seconds since a custom epoch of... Sep 13 2020 12:26:40, and 12 bytes of randomness. Puzzling choice.

Format 2: 8 bytes to store the number of nanoseconds since the same custom epoch, and 8 bytes of randomness.

And the collision resistance...doesn't exist at all; it's just relying on 12 (or 8) bytes of randomness and a time prefix. Which seems like it'll be totally fine in practice, but if you actually care about collisions (and in my experience, almost everyone who thinks they do really doesn't), this is unlikely to be an optimal choice.

Functionally, I think this is closest to KSUIDs (https://github.com/segmentio/ksuid) which are also the result of combining a timestamp (with a custom epoch) and some randomness, and then concatenating them in a way which is not a valid UUID, will work efficiently as a DB index, and is extremely unlikely to collide. But KSUIDs are much more widely adopted and better documented.

2 comments

Umm.. 12 bytes of randomness has 7*10^28 unique uuid per second. We generally take order of square root of this due to birthday paradox which means that if you generate less than something like 10^10 UUID per second you couldn't get collision.
Well, yes. As I said, it'll be "totally fine". :)

The problem is, the submission said "collision-free". This isn't "collision-free", it's "collisions are so unlikely you don't need to worry even under extremely conservative assumptions, assuming you have a decent source of randomness". And that's good enough for me, absolutely.

But...if that's good enough, then really, any of the common UUID and UUID-like schemes will be good enough. I'm taking the submission title to mean either "guaranteed collision-free" or, failing that, as "more collision resistant than some alternatives", or maybe "guaranteed collision-free in some specific use cases". But this doesn't seem to be any of those; it's just a timestamp and enough randomness that you'll be safe under reasonable assumptions.

In short, I'm not saying "don't use this because you'll get collisions", I'm saying "even though as a practical matter you won't see any collisions, I don't think the collision free label is appropriate".

Let's do some math. Assume you generate a million UUID per second. The probability of collision in a second is c=1-e^((-10^6)^2/(2*7*10^28))=-7*10^-18. The probability it will collide once in next 10 years is 1 in a billion(seconds in 10 year * c).
> you couldn't get collision.

You mean probably. With which you mean there's a low chance of this happening. For that one second. But there are many more seconds after that. Plus you have to assume your PRNG never has any weird hiccups - such as when someone does VM shenanigans.

This isn't "guaranteed" to be collision free at all.

I'm not playing dice when I want implement reliable systems.

The collision risk depends on the application and use case. In sufficiently large real systems, a probabilistic pseudo-UUID with only 96 bits of entropy has a collision probability that is very small but not so small that you can treat it as effectively zero if avoiding collisions is critical. I’ve seen multiple systems that generate unique identifiers at rates > 10^9 per second. The 128-bit size has a lot of advantages, but probabilistic identifiers are expressly disallowed for some applications and contracts because of the collision tail risk. Making the identifier larger than 128-bits has other significant costs so that generally is not an option either.

That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

There's also the risk of bad randomness sources and/or bugs.

One popular UUID library got a bug report stating: "We are generating about 1M UUID4 a day, and we are getting several hundred collisions a day". And so they were; turned out to be a bug/weird interaction between the OpenSSL library they were using for randomness and forking. (Details here, although it was all fixed years ago of course: https://github.com/ramsey/uuid/issues/80)

On paper, you should never, ever, ever see a collision when generating a mere million v4 UUIDs a day, much less hundreds of collisions. But that doesn't mean it can't happen!

This is also an interesting bit of analysis; comes from a company that processed a lot of UUIDs generated in browsers, checked, and discovered about 5 collisions per million UUIDs. Again, not what you'd naively expect! (Turned out to be mostly driven by misbehaving crawlers.) https://medium.com/teads-engineering/generating-uuids-at-sca...

> That aside, generating probabilistic UUIDs at the rates where this is an issue is a bad idea regardless for performance reasons.

Are you sure? Fast cryptographic hashes are around one cycle per byte on a recent processor, and slower methods aren't that much slower.

I'm curious, what systems were those that generate >10e9 UUIDs/s?
Many types of telemetry and sensor data models; vast networks of machines can generate a stupendous quantity of events. UUIDs are not good design choice in these cases but it is a hard requirement for some users for compatibility with legacy systems which require a 128-bit unique identifier for everything.

Another issue with probabilistic UUIDs, possibly more important than collisions, is that they don’t compress very well when stored.

POV: You're AWS and logging every API call.

Just as an example where you could easily reach high ballparks, though I doubt it's that many.

Maybe some some distributed computing applications on server farms otherwise? Certainly not anything that you'd store in a regular database.

laptop% date -d 'Sep 13 2020 12:26:40Z' +'%s' 1600000000
Yes, the offset is a round decimal number, but I'm not sure why that makes it a very good epoch. The code actually has it as "16 * 10 * 8", but it seems like an obvious optimization would be to hard code it as, yes, 1600000000. But if you're hard coding that, you could hard code 1609459200 (for a round date) or 1620073869 (for when the initial commit was made) or whatever instead.

Any choice will inevitably be arbitrary, but this feels like it may have been chosen for odd reasons to me.