Hacker News new | ask | show | jobs
by twiss 310 days ago
> The chances of generating two GUIDs that are the same is astronomically small.

> The odds are 1 in 2^122 — that’s approximately 1 in 5,000,000,000,000,000,000,000,000,000,000,000,00.

This is true if you only generate two GUIDs, but if you generate very many GUIDs, the chance of generating two identical ones between any of them increases. E.g. if you generate 2^61 GUIDs, you have about a 1 in 2 chance of a collision, due to the birthday paradox.

2^61 is still a very large number of course, but much more feasible to reach than 2^122 when doing a collision attack. This is the reason that cryptographic hashes are typically 256 bits or more (to make the cost of collision attacks >= 2^128).

5 comments

i have seen in my life two guid collisions already. and i'm not that old.

One of them was genuine - generated by different systems, and it was caught when loading data from one to another - object had same ID, but different underlying type.

Other one was due to 'error' - two systems(by different companies, supporting the same data exchange standard) used magic hardcoded guid that turned out to be the same.

Both of those systems have full audit trail - each change created new row in database and IDs were formatted as {NAMESPACE}.{GUID}.{TIMESTAMP}. Mutation of an object created new entry with different {TIMESTAMP} part. Namescapes are mandated by standard, so different systems can have the same namespace value.

There are either bugs in the system or the GUID isn’t random. The first case you mention is probably both TBH; the second case is probably due to non-randomness (generating via namespace/timestamp leads to collisions when two objects are generated simultaneously).
Sorry, when I was young I did not know what these `public static final UUID` mean, so I copied them.
Both vendors probably copied that GUID from the same place.
The birthday paradox simplified : if you generate n bits of random data, you can at most generate n/2 bits of random numbers before clashes start to occur. That's square root of number's range.

So if you need 1000 random numbers, generate from 1 to 1 million.

> So if you need 1000 random numbers, generate from 1 to 1 million.

If you don't check for clashes, the 50% chance of failure is too much. Probably even 0.1% is too much, so you'd need more elaborate approach.

If you do check for clashes, you can generate from 1 to 2000 with little overhead.

You can also look at the expected number of collisions instead, which is approximately the number of random numbers squared, divided by the size of the space of random numbers.

Then you can choose how many collisions to accept on average. (If the answer is zero, then it makes more sense to look at the probability of one or more collisions.)

I always assumed that intuitively... I think the number is 20 people for the birthday paradox. 20 x 20 = 400, and there are ~365 days in a year. Is that how that works?
The actual number is 23: https://en.wikipedia.org/wiki/Birthday_problem

The square root approximation works well for large numbers, but leaves out some factors that are relevant for small numbers.

I was always surprised the math maths for birthdays. Human birthdays are not random, and cluster around various dates and seasonal patterns.
Here is a statistical analysis of birthdays https://www.zippia.com/advice/most-least-common-birthdays/
Doesn't the clustering make collisions strictly more likely?
2^61 isn't even that large, well within the compute budget of mere mortals.
Counting to 2^61 probably is.

To actually find a collision in 128b cryptographic hash function it would take closer to 2^65 hashes. Back of the envelope calculations suggest that with Pollard's rho it would cost a few million dollars of CPU time at Hetzner's super-low prices. Not nearly mere mortals budget, but not that far off I guess.

A GUID is not a cryptographic hash function.

In any case, in 2023 I back-of-the-envelope estimated that you could compute 2^64 SHA256 for ~$100K, using rented GPU capacity https://www.da.vidbuchanan.co.uk/blog/colliding-secure-hashe...

That's great analysis. As you call out in the post, the 2^64 value is used to attack SHA256-128 (SHA256 truncated to 128 bits). NIST recommends at least SHA-224, which makes sense given your conclusions.
Depends on what “isn’t even that large means”. A modern 6ghz machine would probably need 12 years of 24/7 operation to count that high. To me that seems like a lot.
That's assuming 1 IPC, and no parallelism. A desktop-class zen5 CPU has 32 threads, with AVX512. Pipelining gets you up to 2048 bits of SIMD throughput per core per clock cycle: https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teard...

So assuming you use 64-bit counters, you can divide those 12 years by 1024 to get 4 days.

And that's not even considering what you could do on a GPU.

Edit: I might be off by a factor of 2, not sure if the SIMD throughput is per-core or per-thread. Also thermal throttling. Same ballpark though!

Well UUID generation isn’t going to be quite as SIMDable as counting so the analogy breaks down there partially because of that. And += 1 isn’t a very SIMDable operation? Unless I guess you create a mask of +1, +2, +3, +4 and add that to your base number to generate those offsets (which only works with avx512 - avx2 can only do 2 increments since these are 64bit integers)

Then your 32 HT threads aren’t really going to give you full access to the underlying SIMD registers which are going to be per core which is where I assume you realized the 2x difference might show up?

And to do += 1 multithreaded you have to partition the range or you won’t get any speed up - if you don’t amortize the cost of atomic synchronization across threads you’re going to be going slower than a non-SIMD increment.

Yeah, but a nation state server farm can probably cut that down to minutes because their budget can buy a lot of processors. You only need a few hundred to really shrink it down to manageable numbers. And it turns out that nation starts aren't the only ones that have this budget
What's the threat here?

It's trivial to force a collision. Here's the same UUID twice:

6e197264-d14b-44df-af98-39aac5681791

6e197264-d14b-44df-af98-39aac5681791

Typically, you don't care about UUIDs that aren't in your system and you generate those yourself to avoid maliciously generated collisions. Your system can't handle 2^61 IDs. It doesn't have the processing power, storage, or bandwidth for that to happen. Not to mention traditional rate limiting.

The last several comments were responding to

>2^61 is still a very large number of course, but much more feasible to reach than 2^122 when doing a collision attack. This is the reason that cryptographic hashes are typically 256 bits or more (to make the cost of collision attacks >= 2^128).

I'm not sure. 6ghz is around 2^61 CPU cycles in 12 years. I.E. basic CPU instructions; counting, not computing a cryptographic hash. Otherwise, where is the cluster that's bruteforcing ~122 bit cryptographic hash collisions in minutes?
I think you might have trouble if you tried to assign one to every iron atom in an iron filing.
2^61 guids is... 36 exabytes, if my napkin math is correct. when storing them in binary format(16 bytes each) if doing the javascript thing and storing them as strings... (shudders) I don't even want to think about it.

Anyhow that was my first thought when you mentioned 2^61 guids, where are you even going to put them? second thought, I don't think enumerating 2^61 guids is trivial, in fact, I suspect it would take longer than anyone would be willing to spend, and if you are not storing them why are you generating them?

And what even is a guid collision attack? it is not like they are a hash, and since they tend to be public identifiers it turns out despite their stated use to prevent collisions, you can't really use guids generated by others(if they wanted collisions they would straight up just copy yours) so you end up regenerating them anyway.

* not the birthday paradox, but the birthday bound.