Hacker News new | ask | show | jobs
by jerrysievert 2952 days ago
uuid's aren't guaranteed to be unique at generation, there is still a non-zero chance of it having a collision. using it as a primary key to be generated by the database helps mitigate that, as there will normally be a uniqueness clause on the index.

creating that uuid on the client likely will not accomplish what you're hoping.

4 comments

It's pretty unlikely to get a collision[1], and the client should be able to handle the gracefully (regenerate the UUID and retry).

[1] https://www.quora.com/Has-there-ever-been-a-UUID-collision

The rarity of UUID collisions depends on the quality of implementation. The Quora answers assume UUIDv4 with a high-quality RNG.

I was able to reproduce UUIDv1 collisions at will when the timestamp had microsecond resolution and the clock sequence had to be generated randomly each time. That is, I simply had to get two processes to generate the same fourteen bits within a microsecond.

That only seems possible if both processes were running on the same physical machine (and hence share the same input MAC address).

It's a nasty corner case, but I'm sure the UUID designers considered it a valid tradeoff, since I believe the algorithm was designed for a distributed scenario. In the single-machine case an atomic counter would be a much easier solution with very reasonable efficiency anyway. Still, it might have been clever to also include the local process id in the UUID, I wonder why they didn't to that.

At any rate the problem is easily worked around by running the two processes on different machines, i.e. ensuring you have at most one UUID generating process per host (with respect to the database table in question).

> It's pretty unlikely to get a collision

unlikely is not zero, which was why I commented. I'd hate to rely on uniqueness of something that has a chance of not being unique.

If you generate 1 billion UUIDv4s a second, it would take, on average, 85 years for you to produce a duplicate, and the resulting list of UUIDs would take up ~45 exabytes. And keep in mind that even if inserting a row fails because you've somehow managed to generate a duplicate UUID, it is trivial to make a new UUID and retry. Since the database enforces the uniqueness constraints of primary keys, I'm hard pressed to come up with a scenario in which generating a duplicate UUID would actually do anything serious.
GP pointed out the special version of the algorithm that handles collisions gracefully. Note that this algorithm ("open addressing") is so frequently used that you will probably find a variation of it in pretty much any piece of software you have ever used. It's a well-understood method, not only from a practical perspective but also in terms of theory; check out e.g.: https://en.wikipedia.org/wiki/Linear_probing
Primary keys on postgresql are unique (and non null). It’s not possible to change that behavior as it’s the definition of a PK field.

You can thus create more UUID fields that are unique and non-null by specifying that instead of “primary key” on column creation.

https://www.postgresql.org/docs/8.1/static/ddl-constraints.h...

Can't you just enforce uniqueness anyways for that field? And then have the client retry?
Yes, we are planning to just fail, as the chance of a collision is really low and the user will then be able to just retry creating the object.
You can easily guarantee uniqueness within the system without any coordination.
> uuid's aren't guaranteed to be unique at generation

If they are not unique, they are not really UUIDs. In which case you should tweak the algorithm to make uniqueness guaranteed. Like add a client id and its logical time in there.

The difference here seems to be the exact definition of "unique".

A UUID has a finite length, so if we generate N new UUIDs in a finite time-interval it seems clear that - in theory - we must have collisions for large values of N (or even infinite N), regardless of how clever we are in seeding our random number generator. But I don't think this can be fixed without using a variable length value or refusing to mint new UUIDs once the available bits are used up.

Of course in practice that should not really happen; at least if we only run on hardware from vendors where we can assume that every MAC address will be unique, the only way to actually get a colission - in practice - would be by generating more than 2^B UUIDs on the same machine within a fairly short timeframe and fairly large B.

EDIT: In v1 of the algorithm it seems that B=14 and the "short timeframe" is the smallest resolution increment your system clock supports. That is if we assume the "uniquifying" clock sequence is produced by incrementing a counter. So we can say that for practical purposes, a collision is impossible on a modern system unless we have invalid MAC assignments to our hardware, unreasonable transaction rate, or an incorrect implementation of the UUID algorithm.

Am I missing something?

> by generating more than 2^B UUIDs on the same machine within a fairly short timeframe and fairly large B

Which can be made impossible.

IDs are part of the finite system and don't exist on their own. And the system can perform a finite number of operations both concurrently and in a finite time-interval with some amount of uniqueness distinguishing nodes and other useful properties. Making it possible to always find an id generating algorithm for this system that can never have collisions.

I agree, if we are allowed to make the UUID arbitrarily large (and not be limited to the 128 bits from the "official" UUID algorithm), it should always be possible to set B to a large enough value so that our scheme assigns a unique and finite-sized UUID to every possible state of the system, since all the system's parameters (number of servers, time, etc) should be discrete and finite. I.e. basically make B large enough that 2^B becomes greater than the number of discrete timesteps in our experiment. That's an interesting observation.
You can get away with a lot less than 128 bits for most systems. It's only that big because it tries to be universal.