by cetra3 1310 days ago
If you're using them for unguessable random strings then yeah, they're not ideal.

If you're using them for providing a unique id in a distributed system, with very little chance of collision & fitting them in a db column, then they are great.

9 comments

Pretty much, my first reaction was "people use UUIDs for session tokens? Why?"

Seems like author made some bad choices in previous systems and now just figured out why tbh.

I’m not sure it’s bad to use a random UUID (v4) generated with a random number generator designed for cryptography for a validated session key.

A guess means making a request to your server. You won’t be concerned with ~2^64 guesses per second.

I’m not suggesting anyone do it, if you have a choice. (Especially considering you’ll probably have to go through the trouble of justifying it to people who read articles like this but don’t understand the math.) But if you have an existing system, consider whether you can let it stand.

Well, existing system (that for whatever reason can't do CSPRNG-> base64) could always concat 2 UUIDs
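For reference, the "CSPRNG -> base64" route is only a couple of lines in most languages. A minimal Python sketch, assuming the standard-library `secrets` module; the 20-byte (160-bit) size and the name `new_session_token` are illustrative choices, not something taken from the article or this thread:

```python
import secrets

def new_session_token(num_bytes: int = 20) -> str:
    """Return a URL-safe base64 token built from num_bytes of CSPRNG output."""
    return secrets.token_urlsafe(num_bytes)

# Each call draws fresh bytes from the operating system's CSPRNG.
token = new_session_token()
print(token)
```

With 20 bytes of input the token comes out at 27 URL-safe characters, shorter than the 36-character UUID text form while carrying more random bits.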
Depending on the UUID algorithm, some are cryptographically sufficient true random, then it would make sense..
what if you sha it?
Adding a crypto hash lets you check that the hashed value was not changed, because finding another value with the same hash is hard, by definition of a crypto hash.

But here the problem is not forging an ID, it's guessing an ID, and hashing does not widen the search space, does not increase randomness.

> Adding a crypto hash

I think the poster you replied to was meaning using the hash output as the token, not that you would maintain the original token and a salted hash for verification.

If they are thinking SHA(GenerateUUID()) would have better entropy then they are incorrect even though all SHA variants output more than the 128-bits in the source UUID. I assume such misunderstanding comes from the fact that some PRNGs are based upon repeated application of cryptographically assured hash functions against the seed data.

Using some unreversible transform would solve the issue of potentially leaking information in the UUIDs, but if that is an issue then instead use a UUID variant based on purely random data (v4?) as that would be more efficient and not result in value that is longer but contains no extra entropy.

That actually reduces the usefulness as you're hashing the data into a smaller length.
It seems UUIDs are 128 bits, while SHA-1 is 160 bits. There are also SHA-256 and SHA-512 for longer hashes. So there shouldn't be any worries about the hash being shorter.
Rereading I am guessing you're merely pointing out that the comment regarding shortening the length is untrue. If you already understand the entropy issue here, please treat my "you"s as royal you's.

You have a 128 bit value. That's 128 binary digits. Each digit can be zero or one. That means you have 2^128 possible distinct values. (Ignoring the fixed bits in UUIDs since it's not important for sake of this argument.)

Now you use a one-way cryptographic hash on top, like sha256. This will return a specific hash for any given input. It is always the same for a specific given input, and it is nearly always distinct. The output that a hash has may have more bits, but the number of distinct values can't increase; it can only ever decrease. That's because you could only ever give it 2^128 different values. How could it ever return more outputs if each input corresponds to one output?
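The counting argument can be checked on a toy scale. The sketch below shrinks the input space from 2^128 to 2^16 so it runs instantly (that reduction is an assumption made purely for illustration) and counts how many distinct SHA-256 digests come out:

```python
import hashlib

# Toy input space: every possible 16-bit value (2**16 inputs) instead of 2**128.
inputs = [i.to_bytes(2, "big") for i in range(2**16)]

# Each SHA-256 digest is 256 bits wide, far wider than the 16-bit input.
digests = {hashlib.sha256(x).digest() for x in inputs}

distinct_outputs = len(digests)

# The set of digests can never be larger than the set of inputs.
print(distinct_outputs, "distinct digests from", len(inputs), "inputs")
```

However wide the digest is, an attacker who knows the inputs were 16-bit values only has 2^16 candidates to try; the same reasoning applies to hashed UUIDs.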

To make it more clear, let's say you have a database where you want to store a customer's zip code so you can use it as some kind of validation later on to ensure it matches, but you don't want to store it in plaintext, so you hash it. The hash is 160 bits. Secure, right? Wrong. There are fewer than 50,000 zip codes. It would be trivial to calculate the hash of every single one and use the results as a simple hashmap from hashed value to plaintext.
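A rough sketch of that lookup-table attack, assuming unsalted SHA-1 (160 bits) and enumerating every 5-digit string rather than only the real zip codes, which is a superset and still tiny. The sample value "90210" and the helper name `sha1_hex` are placeholders:

```python
import hashlib

def sha1_hex(text: str) -> str:
    """Unsalted SHA-1 of an ASCII string, as a hex digest."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()

# Precompute hash -> plaintext for every 5-digit string (100,000 entries).
lookup = {sha1_hex(f"{n:05d}"): f"{n:05d}" for n in range(100_000)}

stored_hash = sha1_hex("90210")   # what the database would hold
recovered = lookup[stored_hash]   # the "secret" falls out of a dict lookup
print(recovered)
```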

You may be thinking this is impractical for an input domain as large as 2^128, but realistically it only adds a slight roadblock. Knowing the only valid values will be hashed UUIDs, instead of picking 160 random bits, you'd be much better off picking a random UUID, hashing it, and trying that for each attempt.

Yes, some hashes might not meaningfully hurt it, but they won’t add any entropy, which is the real problem.
Not being snarky: what's the risk of using UUIDs for session tokens if they are created by the server/db and are always verified by server (db) (for authorisation etc)?
Well, V4 UUIDS per wiki are pretty random, but your generated UUID could actually use your MAC address and current time to be globally unique. So, less entropy. Just use them as a (globally) unique thing but not as a secret.
Basically, know your UUID generator type. V1, V2, V6 and V7 are MAC/time dependent and more useful for e.g. DB keys, whilst V4 is more useful for things that should actually be secret.
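To see the difference between generator types, here is a small Python sketch using the standard `uuid` module. It only illustrates v1 versus v4; other languages and libraries may choose different defaults:

```python
import uuid

u1 = uuid.uuid1()   # timestamp + clock sequence + node id (often the MAC address)
u4 = uuid.uuid4()   # random bits; CPython takes them from os.urandom

# A v1 UUID exposes its structured parts directly.
print("v1:", u1, "node:", hex(u1.node), "timestamp:", u1.time)

# A v4 UUID has no such structure beyond the version and variant bits.
print("v4:", u4, "version:", u4.version)
```

Anyone holding a v1 value can read the node and the creation time straight out of it, which is why it works as an identifier but not as a secret.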
So there's nothing actually wrong with UUIDs as secrets, if you know what you're doing and how to mitigate the risks?

So pretty much the same as every other damn thing in software that gets an "X Considered Harmful" article? :-D

I would trust a reputable cryptographic random number generator library to really care about generating truly unguessable, high entropy cryptography-grade random numbers. I would trust a reputable UUID library to generate a UUIDv4 which is random enough to not produce a collision. I would not trust a reputable UUID library to generate truly unguessable, high entropy cryptography-grade UUIDv4s.
Not really. The article's point is that even a v4 UUID (the random one) doesn't have as much randomness as other options, and it has a much less compact representation.

UUIDs are not designed to be secrets, so they are a poor choice. They'll probably work, but there are better options.

If you know what you're doing and mitigating the risks, you don't waste your time trying to use UUIDs for secrets. Therefore people using UUIDs for secrets, by definition, don't know what they're doing and certainly aren't mitigating the risks.
From my experience even from bigger companies it is sometimes common practice.
Yeah I don't really get the point of this article, if you need random values of a specific size don't use uuid, it's literally specified to be one exact length and format.
>>Yeah I don't really get the point of this article,

To get clicks?

You're not wrong lol
The number of comments saying "using UUIDs for secrets isn't that bad" suggests this article needs to be written...
one exact length and five "versions" of the format (so far)

https://en.wikipedia.org/wiki/Universally_unique_identifier#...

I made a comparison list with the most known uuids out there, a couple of days ago, it was quite fun discovering all the different kinds of uid and their pros/cons.

https://adileo.github.io/awesome-identifiers/

KSUIDs are fairly popular and missing from your list:

https://github.com/segmentio/ksuid

what's the resolution on those? 32 bits, 100 years.. that's seconds, right? doesn't sound excellent for time ordering. 100 years also seems a little short but at least I'll be dead
Don't look at it as being your problem in 100 years, but as helping employment in 100 years and helping the economy ;)
ULID example should be in uppercase.

Love this chart tho.

Also most well-designed systems only use the UUID as the representation format and use raw bits in performance-critical parts.
The raw bits are the UUID, the hex string is just a human-readable representation that also plays nicely with JSON.
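A quick Python illustration of that point, using the standard `uuid` module: the same value can be held as 16 raw bytes, as a 128-bit integer, or as the 36-character hex string, and all three convert back to the same UUID.

```python
import uuid

u = uuid.uuid4()

raw = u.bytes      # 16 raw bytes, suited to a binary or native uuid column
as_int = u.int     # the same value as a 128-bit integer
text = str(u)      # 36-character hex form, convenient for JSON and logs

# All three representations round-trip to an identical UUID.
same = uuid.UUID(bytes=raw) == uuid.UUID(int=as_int) == uuid.UUID(text)
print(len(raw), len(text), same)
```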
Tell that to Django (well 5 years ago anyways iirc, don't know what it does now). Pretty sure it used to store uuids as strings columns in your sql.
I suppose Django wouldn't consider the speed gains of using raw integers in the database worth the hassle of dealing with binary data when you have to manually deal with the database somehow. I usually use string columns for UUIDs myself for the same reason.

It's also not given that it'll be a performance benefit, you probably receive UUIDs as strings from some client and probably want to return UUIDs as strings to the client, and that conversion isn't free.

Yep, looks like it does the right thing in PostgreSQL but not anywhere else [0].

https://docs.djangoproject.com/en/4.1/ref/models/fields/#uui...

I feel like it did strings in postgres too, not too long ago and I had a <brain explode> moment when I worked on a codebase and had to figure out why queries were terrible
Or to PowerBI, which will cast any UUID to a string even in joins. That cast + string comparisons + killing of indexes is not conducive to performant queries...
It’s a 128 bit integer - the serialization format does not change the fact.
Use uint128_t instead.
It is also highly recommended that you include a check digit into it, to minimize the chance of a collision. I've used https://arthurdejong.org/python-stdnum for that purpose.
I don't see how a check digit minimizes the chance of collision. (Here, I'm assuming that a check digit is calculated from the other digits. What am I thinking about incorrectly?)
Looking at the docs for the library linked, it appears to be a Verhoeff algorithm check digit... so yeah, you're correct.

This is effectively a simplistic stand-in for a CRC type system -- useful to detect if the data has been corrupted, but not useful to avoid collisions.

And if someone is worried about UUID collisions, they need to rethink their priorities in life.
You are correct, this should teach me not to write comments when I'm too tired. :/

The check digit wouldn't really help with collisions, since if the strings are the same the digit will be too. They are primarily useful when we need to ensure correctness on human input.
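To make the distinction concrete, here is a sketch using the Luhn algorithm as a stand-in (the library discussed above uses Verhoeff; Luhn is swapped in only because it is shorter to write out). The sample number is the usual textbook example:

```python
def luhn_check_digit(payload: str) -> int:
    """Compute the Luhn check digit for a string of decimal digits."""
    total = 0
    # Walk from the right; double every second digit, starting with the rightmost.
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def is_valid(number: str) -> bool:
    """True if the last digit matches the check digit of the rest."""
    return luhn_check_digit(number[:-1]) == int(number[-1])

good = "7992739871" + str(luhn_check_digit("7992739871"))
typo = "79927398703"  # one digit of the payload mistyped

print(good, is_valid(good), is_valid(typo))
```

The check digit is a pure function of the payload, so two identical payloads always get identical check digits. It catches the typo, but it cannot make two colliding IDs differ.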

There's probably a non-trivial amount of folks that equate a UUID with "unguessable" given their appearance. They are, after all, not sequential and using them to obscure things like number of users (using a UUID in place of an incrementing number) seems like a natural fit.

Given how easy it is to generate a UUID in most languages, and given the low likelihood of a collision within a system - it wouldn't be a huge leap to think UUID's could replace homebrewed random string generators for things like password reset tokens, etc.

> There's probably a non-trivial amount of folks that equate a UUID with "unguessable" given their appearance.

That's near enough to true for anyone not operating at "web scale".

FAANG/BAT engineers need to care. My systems with 10s or 100s of thousands of users (or, you know, a few thousand users tops) are without doubt going to be re-written (probably several times) well before I have to worry about having so many UUIDs in the wild that this becomes a reasonable thing to worry about.

For me, at the scale of systems I run (or will conceivably run in the medium term future), I think the simplicity/understandability of code that uses native language UUID functions is "the right thing". Whoever does the next big rewrite to support a few million MAU will be thankful they don't have to work out WTF I was thinking when I decided to roll my own random access tokens.

I doubt FAANG engineers need to care either. Ignoring that the author imagines 8k IoT devices per living human for one service, 2^64 requests per second is an absurd number to use. Assuming one server can do 10M RPS, you'd need 1.8 trillion servers to handle that load. You'd also need over 2 billion Tb/s of bandwidth to receive just the UUIDs with no overhead.

It doesn't matter what computing resources your attacker has; the limit is how much your infrastructure can handle, and the author casually overestimates that by about 10 orders of magnitude. So replace 35 minutes with 350 billion minutes, or about 660,000 years.
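The back-of-envelope numbers above can be reproduced in a few lines. The inputs (2^64 requests per second, 10M RPS per server, 128 bits per UUID, a 10^10 scaling of the 35-minute figure) are the ones given in this comment, not independent measurements:

```python
requests_per_second = 2**64
server_rps = 10_000_000          # assumed capacity of a single server

# Servers needed to absorb the guessing traffic.
servers_needed = requests_per_second / server_rps

# Bandwidth for the bare 128-bit UUIDs alone, in terabits per second.
terabits_per_second = requests_per_second * 128 / 1e12

# Scale 35 minutes up by 10 orders of magnitude and convert to years.
minutes = 35 * 10**10
years = minutes / (60 * 24 * 365.25)

print(f"{servers_needed:.2e} servers")
print(f"{terabits_per_second:.2e} Tb/s")
print(f"{years:,.0f} years")
```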

Thanks for this. I thought I must be missing something because this seems like such an obvious point.

I find it hard to believe that there is a problem with a (cryptographically random) 122 bit session key considering that a brute force attack on it will result in a DDoS, which is obviously self limiting.

Lots of people here are saying “never use a uuid for a session key”, but I don’t understand this. What’s the accepted entropy for a session key?

I think the even more absurd rec is to use 160 bits as a "sweet spot"? Why? Who said that? Which real world scenarios? Why not 159 or 161...

Then you realize the author is just talking out their rear end with no thought...

"Yes I often find my cracking buddies with their super computers just give up hacking my online user service when I bumped my user token length from 159 to 160 length", said nobody, ever.

> "Yes I often find my cracking buddies with their super computers just give up hacking my online user service when I bumped my user token length from 159 to 160 length", said nobody, ever.

Reminds me of this sketch: https://youtu.be/IHfiMoJUDVQ

Even they shouldn't need to be concerned much with collisions. Wikipedia suggests[0] "generating 1 billion UUIDs per second for about 85 years". Is it possible? Sure. Is it likely? Not really.

[0]: https://en.wikipedia.org/wiki/Universally_unique_identifier#...
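That figure follows from the usual birthday-bound approximation, n ≈ sqrt(2 · N · ln 2) draws for a 50% chance of at least one collision among N equally likely values. A sketch, assuming 122 random bits per v4 UUID and a steady 1 billion UUIDs per second:

```python
import math

random_bits = 122                      # a v4 UUID has 6 fixed bits out of 128
space = 2**random_bits

# Approximate number of UUIDs needed for a 50% chance of any collision.
n_for_half = math.sqrt(2 * space * math.log(2))

seconds = n_for_half / 1e9             # at 1 billion UUIDs per second
years = seconds / (365.25 * 24 * 3600)

print(f"{n_for_half:.2e} UUIDs, about {years:.0f} years")
```

Under these assumptions the result lands in the mid-80s of years, in line with the Wikipedia figure.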

I guess from the article it's not just collisions, but the (significantly more likely) problem of guessing a UUID that's valid (out of all the issued tokens).

But yeah, even that is very very low risk. The article had to make some outrageously pessimistic assumptions to get its "38 minutes!" number. Issuing a million tokens a second with two year validity, and getting attacked with the entire hash rate of the bitcoin mining community. And having both enough backend capacity to handle all those requests while at the same time having no observability or rate limiting to mitigate a brute force attack.

> I guess from the article it's not just collisions, but the (significantly more likely) problem of guessing a UUID that's valid (out of all the issued tokens).

Assuming random UUIDs:

If you're counting all the UUIDs anyone makes, then valid<->attacker matches are a subset of all possible collisions and therefore less likely.

If your baseline is only the collisions between valid UUIDs, then whether an attacker is more or less likely to collide depends on whether they're generating UUIDs at least half as fast as the system they're attacking.

> That's near enough to true for anyone not operating at "web scale". FAANG/BAT engineers need to care.

I’d argue even then it’s really not much a concern. You’d need to generate 1 billion UUID v4’s per second for over 75 years to have a 50% chance of there being a single collision.

You can generate sequential UUIDs, IIRC, that’s the best way to store them in a db and still have good partitioning/indexing. I don’t use UUIDs often, but I vaguely remember researching this problem space at some point.
I think most languages let you choose which version of UUID you want, with most defaulting to the random version (I think 4?).

There are other versions that are sequential/time-based though, but using these could open the door to de-obfuscating whatever data you wanted to protect via UUID's in the first place (like how many sales orders you receive per hour, etc).

I don’t think uuids are designed for obfuscation, though they certainly help with that as a side effect. I could be wrong though, I’ve never looked into it.
They (randomized type 4 UUID's) obfuscate as a side effect because they are much more difficult to guess due to their randomness. As the article points out though, they are not impossible to guess... but it will come down to your risk tolerance and what the UUID's are "protecting".

People like to reach for UUID's when obfuscation is needed because inventing your own duplicate-aware random string algorithm isn't what most folks want to spend their time thinking about. Plus, these days, many databases come with UUID-aware data types that make using UUID's fairly straightforward.

UUIDs are a vast improvement over integers for preventing simple attacks like +/-ing the id and seeing what happens.
But then you're back to collisions, and you may as well be using longs.
I think v7 uses milliseconds since epoch + random data. The odds of a collision should be practically 0, or more likely to find a sha256 collision.
> more likely to find a sha256 collision.

This is obviously, and egregiously, false.

I don't know. You'd need quite a number of threads + machines generating uuids in the exact same millisecond to get an opportunity for a collision. It doesn't seem obviously false.
“Moving Away From Misusing UUIDs”
My only wish is that UUIDs were sortable and still contained their timestamp. When bug hunting, sometimes things become a little more obvious when there is an exact start and end to ids with issues.
There are KSUIDs that aim to satisfy this

A go ref impl: https://github.com/segmentio/ksuid

Depends on the version used. Some of them do encode time. But since people don’t like to leak information they use the random version (4).
They're little endian so not sortable
What does that have to do with anything?
>>>> My only wish is that UUIDs were sortable and still contained their timestamp. When bug hunting, sometimes things become a little more obvious when there is an exact start and end to ids with issues.

>>> Depends on the version used. Some of them do encode time.

Encoding time isn't enough, it has to be big endian (unless you write a special sorting function for uuids). Timestamped uuids store the timestamp as [timestamp_low, timestamp_mid, version(!), timestamp_high][1] which doesn't sort right.

[1] https://en.m.wikipedia.org/wiki/Universally_unique_identifie...
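A small Python demonstration of that ordering problem. The helper `v1_from_timestamp` is hypothetical: it builds a v1-layout UUID from a chosen 60-bit timestamp with the clock sequence and node fixed at dummy values, so that only the timestamp differs:

```python
import uuid

def v1_from_timestamp(ts: int) -> uuid.UUID:
    """Build a v1-layout UUID from a 60-bit timestamp (other fields are dummies)."""
    time_low = ts & 0xFFFFFFFF                        # lowest 32 bits come first
    time_mid = (ts >> 32) & 0xFFFF                    # then the middle 16 bits
    time_hi_version = ((ts >> 48) & 0x0FFF) | 0x1000  # top 12 bits plus version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0))

earlier = v1_from_timestamp(0x0FFFFFFFF)
later = v1_from_timestamp(0x100000000)   # exactly one tick after `earlier`

print(earlier, earlier.time)
print(later, later.time)
print("sorts in time order:", str(earlier) < str(later))
```

Because the low bits of the timestamp lead the string, the later UUID sorts before the earlier one as soon as the low 32 bits wrap around.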

According to that Wikipedia page the binary representation of UUID 1 is big endian. It's the date-time and MAC address version.
You can use ULID and store it as UUID since they are the same size. You can check this article for the details:

https://blog.daveallie.com/ulid-primary-keys

UUIDv7 is sortable by time but I’m not sure if it’s possible to derive the time stamp from the UUID somehow.
The first 48 bits of uuidv7 are the number of milliseconds since the epoch.
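As a sketch of how the timestamp can be read back: RFC 9562 lays UUIDv7 out as a 48-bit Unix timestamp in milliseconds, then the version, then random bits. The helper `make_uuid7_like` below is a hand-rolled illustration of that layout, not a library API:

```python
import secrets
import time
import uuid

def make_uuid7_like(unix_ms: int) -> uuid.UUID:
    """Assemble a UUIDv7-style value: 48-bit ms timestamp, version, variant, random bits."""
    value = (unix_ms & 0xFFFFFFFFFFFF) << 80   # bits 127..80: timestamp in ms
    value |= 0x7 << 76                         # bits 79..76: version 7
    value |= secrets.randbits(12) << 64        # bits 75..64: random
    value |= 0b10 << 62                        # bits 63..62: RFC variant
    value |= secrets.randbits(62)              # bits 61..0: random
    return uuid.UUID(int=value)

def timestamp_ms(u: uuid.UUID) -> int:
    """Recover the millisecond timestamp from the top 48 bits."""
    return u.int >> 80

now_ms = time.time_ns() // 1_000_000
u = make_uuid7_like(now_ms)
print(u, timestamp_ms(u))
```

Since the timestamp sits in the most significant bits, values created in different milliseconds also sort in creation order, both as integers and as hex strings.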
I’ve always liked the pattern of putting timestamps on any objects in my DBs.
I implemented it myself. Was a little bit tricky, but not rocket science.
MongoDB's ObjectId has this property.
Something I don't understand: how are UUIDs not safe given that they are probably better than 99.9999% of passwords generated by users?
Does your UUID library use a cryptographically safe RNG?
Java's does, and that's the implementation the article discusses.
But this is the point though, UUID is the wrong tool for the job. You want a cryptographically random blob of entropy and you reach for a UUID because it happens to contain some of that in a specific implementation.

UUIDs are for uniqueness and involve implicit trust. Cryptographic libraries are what you need to generate entropy blobs without weakening security/confusing the next developer etc.

UUIDs are nearly half the mac address of the server + a timestamp. They are in no way random.
That's UUID v1. The random one that everyone uses is v4.
I have seen some common libraries that default to v1, so I can see why there’s some confusion in here.
> Something I don't understand: how are UUIDs not safe given that they are probably better than 99.9999% of passwords generated by users?

UUIDs are 128 bits. Which is beat by a 5 character a-z random string.

It's certainly possible that they're better than the median password - especially if there isn't a check against a common password list. But it's pretty easy for user chosen passwords to be much, much better.

I strongly doubt that your 6 9s estimate is accurate.

> UUIDs are 128 bits. Which is beat by a 5 character a-z random string.

A sibling gives the actual math that shows how wrong this is, but this doesn't even pass the most rudimentary sniff test. The most common encoding for a lowercase string would be in 8 bits per character, so a 5 character string can get you at most to 40 bits.

And that's assuming you allowed every one of the 256 possible characters. You're restricting it down to 26 characters.

EDIT: I was curious, so I checked. Even if you allowed every current Unicode character, 5 characters only gets you to ~86 bits of entropy:

log2(149186^5) ~= 85.9

As for the original 6 nines claim, I also calculated the entropy for a 14 character random password that allows all 62 letters+numbers plus 8 special characters:

log2(70^14) ~= 85.8

It's not until 20 characters that it matches a UUID v4. So, yeah, I'm okay with OP's 6 nines.
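These figures are easy to re-derive. A short Python sketch, assuming every character is drawn uniformly and independently and taking 122 bits as the random content of a v4 UUID:

```python
import math

def entropy_bits(alphabet_size: int, length: int) -> float:
    """Entropy in bits of a uniformly random string over the given alphabet."""
    return length * math.log2(alphabet_size)

unicode_5 = entropy_bits(149186, 5)    # 5 characters from all of Unicode
password_14 = entropy_bits(70, 14)     # 14 characters from a 70-symbol set

# Shortest 70-symbol password that reaches 122 bits.
min_len = math.ceil(122 / math.log2(70))

print(f"{unicode_5:.1f} bits, {password_14:.1f} bits, {min_len} characters")
```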

128 bits are 16 bytes, which is at best a binary string of 16 characters. Remove some bits for the not random parts of the UUID and still you don't get down to 5 characters. Furthermore "a 5 character a-z random string" is less than 5 bits per character. Make them less than 6 by adding A-Z and the ten digits.

About storage, at least PostgreSQL has been using 16 bytes of storage since at least version 8 many years ago.

https://www.postgresql.org/docs/current/datatype-uuid.html

https://www.jacoelho.com/blog/2021/06/postgresql-uuid-vs-tex...

A 5 character a-z random string has log2(26^5) =~ 23.5 bits of entropy, way less than 128.
The best case for a 5 ascii character password is 7 * 5 = 35 bits.
Also UUID v3 and v5 produce IDs from identifiers such as URLs which can be quite useful if you want two different systems to generate the same exact UUID given knowledge of the same URL.

For example, in a REST system that needs UUIDs I'd use the REST URL of the object as the UUID.
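In Python this looks roughly like the following, using the standard `uuid.uuid5` with the predefined URL namespace. The URL itself is a made-up placeholder:

```python
import uuid

url = "https://api.example.com/orders/42"   # hypothetical REST URL

# Two systems that know the same URL derive the same UUID independently.
id_a = uuid.uuid5(uuid.NAMESPACE_URL, url)
id_b = uuid.uuid5(uuid.NAMESPACE_URL, url)

print(id_a, id_a == id_b)
```

Since the result is a deterministic function of the URL, it is an identifier and nothing more; anyone who knows the URL can compute it.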

The best format:

{opaqueTokenTypePrefix}_{crockfordEncodedEntropy}

Also: pass token through a bad words and "credit card lookalike" filter.

Optionally encode author cluster/region details in the low order bytes to resolve before eventual consistency in active-active systems.
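A minimal sketch of the first part of that format, assuming 160 bits of CSPRNG output and a simple integer-based encoding into the Crockford base32 alphabet. The prefix "sess" and the function names are placeholders; the bad-words filter and the region bytes are left out:

```python
import secrets

# Crockford base32 alphabet: digits plus letters, without I, L, O and U.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

def crockford_encode(data: bytes) -> str:
    """Encode bytes as a fixed-width string over the Crockford alphabet."""
    n = int.from_bytes(data, "big")
    width = (len(data) * 8 + 4) // 5    # ceil(bits / 5) characters
    chars = []
    for _ in range(width):
        chars.append(CROCKFORD[n & 31])  # take 5 bits at a time, lowest first
        n >>= 5
    return "".join(reversed(chars))

def new_token(prefix: str, num_bytes: int = 20) -> str:
    """Build '{prefix}_{entropy}' from num_bytes of CSPRNG output."""
    return f"{prefix}_{crockford_encode(secrets.token_bytes(num_bytes))}"

print(new_token("sess"))
```

The prefix makes the token type obvious in logs and to secret scanners, and the alphabet avoids characters that are easily confused when read aloud or retyped.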

> If you're using them for unguessable random strings then yeah, they're not ideal.

Why? I like to use them for private/secret URLs ...