Hacker News new | ask | show | jobs
by magicpointer 1838 days ago
About UUID as Primary Key and performance, the following article has some insights and benchmarks as well: https://www.2ndquadrant.com/en/blog/sequential-uuid-generato...

Essentially, they observed sizeable performance improvements by using UUID generators that are tweaked to get more sequentia resultsl. It results in better indexes. The articles compares sequences, random UUIDs and 2 kinds of sequentialish UUID generators.

4 comments

Mentioned this in a sibling comment:

There's another benefit to UUID - You can generate them anywhere including application side. Doing this on application side would have tremendous batching benefits or inserting objects with relationships at the same time (Vs waiting first insert to return an ID to be used in the FK).

Caveat programmer: this could be problematic, not in the sense it doesn't work, but in the sense that someone working on backend code may have a preconceived expectation that UUIDs are also effectively a keyspace i.e. they're hard to guess. The validity of that is already challenged by variants defining temporal or logical order, and evaporates completely if you let clients declare their own that you accept at face value. Applications may have potentially guessable/gameable object identifiers sloshing around inside as a consequence, which is modestly ironic given that one benefit many folks expect from adopting UUIDs in the first place is hardening up the attack surface of trivially enumerable sequences.

There are a few mitigations but my favourite is the "casino chips" approach: pregenerate them server side, and allocate to clients on demand, including en masse if need be ("here kid, have a few million UUIDs to get you started"). Verify with whatever simple signature scheme comes with your application server framework, or at small scale just toss them in a crude LRU store.

Or, remember where the UUID came from, and apply their organisational scope to any trust you place upon it. This might work particularly for multi-tenanted SaaS. However it requires that all usage is tenant-bounded end-through-end throughout your application. This may be in conflict with a) your framework, b) your zenlike contemplation of simplicity in data management, or c) programmers in a hurry forgetting to scope their queries properly.

Ultimately, relying on UUIDs as intrinsically unguessable security tokens is probably not a great idea, but it's one that remains thoroughly embedded in the programming zeitgeist. As ever, nothing useful comes without a compromise. Keep your eyes open to the systemic consequences of design choices, and don't leave traps for your fellow developers.

He's not saying clients can create their own ids. The applications can.

The concepts he's talking about are required for cqrs. Which is a popular concept applied with mostly DDD or microservices.

There definitely are people out there in this thread proposing clients be able provide UUIDs. I’ve seen it elsewhere too.

I’ve also personally experienced UUID collisions due to badly set up VM environments under Windows. It isn’t a good idea to blindly trust any value - and that includes supposedly ‘never collide’ id’s like UUID.

For what it’s worth, I also had the joy of debugging someone’s distributed hash table that was using md5 as the hash bucket key (this was... 2 decades ago?) and had no way to handle collisions because obviously that is impossible.

This seems more an issue of the libraries random generator to form uuids.

Eg. I use guids ( .net) and i have never seen an issue.

I getcha, but these days the ambit reach of "application" extends to Javascript executing client-side in an environment that's basically overrun with lions/tigers/bears, and I'll suggest that's particularly a consideration when the front-end is a SPA participating in a CQRS/event-sourced overall application architecture.
For perspective, the npm uuid package is now being downloaded ~50M/week. It's usage is ubiquitous at this point, on any platform JS is running.

https://www.npmjs.com/browse/depended/uuid

Little bit later to reply.

Unfortunately that doesn't mean much.

Since nodejs is a server side language and can handle that package too. And it's not "solely" for js/spa's.

Should you ever use a plain token (where you just check if it exists in some authed_users table) vs, I dunno, some sort of signed/HMAC type thing, where you have to call some function on it? I genuinely don't know but I know enough to generally leave authentication up to those that do know.

Maybe I'm just thinking of OAuth where there are multiple hops involved?

The comparison to OAuth is quite reasonable. Perhaps the most obvious parallel is the use of a state parameter during the three-legged exchange, without which it's exposed to a CSRF clickjacking attack.
Right. Maybe it's paranoid, but it seems like a bearer token has potential avenues for forgery (CSRF or others), replay attacks, add-on jacking, etc. Also harder to coordinate with distributed apps. I think the Captain Tightpants approach would be to initialize some client-side private key/cert, use that to sign each request and verify based on that cert.

That should also make it easier to, say, verify and unwrap the request at the gateway to the server, before sending it to the rest of the application-proper.

In Java there is a UUID generator based on SecureRandom. That's about as unguessable as you're going to get.
It's not a question of whether UUIDs can be generated unguessably. They can be, as you point out.

It's whether the UUIDs in your system can be reliably presumed to be unguessable - including the UUIDs that were generated by code which was written after you wrote your query that assumes unguessability.

Today you might say "Oh, this SecureRandom-based UUID generator is unguessable and meets all of our requirements". Tomorrow you might say "Ah, this SecureRandom-based UUID generator is too slow, let's generate them in our Android app instead of on the server". But now the UUIDs stored in your database aren't reliably unguessable, because you accept whatever your client API tells you without verification. How plausible is it, within the timeframes you actually get, to review every query for whether it assumes the trustworthiness of the UUID generation? Better to assume UUIDs have some convenient properties, than to assume that they're unguessable just because the API is cryptographically secure today.

I think you can probably get the same "batching benefits" if you use a global ID generation service of some sort, with sequential IDs to improve indexing. Using sequential IDs doesn't necessitate using auto-generated sequential IDs.
correct, but global service need local caching. Some "HiLo" type identity generation could work. Sequential-UUID simplifies it as you don't need that service, so it's preferable.

HiLo explanation below:

A client simply gets a range of ids to be used, exclusive to them. Then can use it without any roundtrip to server.

This can be achieved with a "hi" stored on service, and a batchSize that's constant in the system.

Each time a "hi" is requested, it's incremented by 1. Now the client can generate (hi * batchsize, (hi+1) * batchsize - 1).

The Postgres JDBC driver does not guarantee that batch inserts come back in the same order that you insert them (when you use RETURNING *). So, if you generate UUIDs server-side, you can't conveniently match them up with the records you just inserted. You're better off generating them in the app server first and then sending them to Postgres.
Plus, it's way better to generate a UUID in the app server and send it to the server as a string. The reason is that you can deal with the id as a plain string and don't have to deal with a non-standard datatype in your app.
Another benefit to UUIDs is merging tables/databases without key conflict.
Another benefit is that clients can generate and store objects before sending them to the server.

Allowing simple caching, easy async, easy handling of offline, or far simpler clientside code for dealing with those objects. etc.

For mobile app development which relies on an online (http) backend, clientside generatable UUIDs offer almost only benefits.

There's nothing that prevents you from fetching a batch of N IDs from the server. Server just does +N on the sequence and you can do whatever you like with the IDs client side.

You can also use one sequence for everything on the server, and then you can also pre-create ID based relationships client side.

You can just use PostgreSQL's writeable CTEs to get the same batching benefits plus the benefits from using serials. So, no, I do not think batching is a good reason for using UUIDs.
Writable ctes sound like a very niche feature I haven't used and don't plan to use. It feels like a good reason to me.
I use a ulid[1] as a uuidv4 replacement:

https://github.com/ulid/spec

They almost got it right, a better implementation would overflow regularly to make use of the entire key space, and counter untuitively more resistant to overflows.

Clocks aren't reliable enough for timestamps anyways so garbage collection is the only thing you kinda wanna rely on them for.

A good sweet spot seems to be, 32bit milliseconds + 96bit of entropy. This overflows appeoximately every 50 days, allowing for 50 day rolling data retention.

Not the worst idea—50 days is a nice sweet spot between infrequent enough to have some indexing benefit and frequent enough that potential downsides will be discovered early in the product’s life cycle.

Personally I wouldn’t do this. A scenario where for each individual millisecond of elapsed time, 96 bits of entropy is an upgrade over 80 bits of entropy, is fairly extreme. I don't think there are many databases in the world which would ever need more collision mitigation than that.

> I don't think there are many databases in the world which would ever need more collision mitigation than that.

Individual instances? Maybe not. But for those an autoincrement key would also work. That is not the scenario that ULIDs and GUIDs are advertised for.

The goal is to have an universally/globally unique ID. So whenever you encounter two IDs you can be (resonably, probability wise) sure that they won't collide.

Any such sheme thus must, by definition, serve every single use case now and forever everywhere. That's a tough one.

Also it's not really 80bit vs 96bit (which due to the birthday paradox is already a huge difference) but more 80bits vs. 128bit as the timestamp is recycled with sufficient usage.

I'm actually concerned that 96bit isn't enough, as it relies on the assumption that you'll use this scheme for for data spanning years, in order to properly use the timestamp as entropy.

I was debating between using ULID vs just using the same sequential bigint sharding strategy that Instagram uses[1][2].

I ended up deciding to use sharded bigints because it enables better indexing, even though there are drawbacks when first writing the instances; The benefit of faster search was more important for me.

[1]: https://instagram-engineering.com/sharding-ids-at-instagram-...

[2]: http://www.livinginthepast.org/2016/02/24/sharding-into-bigi...

Would you recommend this over https://github.com/ericelliott/cuid ? I use cuid with Postgres and it’s worked out great.
They seem like what I'm looking for. I was looking at moving to something more random because our current UUIDs (uuid1) are too easy to mistake for one another at a glance.
I was hoping someone would mention these. Its really quite nice. 128bit, sortable, client or server generated, no collision (well almost zero: 80bits are random per millisecond)
Careful with leaking sensitive information with semi-sequential UUIDs though.
To alleviate the issue of having a sequential part, they make it wrap around so that you cannot tell the order between two UUIDs. It's already some protection, and the random part is still large.