Integers don't scale because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated in a distributed fashion, with no coordination between nodes. There are many examples, but the first one that comes to mind is Twitter writing their own custom ID generator (Snowflake, which is strictly speaking a 64-bit ID rather than a UUID) to scale tweets [0]
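To make the "no central server" point concrete, here is a rough sketch of a Snowflake-style generator. The 41/10/12 bit split follows the layout publicly described for Snowflake; the class name, the custom epoch, and the "borrow the next millisecond" overflow handling are my own assumptions, not Twitter's actual code. Each node only needs a unique worker ID assigned once, up front.

```python
import threading
import time

# Assumed layout: 41 bits of milliseconds, 10 bits of worker ID, 12 bits of sequence.
EPOCH_MS = 1_600_000_000_000  # arbitrary custom epoch chosen for this sketch
WORKER_BITS = 10
SEQ_BITS = 12
SEQ_MASK = (1 << SEQ_BITS) - 1


class SnowflakeLikeGenerator:
    """Generates 64-bit IDs locally; only the worker ID must be unique per node."""

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._seq = 0

    def next_id(self) -> int:
        with self._lock:
            # Never let the timestamp go backwards, even if the wall clock does.
            now = max(time.time_ns() // 1_000_000, self._last_ms)
            if now == self._last_ms:
                self._seq = (self._seq + 1) & SEQ_MASK
                if self._seq == 0:
                    now += 1  # sequence wrapped: borrow the next millisecond
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                | (self.worker_id << SEQ_BITS)
                | self._seq
            )


node_1 = SnowflakeLikeGenerator(worker_id=1)
node_2 = SnowflakeLikeGenerator(worker_id=2)
print(node_1.next_id(), node_2.next_id())
```

The IDs still fit in a BIGINT column and sort roughly by creation time, which is the main thing you give up with a fully random UUIDv4.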
I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.
It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.
I’ve worked at a variety of companies from small to medium-large and I can’t remember a single instance where we wished we had used integer IDs. It’s always been the opposite, where we had to work around conflicts and auto-incrementing.
In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.
I've personally run MySQL in RDS on a mid-level instance, nowhere near close to maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.
I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.
Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.
If you're using MySQL maybe integer ids make sense, because it scales differently than PostgreSQL.
If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.
If you're referring to not having conflicts between distributed nodes, that's a solved problem as well – distribute chunked ranges to each node of N size.
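A minimal sketch of the chunked-range idea, assuming a single coordinator that hands out blocks of N IDs. The names `RangeAllocator` and `Node` are made up for illustration; in practice the coordinator would be a row in a database or a sequence with a large increment.

```python
import threading


class RangeAllocator:
    """Stands in for the central coordinator: hands out disjoint blocks of IDs."""

    def __init__(self, chunk_size: int, start: int = 1):
        self.chunk_size = chunk_size
        self._next = start
        self._lock = threading.Lock()

    def lease(self) -> range:
        with self._lock:
            lo = self._next
            self._next += self.chunk_size
            return range(lo, lo + self.chunk_size)


class Node:
    """Assigns IDs locally and only contacts the allocator when its block runs out."""

    def __init__(self, allocator: RangeAllocator):
        self._allocator = allocator
        self._ids = iter(())

    def next_id(self) -> int:
        value = next(self._ids, None)
        if value is None:
            # Current block exhausted (or never leased): fetch a fresh one.
            self._ids = iter(self._allocator.lease())
            value = next(self._ids)
        return value


allocator = RangeAllocator(chunk_size=1000)
node_a, node_b = Node(allocator), Node(allocator)
print(node_a.next_id(), node_b.next_id())
```

The trade-off is that IDs are no longer globally ordered by creation time, and a node that crashes leaves a gap where the rest of its block would have been.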
Yes, but with PostgreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL servers this obviously won't work, unless you pair the ID with a unique server ID.
I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.
This was nice because if the script failed halfway through, I could easily look up which IDs had already been imported and continue where I left off.
The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.
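Roughly what that import pattern looks like, as a sketch only. I'm using an in-memory SQLite database as a stand-in for the real one, and the table name `items`, the helper names, and the `fail_after` switch are all invented for the example. In a real import the prepared rows would be written to a file first so the same IDs survive a restart.

```python
import sqlite3
import uuid


def prepare_rows(names):
    """Assign IDs offline, before any database connection exists."""
    return [(str(uuid.uuid4()), name) for name in names]


def import_rows(conn, rows, fail_after=None):
    """Insert rows that are not present yet; optionally crash part-way through."""
    done = {r[0] for r in conn.execute("SELECT id FROM items")}
    inserted = 0
    for row_id, name in rows:
        if row_id in done:
            continue  # already imported by an earlier run
        if fail_after is not None and inserted == fail_after:
            raise RuntimeError("simulated crash")
        conn.execute("INSERT INTO items (id, name) VALUES (?, ?)", (row_id, name))
        conn.commit()
        inserted += 1
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, name TEXT)")
rows = prepare_rows(["a", "b", "c", "d"])

try:
    import_rows(conn, rows, fail_after=2)  # first run dies after two rows
except RuntimeError:
    pass

resumed = import_rows(conn, rows)  # second run picks up the remaining rows
total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(resumed, total)
```

Because the IDs were fixed before the first run, "which rows are already in?" is a plain primary-key lookup rather than a fuzzy match on the data.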
postgres=# CREATE TABLE foo(id INT, bar TEXT);
CREATE TABLE
postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
INSERT 0 1
postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED
ALWAYS AS IDENTITY (START WITH 2);
ALTER TABLE
postgres=# INSERT INTO foo (bar) VALUES ('ACK');
INSERT 0 1
postgres=# TABLE foo;
 id |     bar
----+--------------
  1 | Hello, world
  2 | ACK
(2 rows)
Now you can have PG generate the UUIDv7 at first, then easily switch to generating it in the client if you need to later. That said, I think OP was talking about UUIDs vs. auto-incrementing integers in general, not specifically in Postgres.
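For the "switch to generating in the client" part, a hand-rolled UUIDv7 is only a few lines. This is a sketch following the RFC 9562 field layout (48-bit Unix millisecond timestamp, version 7, variant bits 10, the rest random); the function name and the optional `unix_ms` parameter are my own, and it skips the optional monotonic counter, so two IDs from the same millisecond are not ordered relative to each other.

```python
import os
import time
import uuid


def uuid7(unix_ms=None) -> uuid.UUID:
    """Build a version 7 UUID: 48-bit ms timestamp followed by random bits."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used
    rand_a = rand >> 68                 # top 12 bits
    rand_b = rand & ((1 << 62) - 1)     # low 62 bits

    value = (unix_ms & ((1 << 48) - 1)) << 80  # bits 127..80: timestamp
    value |= 0x7 << 76                         # bits 79..76: version
    value |= rand_a << 64                      # bits 75..64: rand_a
    value |= 0b10 << 62                        # bits 63..62: variant
    value |= rand_b                            # bits 61..0: rand_b
    return uuid.UUID(int=value)


print(uuid7())
```

The column type stays the same either way, so moving generation from the server default to the client doesn't need a schema change.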
I encountered this once: if you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get into deep trouble. The solution for us was to let the DB handle ID generation.
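A small sketch of how that kind of trouble typically shows up, assuming the app computes `MAX(id) + 1` itself. That is my guess at the failure mode, not necessarily what happened in the case above. SQLite stands in for the real database, and the table and helper names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO events (id, body) VALUES (1, 'seed')")


def next_id_in_app(conn):
    """The fragile approach: the application picks the next integer itself."""
    return conn.execute("SELECT COALESCE(MAX(id), 0) + 1 FROM events").fetchone()[0]


# Two app instances compute the "next" ID before either one has written.
id_a = next_id_in_app(conn)
id_b = next_id_in_app(conn)

conn.execute("INSERT INTO events (id, body) VALUES (?, ?)", (id_a, "from A"))
try:
    conn.execute("INSERT INTO events (id, body) VALUES (?, ?)", (id_b, "from B"))
    collided = False
except sqlite3.IntegrityError:
    collided = True  # both instances picked the same integer

# Letting the database assign the ID removes the race.
cur = conn.execute("INSERT INTO events (body) VALUES (?)", ("from B, DB-assigned",))
db_assigned_id = cur.lastrowid
print(collided, db_assigned_id)
```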
Here are some reasons for using UUIDs; not all of them apply to every business:
- client-side generation (e.g. it can reduce complexity when you build up complex data on the client side and only insert it into your DB some time later)
- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)
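Both points above fit in one small sketch. Everything here is hypothetical: the `store` dict stands in for real tables, and `build_order`, `save`, and `find_anywhere` are names invented for the example. The order and its line items get their IDs (and their foreign keys) before anything is written, and afterwards a bare ID is enough to find the record without knowing which table it lives in.

```python
import uuid

# Hypothetical in-memory "tables"; a real system would query each table or an index.
store = {"orders": {}, "line_items": {}}


def build_order(skus):
    """Build an order and its line items entirely client-side, IDs included."""
    order_id = uuid.uuid4()
    order = {"id": order_id, "status": "draft"}
    items = [{"id": uuid.uuid4(), "order_id": order_id, "sku": sku} for sku in skus]
    return order, items


def save(order, items):
    """Persist the already-linked records; no ID round-trip to the DB is needed."""
    store["orders"][order["id"]] = order
    for item in items:
        store["line_items"][item["id"]] = item


def find_anywhere(some_id):
    """Look up an ID of unknown type across every table."""
    return [(table, rows[some_id]) for table, rows in store.items() if some_id in rows]


order, items = build_order(["sku-1", "sku-2"])
save(order, items)
print(find_anywhere(items[0]["id"]))
```

With per-table integer IDs, the same lookup for `42` could match a row in every table, so support tools have to ask for the record type as well.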