| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sonium 870 days ago
	You probably should not use UUIDs to start with in your database at least not as an ID. UUIDv7 aims solve some of the issues of UUIDv4 that are even less suitable in for databases. 99% of times using BigInt for an ID is better.

7 comments

cjblomqvist 870 days ago

There are some nice features of using UUIDs rather than ints. It's been written about before, a few on the top of my head: Client side generation of ids. No risk of faulty joins (using the wrong ids to join 2 tables can never get any hits with UUIDs, it can with ints).

Those two sucks for us right now (planning to move to UUIDs).

dankebitte 870 days ago

Uniqueness aside, UUIDs for public-facing IDs also prevent enumeration attacks and leaking business information other than timestamps.

mdasen 870 days ago

> No risk of faulty joins

Wouldn't Snowflake IDs also solve that problem? A Snowflake ID will fit within a signed 64-bit int.

https://en.wikipedia.org/wiki/Snowflake_ID

The nice thing about a Snowflake ID is that you can encode it into 11 characters in base 62. If I have a UUID, I'm going to need 22 characters. Maybe that doesn't really matter given that 11 characters isn't something someone will want to be typing anyway and Snowflake IDs do require a bit of extra caution to make sure you don't get collisions (since the number you can make per second is limited to how big your sequence generation is).

j16sdiz 869 days ago

the same idea, but this is a IETF standard.

throwaway83623 869 days ago

The "faulty joins" can be solved by having a shared sequence for all tables. A bigint column should be enough for most use cases.

j16sdiz 869 days ago

shared sequence for all tables can't be parallelized

CooCooCaCha 870 days ago

Sorry but that’s terrible advice. I’ve worked on projects that started with integer ids and it caused nothing but problems.

giva 870 days ago

What kind of problems did you encounter?

welder 870 days ago

Not OP, but I can answer this:

Integers don't scale because you need a central server to keep track of the next integer in the sequence. UUIDs and other random IDs can be generated distributed. Many examples, but the first one that comes to mind is Twitter writing their own custom UUID implementation to scale tweets [0]

[0]: https://blog.twitter.com/engineering/en_us/a/2010/announcing...

sgarland 869 days ago

> Integers don't scale because you need a central server to keep track of the next integer in the sequence.

They most assuredly do scale. [0]

Also, Slack is built on MySQL + Vitess [1], the same system behind PlanetScale, which internally uses integer IDs [2].

[0]: https://www.enterprisedb.com/docs/pgd/latest/sequences/#glob...

[1]: https://slack.engineering/scaling-datastores-at-slack-with-v...

[2]: https://github.com/planetscale/discussion/discussions/366

CooCooCaCha 869 days ago

I get what you’re saying but this feels like a premature optimization that only becomes necessary at scale.

It reminds me a bit of the microservices trend. People tried to mimic big tech companies but the community slowly realized that it’s not necessary for most companies and adds a lot of complexity.

I’ve worked at a variety of companies from small to medium-large and I can’t remember a single instance where we wish we used integer ids. It’s always been the opposite where we have to work around conflicts and auto incrementing.

sgarland 869 days ago

In the same vein, distributed DBs are not required for most companies (from a technical standpoint; data locality for things like GDPR is another story). You can vertically scale _a lot_ before you even get close to the limits of a modern RDBMS. Like hundreds of thousands of QPS.

I've personally ran MySQL in RDS on a mid-level instance, nowhere near close to maxing out RAM or IOPS, and it handled 120K QPS just fine. Notably, this was with a lot of UUIDv4 PKs.

I'd wager with intelligent schema design, good queries, and careful tuning, you could surpass 1 million QPS on a single instance.

welder 869 days ago

Auto-incrementing integers mean you're always dependent on a central server. UUIDs break that dependency, so you can scale writes up to multiple databases in parallel.

If you're using MySQL maybe integer ids make sense, because it scales differently than PostgreSQL.

sgarland 869 days ago

If the DB fails to assign an ID, it's probably broken, so having an external ID won't help you.

If you're referring to not having conflicts between distributed nodes, that's a solved problem as well – distribute chunked ranges to each node of N size.

akvadrako 869 days ago

The way to solve that is giving each server it's own range of IDs.

giva 870 days ago

Yes, but with PostegreSQL (and any other SQL server I'm aware of) you already have a central server that can do that. If you have multiple SQL server this won't work obv, unless you pair it with a unique server ID.

CooCooCaCha 869 days ago

I recently worked on a data import project and because we used UUIDs I was able to generate all the ids offline. And because they’re randomly generated there was no risk of conflict.

This was nice because if the script failed half way through I could easily lookup which ids were already imported and continue where I left off.

The point is, this property of UUIDs occasionally comes in handy and it’s a life saver.

sgarland 869 days ago

    postgres=# CREATE TABLE foo(id INT, bar TEXT);
    CREATE TABLE
    postgres=# INSERT INTO foo (id, bar) VALUES (1, 'Hello, world');
    INSERT 0 1
    postgres=# ALTER TABLE foo ALTER id SET NOT NULL, ALTER id ADD GENERATED 
               ALWAYS AS IDENTITY (START WITH 2);
    ALTER TABLE
    postgres=# INSERT INTO foo (bar) VALUES ('ACK');
    INSERT 0 1
    postgres=# TABLE foo;
     id |     bar
    ----+--------------
      1 | Hello, world
      2 | ACK
    (2 rows)

lazide 870 days ago

This doesn’t really help you in this case, because the patch is to generate the UUIDs in the database?

welder 870 days ago

Now you can use PG to generate the UUIDv7 in the beginning then easily switch to generating in the client if you need in the future, but I think OP was talking about UUID vs auto-incrementing integer in general not specific to Postgres.

lazide 870 days ago

IMO it’s always been easier to generate them in the client. Every major platform has had libraries since forever.

dventimi 870 days ago

They also leak information.

brodo 870 days ago

I encountered this once: If you use integer IDs, try to scale horizontally, and do not generate the IDs in the database, you'll get in deep trouble. The solution for us was to let the DB handle ID generation.

giva 870 days ago

Yes, but the only sane way to generate integer IDs is in the database.

nevir 870 days ago

Here are some reasons for using UUIDs; not apply to all businesses:

- client-side generation (e.g. can reduce complexity when doing complex creation of data on the client side, and then some time later actually inserting it into to your db)

- sequential ids leak competitive information: https://en.wikipedia.org/wiki/German_tank_problem

- Global identification (being able to look up an unknown thing by just an id - very useful in log searching / admin dashboards / customer support tools)

int_19h 866 days ago

It's also much easier to merge data from different sources when they all use UUIDs for row identification.

itslennysfault 869 days ago

I would never advise this. I use UUIDv4 for basically everything. It adds minimal overhead to small systems and adds HUGE benefits if/when you need to scale. If you need to sort by creation date use a "created" column (or UUIDv7 if appropriate).

If your system ever becomes distributed you will sing the praises of whoever choose UUID over an int ID, and if it never becomes distributed UUID won't hurt you.

Note: this is for web systems. If it's embedded systems then the overhead starts to matter and the usefulness of UUID is probably nil.

jandrewrogers 869 days ago

It is worth mentioning that the reason UUIDv4 is strictly forbidden in some large decentralized systems is the myriad cases of collisions because the "random number" wasn't quite as random as people thought it was. Far too many cases of people not using a cryptographically strong RNG, both unwittingly or out of ignorance that they need to.

Less of an issue if you have total control of the operational environment and code base, but that is not always the case.

mort96 869 days ago

How does this happen? Are people implementing UUIDv4 themselves using rand() or equivalent? Or has widely used UUIDv4 libraries had such bugs?

jandrewrogers 869 days ago

It comes in a couple common flavors. Most commonly it is people just rolling their own implementation and using a PRNG or similar. Not every environment has a ready-made UUIDv4 implementation, and not all UUIDv4 implementations in the wild are strict. A rarer horror story I've heard a couple times is discovering that the strong RNG provided by their environment is broken in some way. Both of these cases are particularly problematic because they are difficult to detect operationally until something goes horribly wrong.

The main reason non-probabilistic UUID-like types are used for high-reliability environments is that it is easy to verify the correctness of the operational implementation. It isn't that difficult to deterministically generate globally unique keys in a distributed system unless you have extremely unusual requirements.

sgarland 869 days ago

It adds a lot of overhead at any scale, it’s just that the overhead is hidden due to the absurd speed of modern hardware.

I’ll again point out (I said this elsewhere in a post today on UUIDs) that PlanetScale uses int PKs internally. [0] That is a MASSIVE distributed system, working flawlessly with integers as keys. They absolutely can scale, it just requires more thoughtful data modeling and queries.

[0]: https://github.com/planetscale/discussion/discussions/366

samlambert 869 days ago

GitHub also uses int PKs and has over 100,000,000 users.

mbork_pl 870 days ago

One reason you might want not to use integers for stuff like user ids is that you may leak the information about the magnitude of your userbase.

kkzz99 870 days ago

Huh, why would using uuidv4 be a problem? Collisions?

davydog187 870 days ago

My understanding is that they cause a lot of page fragmentation, which leads to excessive writes to the WAL

brodo 870 days ago

You can't use them as cursors, because they are not inherently ordered like integer ids.

kiitos 869 days ago

IDs are one thing, cursor-able fields/columns are a different thing.

You cursor on timestamps, or serial numbers, or etc., not IDs.

dventimi 870 days ago

I have never wanted to use database cursors and I predict that I never will.

brodo 870 days ago

Sure, if you don't offer pagination or only have small tables, you can get away with offsets. I tend to go for cursors as a default because I like to build applications with performance in mind and it’s the same effort.

dventimi 869 days ago

We may be talking about different things. I thought you were referring specifically to [database cursors](https://en.wikipedia.org/wiki/Cursor_(databases)) so that's what I was talking about. If you're talking about something else, like the concept of so-called "cursor-based pagination" in general, then that is still an option even with even randomly-generated primary keys, so long as there are other attributes that can be used to establish an order (which attributes need not be visible to the a user)

dventimi 869 days ago

I have offered pagination over large tables, without database cursors or non-random keys, without offsets, while keeping performance in mind, with little effort.

mariusor 870 days ago

How about the rest of us?

I don't have a resource of the top of my head to present to you, but in the least keyset pagination is superior to the offset one because it does not get invalidated by new inserts.

dventimi 869 days ago

The rest of you also don't need database cursors or non-random keys to have keyset pagination.

mrits 870 days ago

When working with distributed systems or compiling a IDs from different systems it is helpful to make sure the ID is unique

akvadrako 869 days ago

That's especially true if you care about performance and do a lot of joins, the hit can be over 10%.