Hacker News new | ask | show | jobs
by kardos 535 days ago
So why are they 3.27x slower to insert? Are they 3.27x longer in string form?
1 comments

It's likely a function of the fact that `gen_random_uuid()` is implemented in C [0], and is essentially just reading from `/dev/urandom`, then modifying the variant and version bits. Whereas, assuming they're using something like what was described here [1], that's a lot of function calls within Postgres, which slows it down.

As an example, this small function that makes UUIDv4:

    postgres=# CREATE OR REPLACE FUNCTION custom_uuid_v4() RETURNS uuid AS $$
        SELECT encode(set_byte(set_byte(gen_random_bytes(16), 6, (get_byte(gen_random_bytes(1), 0) & 15) | 64), 8, (get_byte(gen_random_bytes(1), 0) & 63) | 128), 'hex')::uuid;
    $$ LANGUAGE sql;
Took 14.5 seconds to create / insert 1,000,000 rows into a temp table, compared to 7.1 seconds for `gen_random_uuid()`.

[0]: https://doxygen.postgresql.org/uuid_8c.html#a6296fbc32909d10...

[1]: https://blog.daveallie.com/ulid-primary-keys/

I don't think that's right. They show in the section titled "Generating" that the performance of calling the ULID function from SQL is only very slightly slower. It's the INSERT that performs worse.

Generally, inserting sorted values (like sequential integers or in this case, ULIDs) into a B-tree index is much faster than inserting random values. This is because inserted values go into the same, highly packed B-tree nodes, whereas random inserts will need to create a lot of scattered B-tree nodes, resulting in more pages written. Random values are generally faster to query, but slower to insert.

In this case I think the insert speed differences may come down to the sizes of the keys. Postgres's native UUID type is 128 bits, or 16 bytes, whereas the ULID is stored as the "text" type, encoded as base32, resulting in a string that is 26 bytes, plus a 32-bit string length header, so 240 bits in total, or 1.87x longer. In the benchmark, the ULID insert is about 3x that of the UUID. So the overhead may be not just the extra space but the overhead of string comparisons compared to just comparing 128-bit ints.

Edit: The article doesn't actually say which ULID implementation they use. The one implemented in PL/PGSQL mentioned in one of the article's links [1] is very slow. The other [2] is quite fast, but doesn't use base32. However, this [3] native C extension is fast, about 15% faster than the UUID function on my machine.

On my machine, using pg-ulid, inserting 1M rows was on average 1.2x faster for UUID than ULID (mean: 963ms vs 1131ms). This is probably all I/O, and reflects the fact that the ULIDs are longer. Raw output here: https://gist.github.com/atombender/7adccb17a95056313d0e8ff56....

Edit 2: They don't have an index on the column in the article, so my comment about B-tree performance doesn't apply here.

[1] https://blog.lawrencejones.dev/ulid

[2] https://blog.daveallie.com/ulid-primary-keys/

[3] https://github.com/andrielfn/pg-ulid

I assumed that they were storing the ULIDs as binary, in the UUID column type, as link 2 in your reply. If stored as TEXT, then yes, that absolutely would make a difference.

It’s also worth noting that unlike MySQL / SQL Server, Postgres does not store tuples clustered around the PK. Indices are of course still in a B+tree.

They show that they're storing the ULIDs as text. Quoting from the article:

   CREATE TABLE ulid_test(id TEXT);
I suspect their poor results come from their choice of ULID implementation. The native C implementation I tried out is faster than the Postgres UUID type when testing computation only.

I noticed a bug in their test: They call generate_ulid() with now(). But now() is an alias for transaction_timestamp(), which is computed once at the start of the transaction, so all the timestamps will be the same. They should be using clock_timestamp().

Good catch to both.