| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by derefr 2307 days ago

> If you really, really want a B-Tree (why?)

Cheap writes, in an OLTP sense?

I had a similar problem recently: discovering "everything" in a DHT (= asking each node for everything it has), and feeding the resulting documents into a message queue for processing, without wasting resources processing each object's thousands of duplicate copies found distributed in the DHT.

These objects had no natural key, so I had to use a content hash for deduplication. And there was no way to time-bound the deduplication such that I could bucket objects into generations and only deduplicate within a generation. I needed a big fat presence-set of all historical hashes; and I needed to add new hashes to it every time I found a new unique object.

B-Trees have a good balance of read- and write-performance, which is why they're used for database indices and (usually) as the lower-level persistence strategy of key-value databases.

> PPS: A pet peeve of mine is older Java database "enterprise" applications that use UCS-2 "nvarchar" text columns in databases to store GUID primary keys.

Endorsing this point and boosting it: DBMSes really haven't thought out how to efficiently store large identifiers.

For example, a DBMS could suggest that clients feed it UUIDv1s, and then break them down into separate hidden {mac:48, timestamp:60, clockseq:14} columns, enabling per-column compression. (In most installations I've encountered, there's only one client generating UUIDs for the DBMS anyway, so these would compress really well, in fact enabling a default packed format where the timestamp + 5 bits of clockseq are kept in a 1-bit-tagged uint64; and the full value-size is only needed for exceptions.)

2 comments

jiggawatts 2307 days ago

> Cheap writes, in an OLTP sense?

Sure, but HIBP is most certainly not an OLTP workload. It is updated infrequently as a batch process. The official mirror was last modified in July 2019!

Whenever a "new set" of millions of passwords are leaked, the HIBP maintainer(s) merge it with their existing data set of millions of passwords.

The "update" process is to download the new data set... and that's it. It's already sorted.

The only step I'm suggesting is to simply convert the pre-sorted HIBP SHA1 text file to a flat binary file. This takes a constant time and requires only a tiny buffer in memory.

link

jiggawatts 2306 days ago

> DBMSes really haven't thought out how to efficiently store large identifiers.

This 1000x.

I really wish SQL Server had a "hash key" type where the hash is SHA512. You could then use anything as a primary key or an index, irrespective of size.

link