| This and other similar solutions seem awfully over-engineered. a) Hashes are constant size (20 bytes for SHA1) b) You only care if they're present or not in the database. There's no associated variable length data. The simplest yet very efficient format is simply a sorted array of "byte[20]", with binary-search as the lookup. No headers, no pointers, no custom format of any type. Literally just 20 x n bytes where 'n' is the number of hashes. Lookup of the 'n-th' hash is just multiplying 'n' by 20. If you really, really want a B-Tree (why?), just stuff it in any database engine. Literally anything will handle a single fixed-length key lookup efficiently for you. CREATE TABLE "HIBP" ( "SHA1" BINARY(20) PRIMARY KEY );
SELECT 1 FROM "HIBP"
WHERE "SHA1" = 0x70CCD9007338D6D81DD3B6271621B9CF9A97EA00
There. I solved the blogger's problem in literally under 5 minutes without having to write a custom binary.You can trivially query databases from both web apps and CLI tools, and you can do this with batch queries too. E.g.: WHERE "SHA1" IN (... list... ) PS: Text-based tools (such as most shell tools) suck at this type of binary data. The newline terminated hex representation is 41 bytes per hash, so just over double the required size. Clever 5-10% prefix compression tricks pale in comparison to not doubling the data size to begin with. PPS: A pet peeve of mine is older Java database "enterprise" applications that use UCS-2 "nvarchar" text columns in databases to store GUID primary keys. The 16 byte GUID ends up taking a whopping 78 bytes to store! |
Roughly, instead of picking at 1/2 every time, you pick b/256 in, where b is the current byte value. That gets you log log n search. (You can of course pick 2 bytes, or 3, or...)