Hacker News new | ask | show | jobs
by jiggawatts 2307 days ago
This and other similar solutions seem awfully over-engineered.

a) Hashes are constant size (20 bytes for SHA1)

b) You only care if they're present or not in the database. There's no associated variable length data.

The simplest yet very efficient format is simply a sorted array of "byte[20]", with binary-search as the lookup. No headers, no pointers, no custom format of any type. Literally just 20 x n bytes where 'n' is the number of hashes. Lookup of the 'n-th' hash is just multiplying 'n' by 20.

If you really, really want a B-Tree (why?), just stuff it in any database engine. Literally anything will handle a single fixed-length key lookup efficiently for you.

    CREATE TABLE "HIBP" ( "SHA1" BINARY(20) PRIMARY KEY );
    SELECT 1 FROM "HIBP"
    WHERE "SHA1" = 0x70CCD9007338D6D81DD3B6271621B9CF9A97EA00
There. I solved the blogger's problem in literally under 5 minutes without having to write a custom binary.

You can trivially query databases from both web apps and CLI tools, and you can do this with batch queries too. E.g.: WHERE "SHA1" IN (... list... )

PS: Text-based tools (such as most shell tools) suck at this type of binary data. The newline terminated hex representation is 41 bytes per hash, so just over double the required size. Clever 5-10% prefix compression tricks pale in comparison to not doubling the data size to begin with.

PPS: A pet peeve of mine is older Java database "enterprise" applications that use UCS-2 "nvarchar" text columns in databases to store GUID primary keys. The 16 byte GUID ends up taking a whopping 78 bytes to store!

7 comments

You can do better than binary search here with interpolation search since the data is equidistributed. (You can do it for any data with a known distribution, but it's unlikely to be worth it for any data set that isn't linearly distributed.)

Roughly, instead of picking at 1/2 every time, you pick b/256 in, where b is the current byte value. That gets you log log n search. (You can of course pick 2 bytes, or 3, or...)

Yes, and hence a flat array data file would allow notably better performance than this B-Tree solution.

The fastest method would be to treating the 20-byte hash as a floating point number in the range 0.0 and 1.0. This would let you use Newton's iteration method to very accurately guess the approximate location in the file and then narrow it down from there. By reading small blocks of 0.5-4KB you could probably get to the required hash in just 1-2 reads, which is either optimal or very close to it.

Do clever things to simple data!

As I said in the article, I rejected binary search and similar approaches on purpose.

In general, I rejected all approaches that operates on a sorted file. That's simply because keeping the file sorted is slow, when you need to insert more data in the future. With B-tree approach, you have fast searching, as well as, inserting. That's the reason.

The HIBP database is updated as infrequently as once a year.

Sorting a handful of gigabytes was a solved problem over 20 years ago. Sorting a file using an "online" operation like row-by-row insertion in a B-Tree is necessarily slower than any offline sort. Merge sort is particularly efficient with data that won't fit on disk.

If you do need "online" operation, a real database engine provides this (and more) for practically zero effort. It's almost as if... they were designed for this purpose!

There is no restriction on using okon only for HIBP passwords. Actually, the goal is to handle hashes, no matter where they come from. User may have reasons to update db frequently.

Yes, you're right that sorting data using B-tree is slower than other methods. But I don't use it only for sorting. It's not like 'create B-tree and forget about it'. You create the tree from initial data, and if you need to insert any more hashes there, you can do it really quickly, without a need to sort everything again.

I agree again. If I'd create okon only for my home usage, I'd probably use a real database. But I didn't want to depend on anything. I wanted to create a library that can be used by any program and just works, without forcing user to install anything else. That's why I sort the data on my own, and that's why I implemented B-tree on my own. Everything is in okon's codebase.

How do you use Newton’s method here to go faster than what I described? How do you treat a 20 byte hash as a floating point number? Unless you mean by dividing part of it at a time and interpolating linearly, which sounds like what I suggested :-)
Take the first 8 bytes, treat it as a 64-bit long integer, and then convert to a 64-bit floating point number and divide by 2^64 to get a number in the range 0.0-1.0.

Multiply this by the number 'n' hashes in your file, and it'll very closely approximate its ordinal in the list, because as you said hashes are equidistributed very well.

Now multiply this ordinal by 20 and load a small slice of the file a few KB to either side of where you're aiming so that there's essentially a 100% chance of finding the hash in one "random read" operation.

If not, you now have a block of hashes that is close, but not quite what you want. You can use the floating point conversion trick to see "how far away" you are, calculate a more accurate estimate and do a second read. It's very unlikely you'd need a third read step.

Yep! And here's a mini guide I wrote to building a SQLite DB that queries in sub-millisecond time:

https://gist.github.com/timmc/df5fbb6e069fb8c1c4e181a29930ac...

I didn't bother with using binary keys, and just used hex, because I couldn't be arsed to do something more than `.mode csv` and `.import hashes.lst hashes` at the time, but it would be easy to write a tiny Python program to halve the space, as you say.

1. The raw array solution doesn't work if you can't fit the entire thing in memory. You could try mmap'ing the file and letting the kernel do the paging for you, I guess.

2. There's nothing over engineered about a B-Tree. It's a simple, well understood data structure which solves this exact problem.

3. There are meaningful differences in what you can do with a solution that gives you answers in microseconds vs milliseconds, as the blogger mentions.

> The raw array solution doesn't work if you can't fit the entire thing in memory.

Of course it works. You don't have to literally load the whole file into memory as an array. That's not how file I/O works! Just fseek() to the desired multiple-of-20 address and read 20 bytes.

> You could try mmap'ing the file

You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. They started on 64-bit machines and never noticed that mmap is limited to a ~256MB window on most 32-bit architectures. Certainly less than the 11GB needed for this problem!

> There's nothing over engineered about a B-Tree.

Yes there is. This guy hand-rolled 400+ lines of low-level data structure manipulation code just for the B-Tree manipulation. How much do you want to bet that he's got zero memory safety issues or other unsafety in that thing?

Would you link this to your web app? Really, a C++ library?

> That gives you answers in microseconds vs milliseconds

I don't want to call him a liar... but now I have to. At the very least, he's made fatal mistakes in his benchmarking.

There is no way that he can do random password lookups from on-disk structures in under 50 microseconds, unless he's got an Intel Optane SSD, and even then I would be highly suspicious.

He's either cached the entire data structure in memory (at which point why bother with such a complex solution), or he's used the same set of passwords over and over for testing (which will falsely show a much lower latency than you would get with real passwords).

So answer me this: How would this solution work on a typical auto-scale cluster of stateless web server VMs? Do you replicate 5-12GB to every VM every time this database changes? Or do you put one copy on a network share? Congratulations, you now have 1ms latency minimum! That fancy microsecond optimisation has just gone out the window...

To be blunt: If your password-testing is the bottleneck to the point that a 50 microsecond latency is needed, then you're running a service that gets 20,000 password resets per second. At this point you're bigger than Office 365 and G Suite... combined, and you likely have much more important scaling concerns.

That's exacly why I publish my pet projects. To learn from others.

You're right. The benchmarks are bad. I benchmarked the best and the worst case in scope of the original file, so I look up for the first and the last hash.

I totally missed that if I look for the same hash over and over again, I'd end up reading the same B-tree files parts, so they can be easily cached. That's probably the reason it seemed so fast.

No lies here, I just missed this (:

I'll rewrite benchmarks and update the results.

> You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did. They started on 64-bit machines and never noticed that mmap is limited to a ~256MB window on most 32-bit architectures.

Do you have a reference for that? I've never heard of this and I can't imagine the reason for such a limitation.

Having done a lot with mmap on 32bit systems I can't remember such a technical limitation either..

..although in practive it might be hard to find 256MB of contiguous memory in a 4GB virtual memory space due to fragmentation of other allocations and shared libraries, especially with ASLR.

https://stackoverflow.com/questions/5518084/memorymappedfile...

There's the hard limit of 2GB for most versions of 32-bit Windows and 4GB for any operating system.

Couple that with the requirement for a contiguous address space as well as various page table entry (PTE) limits, you get all sorts of "soft" limits way before 2GB. From what I've heard, 256MB is relatively safe to map, but anything much larger than that is increasingly likely to fail.

Correctly written code should be able to work with moveable "windows" into the file as small as 32MB to be properly robust, especially if the process memory is already fragmented.

Lots of software crashes with large files on 32-bit machines because of this. E.g.: https://www.monetdb.org/pipermail/users-list/2009-January/00...

As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust.

Even on 64-bit platforms you can run into trouble. For example: https://jira.mongodb.org/browse/SERVER-15070

In that example, Windows Server 2008 R2 has an 8 TB limit. You could hit that if using a tool like ripgrep to do "forensic analysis" of disk images from a SAN, where virtual disks typically have 16 TB limit. So if you mount a SAN snapshot and open the disk as a file to scan it, you will hit this limit!

Programmers make all sorts of invalid assumptions...

ripgrep doesn't require memory maps, and if they fail to open, it will fall back to a more traditional buffering strategy: https://github.com/BurntSushi/ripgrep/blob/50d2047ae2c0ce2ed...

ripgrep has always had a fast traditional buffering strategy using `read` calls for searching, because I knew that mmap couldn't be used in every case.

Anyway, this has been fixed for a couple years at this point, so if you're still experiencing a problem, then please file a new bug report.

> As a more recent example, ripgrep had issues on 32-bit platforms because of a bug in the way the underlying mmap library worked in Rust.

This is false. The bug you're thinking about is probably https://github.com/BurntSushi/ripgrep/issues/922, which was not caused by an underlying bug in memmap. memmap did have an underlying bug with respect to file offsets, but ripgrep did not use the file offset API. The bug was caused in ripgrep itself, since I made the classic mistake of trying to predict whether an mmap call would fail instead of just trying mmap itself. That bug was fixed on master before the Windows bug was even reported: https://github.com/BurntSushi/ripgrep/commit/93943793c314e05...

> You'd only try that if you haven't read the documentation for mmap, just like a bunch of Rust programmers did.

This isn't exclusive to Rust programmers. C tools make the same mistake all the time. Because memory maps aren't just problematic with large files on 32-bit systems, but they also don't work with virtual files on Linux. Try, for example, `ag MHz /proc/cpuinfo` and see what you get. Crazy how, you know, sometimes humans make mistakes even if they are a C programmer!

And the implication that I (or the author of memmap) never read the docs for `mmap` is just absurd.

If you're going to be snooty about stuff like this, then at least get the story correct. Or better yet, don't be snooty at all.

We've spoken before about this issue and at the time ripgrep was just erroring on large files on 32-bit platforms, it didn't fall back. You were using the Rust crate "mmap" at the time, you removed it temporarily as a fix, and now you're using the much improved "memmap" crate. Good stuff! I do use your tool occasionally, and it's useful, albeit the CPU fan noises annoy my co-workers.

The specific issue making the "mmap" crate incorrect was that it used a "usize" instead of "u64" for some of the functions, limiting it to 4GB files on 32-bit platforms. I believe it's this line of code: https://github.com/rbranson/rust-mmap/blob/f973ae1969b4b7e80...

Now, I'm not a mindreader, but to me this feels an awful lot like its author made a tacit assumption that mmap() is a "memory operation" that is tied to the architecture's pointer size. In similar conversations, heck, in this very discussion people were incredulous that a file can be bigger than memory and be processed.

I absolutely believe that people do not read much past the function declarations, and it might be a "snooty attitude" but experience unfortunately has shown it to be an accurate attitude.

I'm also not accusing you of incorrectly using mmap(), buuuuut... having a quick flip through your current code I see that you still have the attitude that "mmap() takes a filename and makes into a slice that the kernel magically reads in for me on demand".

This is just not true, not even on 64-bit platforms. On smaller devices with only 2-4GB of memory, it's entirely possible to simply run out of page table entries (PTEs). It's possible the memory space simply gets too fragmented. It's possible the kernel has other limits for processes. It's possible the that file is some virtual device with an enormous reported size. Etc, etc, etc...

The correct usage of mmap() is to use moderately-sized sliding windows of, say, 128MB at a time or whatever.

But, having said that: Your code is now correct in the sense that it won't crash, it won't have unsafety, it'll run on 32-bit just fine, and will probably work for all practical scenarios that people want to use a grep tool for. I also know that you have specific optimisations for "the whole file fits in a byte slice", so there's benefits to using the simple approach instead of a sliding window.

However, if this was a database engine that required mmap() to work, it would be absolutely incorrect. But it isn't a database engine, so no big deal...

> never noticed that mmap is limited to a ~256MB window on most 32-bit architectures

That's interesting, but processing more than 4gb of data on a 32bit systems seems pretty niche, these days? Where do you find them outside of industrial embedded applications? I think even my long discarded smart watch ran 64bit.

Now, vfat filesystems will bite you (Esp for removable media), but that's also fixable with some other fs, like xfat or zfs.

> If you really, really want a B-Tree (why?)

Cheap writes, in an OLTP sense?

I had a similar problem recently: discovering "everything" in a DHT (= asking each node for everything it has), and feeding the resulting documents into a message queue for processing, without wasting resources processing each object's thousands of duplicate copies found distributed in the DHT.

These objects had no natural key, so I had to use a content hash for deduplication. And there was no way to time-bound the deduplication such that I could bucket objects into generations and only deduplicate within a generation. I needed a big fat presence-set of all historical hashes; and I needed to add new hashes to it every time I found a new unique object.

B-Trees have a good balance of read- and write-performance, which is why they're used for database indices and (usually) as the lower-level persistence strategy of key-value databases.

> PPS: A pet peeve of mine is older Java database "enterprise" applications that use UCS-2 "nvarchar" text columns in databases to store GUID primary keys.

Endorsing this point and boosting it: DBMSes really haven't thought out how to efficiently store large identifiers.

For example, a DBMS could suggest that clients feed it UUIDv1s, and then break them down into separate hidden {mac:48, timestamp:60, clockseq:14} columns, enabling per-column compression. (In most installations I've encountered, there's only one client generating UUIDs for the DBMS anyway, so these would compress really well, in fact enabling a default packed format where the timestamp + 5 bits of clockseq are kept in a 1-bit-tagged uint64; and the full value-size is only needed for exceptions.)

> Cheap writes, in an OLTP sense?

Sure, but HIBP is most certainly not an OLTP workload. It is updated infrequently as a batch process. The official mirror was last modified in July 2019!

Whenever a "new set" of millions of passwords are leaked, the HIBP maintainer(s) merge it with their existing data set of millions of passwords.

The "update" process is to download the new data set... and that's it. It's already sorted.

The only step I'm suggesting is to simply convert the pre-sorted HIBP SHA1 text file to a flat binary file. This takes a constant time and requires only a tiny buffer in memory.

> DBMSes really haven't thought out how to efficiently store large identifiers.

This 1000x.

I really wish SQL Server had a "hash key" type where the hash is SHA512. You could then use anything as a primary key or an index, irrespective of size.

I don't think that expecting that an end user has installed a whole database toolchain is a good solution.

The purpose of the tool and the library is to be a complete solution for this one thing. The user can grab the binary and just use it, without installing/using any database under the hood.

Not sure if you've ever tried to load such a large dataset into any rdbms, but it takes hell of a long time to create the index.
If you stripe a few NVME SSDs together, it'll help quite a bit.
Yet you pay the cost in speed when using a database engine, there is a big overhead.
Meh.

This query is just a key lookup, so the response time is likely to be well under 10ms even if you're not trying very hard to be efficient.

If using a stored procedure or a prepared query with persistent connections it'll be most likely be under 1ms if the data is stored on SSDs or in-memory. In this binary format, you need about 6 GB.

For some scenarios such as small cloud web servers, fast storage or lots of ram may not be viable. So if you're storing this on mechanical drives, the random nature of hashes means that for practically all queries the db engine will be walking the B-Tree pages and then the random I/O seek latency will dominate.

Some quick back-of-the-envelope maths: You can fit roughly 300 hashes into a typical 8KB page, as used by MS SQL Server as a random example. That means it'll build a 4-level B-Tree index for the HIBP database.

Unless you have less than 1 GB of memory, the first 2 index levels will become cached, leaving 2 random seeks for each lookup. At a typical 3-5ms, this is about 6-10ms of disk I/O latency, which will likely dwarf all other overheads.

Now keep in mind that the OP was trying to optimise a workflow that took 33 seconds originally!

Using a proper database has a ton of other benefits. For example, it becomes trivial to fit into modern asynchronous web application programming frameworks as yet another async call. Literally 1 line of code using Dapper or something along those lines.