Hacker News new | ask | show | jobs
by lvh 2307 days ago
You can do better than binary search here with interpolation search since the data is equidistributed. (You can do it for any data with a known distribution, but it's unlikely to be worth it for any data set that isn't linearly distributed.)

Roughly, instead of picking at 1/2 every time, you pick b/256 in, where b is the current byte value. That gets you log log n search. (You can of course pick 2 bytes, or 3, or...)

1 comments

Yes, and hence a flat array data file would allow notably better performance than this B-Tree solution.

The fastest method would be to treating the 20-byte hash as a floating point number in the range 0.0 and 1.0. This would let you use Newton's iteration method to very accurately guess the approximate location in the file and then narrow it down from there. By reading small blocks of 0.5-4KB you could probably get to the required hash in just 1-2 reads, which is either optimal or very close to it.

Do clever things to simple data!

As I said in the article, I rejected binary search and similar approaches on purpose.

In general, I rejected all approaches that operates on a sorted file. That's simply because keeping the file sorted is slow, when you need to insert more data in the future. With B-tree approach, you have fast searching, as well as, inserting. That's the reason.

The HIBP database is updated as infrequently as once a year.

Sorting a handful of gigabytes was a solved problem over 20 years ago. Sorting a file using an "online" operation like row-by-row insertion in a B-Tree is necessarily slower than any offline sort. Merge sort is particularly efficient with data that won't fit on disk.

If you do need "online" operation, a real database engine provides this (and more) for practically zero effort. It's almost as if... they were designed for this purpose!

There is no restriction on using okon only for HIBP passwords. Actually, the goal is to handle hashes, no matter where they come from. User may have reasons to update db frequently.

Yes, you're right that sorting data using B-tree is slower than other methods. But I don't use it only for sorting. It's not like 'create B-tree and forget about it'. You create the tree from initial data, and if you need to insert any more hashes there, you can do it really quickly, without a need to sort everything again.

I agree again. If I'd create okon only for my home usage, I'd probably use a real database. But I didn't want to depend on anything. I wanted to create a library that can be used by any program and just works, without forcing user to install anything else. That's why I sort the data on my own, and that's why I implemented B-tree on my own. Everything is in okon's codebase.

How do you use Newton’s method here to go faster than what I described? How do you treat a 20 byte hash as a floating point number? Unless you mean by dividing part of it at a time and interpolating linearly, which sounds like what I suggested :-)
Take the first 8 bytes, treat it as a 64-bit long integer, and then convert to a 64-bit floating point number and divide by 2^64 to get a number in the range 0.0-1.0.

Multiply this by the number 'n' hashes in your file, and it'll very closely approximate its ordinal in the list, because as you said hashes are equidistributed very well.

Now multiply this ordinal by 20 and load a small slice of the file a few KB to either side of where you're aiming so that there's essentially a 100% chance of finding the hash in one "random read" operation.

If not, you now have a block of hashes that is close, but not quite what you want. You can use the floating point conversion trick to see "how far away" you are, calculate a more accurate estimate and do a second read. It's very unlikely you'd need a third read step.