Hacker News new | ask | show | jobs
by lalaland1125 2307 days ago
You can actually do much better than binary search due to the uniform distribution of hashes.

https://en.wikipedia.org/wiki/Interpolation_search for example can achieve O(log log n) performance under the uniform distribution assumption which is order of magnitudes faster for this scale of data. Another trick is to start with interpoluation search and then switch to binary search once the sample size gets small enough.

2 comments

Much better?

I'm not sure about that, but this seems like a fun problem for exploration. Also, 49µsec sounds like a pretty easy target to beat.

I used the following in q to build a data set I could play with:

    \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
    \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
    `:hibp 1: `s#0N 20#read1 `:hibp.input
This took about an hour to download, an hour to 7z|cut|xxd, and about 40 minutes to bake. At complete, I have an on-disk artefact in kdb's native format.

    q)hibp:get`:hibp; / this mmaps the artefact almost instantly
    q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
    5
Now that's 1000 runs taking sum 5msec, or 5µsec average lookup time! It's entirely possible my MacBook Air is substantially faster than the authors' machine, but I'm also doing a lot of other shit while this is going on so whilst I believe a better benchmark is possible, I suspect even better results under better conditions, not worse.

So, if binary search is fast enough, how much faster should an Interpolation search be? My understanding is that an Interpolation search will make a better initial guess than a naive binary search because it can start "closer" to the correct value but it'd only work at all if the input had an extremely even distribution so it can guess that initial starting point to reduce the search space, so let's check that first:

    q)count each group hibp[;0]
    00| 2171182
    01| 2171242
    02| 2170869
    03| 2168638
    04| 2171675
    05| 2171500
    06| 2169285
    07| 2171129
    08| 2171463
    09| 2173704
    0a| 2173702
    0b| 2169950
    0c| 2169562
    0d| 2172129
    0e| 2171763
    0f| 2173154
    10| 2170242
    11| 2168806
    12| 2172306
    13| 2171502
    14| 2170208
    15| 2167949
    ..
Ok that looks pretty even to me, so next I partition on the first byte to see if getting 99% closer (1-1/256) on our first guess gets us anything:

    q)t:(0,sums value count each group hibp[;0]) _ hibp
    q)`:t 1: t
    q)t:get`:t / no cheating! back to the disk!
    q)\t:1000 {x~g(g:t[first x]) bin x} .Q.sha1 "1234567890"
    5
And it's still 5µsec for lookup! So I'm finding it difficult to believe there's "much better" in there. Do you have some benchmarks to look at?
If you're going to sort the hashes then might as well make a jump table of the first n bits and a binary search from there.
Why do you even need the jump table? If the hash function is working you should get quite close just by dead reckoning.