| Much better? I'm not sure about that, but this seems like a fun problem for exploration. Also, 49µsec sounds like a pretty easy target to beat. I used the following in q to build a data set I could play with: \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
\7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
`:hibp 1: `s#0N 20#read1 `:hibp.input
This took about an hour to download, an hour to 7z|cut|xxd, and about 40 minutes to bake. At complete, I have an on-disk artefact in kdb's native format. q)hibp:get`:hibp; / this mmaps the artefact almost instantly
q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
5
Now that's 1000 runs taking sum 5msec, or 5µsec average lookup time! It's entirely possible my MacBook Air is substantially faster than the authors' machine, but I'm also doing a lot of other shit while this is going on so whilst I believe a better benchmark is possible, I suspect even better results under better conditions, not worse.So, if binary search is fast enough, how much faster should an Interpolation search be? My understanding is that an Interpolation search will make a better initial guess than a naive binary search because it can start "closer" to the correct value but it'd only work at all if the input had an extremely even distribution so it can guess that initial starting point to reduce the search space, so let's check that first: q)count each group hibp[;0]
00| 2171182
01| 2171242
02| 2170869
03| 2168638
04| 2171675
05| 2171500
06| 2169285
07| 2171129
08| 2171463
09| 2173704
0a| 2173702
0b| 2169950
0c| 2169562
0d| 2172129
0e| 2171763
0f| 2173154
10| 2170242
11| 2168806
12| 2172306
13| 2171502
14| 2170208
15| 2167949
..
Ok that looks pretty even to me, so next I partition on the first byte to see if getting 99% closer (1-1/256) on our first guess gets us anything: q)t:(0,sums value count each group hibp[;0]) _ hibp
q)`:t 1: t
q)t:get`:t / no cheating! back to the disk!
q)\t:1000 {x~g(g:t[first x]) bin x} .Q.sha1 "1234567890"
5
And it's still 5µsec for lookup! So I'm finding it difficult to believe there's "much better" in there. Do you have some benchmarks to look at? |