|
|
|
|
|
by lalaland1125
2307 days ago
|
|
You can actually do much better than binary search due to the uniform distribution of hashes. https://en.wikipedia.org/wiki/Interpolation_search for example can achieve O(log log n) performance under the uniform distribution assumption which is order of magnitudes faster for this scale of data. Another trick is to start with interpoluation search and then switch to binary search once the sample size gets small enough. |
|
I'm not sure about that, but this seems like a fun problem for exploration. Also, 49µsec sounds like a pretty easy target to beat.
I used the following in q to build a data set I could play with:
This took about an hour to download, an hour to 7z|cut|xxd, and about 40 minutes to bake. At complete, I have an on-disk artefact in kdb's native format. Now that's 1000 runs taking sum 5msec, or 5µsec average lookup time! It's entirely possible my MacBook Air is substantially faster than the authors' machine, but I'm also doing a lot of other shit while this is going on so whilst I believe a better benchmark is possible, I suspect even better results under better conditions, not worse.So, if binary search is fast enough, how much faster should an Interpolation search be? My understanding is that an Interpolation search will make a better initial guess than a naive binary search because it can start "closer" to the correct value but it'd only work at all if the input had an extremely even distribution so it can guess that initial starting point to reduce the search space, so let's check that first:
Ok that looks pretty even to me, so next I partition on the first byte to see if getting 99% closer (1-1/256) on our first guess gets us anything: And it's still 5µsec for lookup! So I'm finding it difficult to believe there's "much better" in there. Do you have some benchmarks to look at?