Hacker News new | ask | show | jobs
by jonstewart 2307 days ago
In digital forensics we often have to do a hash set lookup as in this article. When the set is constant, you can sort it as the author did, and then perform a linear scan to determine the maximum error—i.e., how far away a hash value is from its expected location—and then use a reduced binary search/interpolation search, where the expected index is used as the midpoint and the maximum error is used to determine the window. On large hash sets of this size, the maximum error is still often measured in KB.

It’s probably not the fastest possible algorithm (though likely faster than what the author obtained), but it plays much better with memory than naive binary search and the storage format doesn’t have any overhead.

1 comments

I'm trying to implement this myself. How is the expected location computed? Is it just hash_as_int / max_sha1_hash * file_size?