Hacker News new | ask | show | jobs
by mvanaltvorst 1761 days ago
The birthday paradox and the algorithm are not related; the birthday paradox is simply the phenomenon that even though there are several orders of magnitude more hashes than images in ImageNet, it is still likely that collisions exist.

The algorithm sounds like a simple tree search algorithm. Let's consider the naive case: traverse all images, and keep a list of hashes you have already visited. For every extra image, you have to traverse all previous n hashes you have previously computed. Naively doing this check with a for loop would take O(n) time. You have to do this traversion for every image, therefore total time complexity is O(n^2).

Fortunately, there is a faster way to check whether you have found a hash before. Imagine sorting all the previous hashes and storing them in an ordered list. A smarter algorithm would check the middle of the list, and check whether this element is higher or lower than the target hash. When your own hash is higher than the middle hash, you know that if your hash is contained within the list, it is contained in the top half. In a single iteration you have halved the search space. By repeating this over and over you can figure out if your item is contained within this list in just log_2(n) steps. This is called binary search. Some of the details are more intricate (e.g. Red-Black trees [1], where you can skip the whole sorting step) but this is the gist of it.

This all sounds way more complicated than it is in practice. In practice you would simply `include <set>;` and all the tree calculations are done behind the scenes. The algorithm contained within the library is clever, but the program written by the author is probably <10 lines of code.

[1]: https://en.wikipedia.org/wiki/Red–black_tree