Hacker News new | ask | show | jobs
by dekhn 712 days ago
This is one of those techniques I don't understand when I read it as text, but will instantly absorb when I have a working code example and pass some of my own data through it a few times, inspecting the working process.

I first learned about this technique from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and was puzzled because I figured that less random optimiztion would work better.

1 comments

i think the key insight, at least for me, is that if you break things into piles of lots of little pieces and then you come up with n different ways of ordering those piles, similar things are going to have the same piece show up on top across those orderings.

add in the banding stuff and some simple probability and voila! a cheap and embarrassingly parallel way to approximate jaccard similarity across huge datasets.