by dekhn
712 days ago
This is one of those techniques I don't understand when I read it as text, but instantly absorb once I have a working code example and can pass some of my own data through it a few times, inspecting the intermediate steps. I first learned about it from Douglas Eck (https://research.google/people/douglas-eck/), who used it for song clustering at Google. I remember him saying something about hashing and random vectors, and I was puzzled because I figured a less random optimization would work better.

Add in the banding stuff and some simple probability and voilà! A cheap and embarrassingly parallel way to approximate Jaccard similarity across huge datasets.
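
For anyone who also learns better from code, here is a rough sketch of what I mean, assuming the technique is MinHash signatures plus LSH banding. All the names (`make_hashes`, `signature`, `candidate_pairs`, and so on) are my own, and the `(a*x + b) mod p` hash family is just one common choice, not necessarily what anyone used in production. Items are assumed to be non-negative integers (e.g. already-hashed shingles).

```python
import random
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # large prime modulus for the (assumed) hash family


def make_hashes(k, seed=0):
    """Return k random (a, b) pairs defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def signature(items, hashes):
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands, rows):
    """Banding: two sets become a candidate pair if any band of `rows` values matches.

    `signatures` maps a name to a signature of length at least bands * rows.
    """
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def collision_probability(s, bands, rows):
    """The 'simple probability': chance that two sets with Jaccard similarity s
    agree on every row of at least one band."""
    return 1 - (1 - s ** rows) ** bands
```

Each set's signature is computed independently of every other set, which is where the embarrassingly parallel part comes from. The `bands` and `rows` parameters set the similarity threshold at which pairs start showing up as candidates: more rows per band makes matches stricter, more bands makes them more lenient.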