Hacker News new | ask | show | jobs
by addictedcs 1815 days ago
For binary vectors you can choose a different distance metric (not geometric one, i.e. Jaccard) that can be used to effectively hash similar data points into similar buckets.

Treating your binary vector as a set allows you to use min-hashing as your LSH schema (min-hashing is just a random permutation of the given set). This simple trick makes LSH with min-hashing quite a powerful tool for binary vectors that are extensively used in recommenders systems and other domains.

I've used LSH + Min-Hash for image search (and subsequently for audio fingerprinting). If interested, I've blogged about it here [1].

[1] - https://emysound.com/blog/open-source/2020/06/12/how-audio-f...

1 comments

Agree. Also cosine distance.