| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by stevetjoa 4339 days ago

Shameless self-promotion: in my Stack Overflow answer [1], I reference good introductory LSH papers [2-5].

In short, LSH is an algorithm that hashes points that are nearby in a feature space into the same bin with high probability. Contrast that with cryptographically secure hashes where the tiniest change in the input is designed to yield a completely different hash. The point is that, in domains like multimedia, you want to tolerate some distortions to your signal, e.g. microphone noise, blur, etc. These minor distortions shouldn't affect your characterization of the data, e.g. "is this a guitar", "is this a cat", etc.

The advantages are that it's simple to implement, and it has mathematically provable probability bounds and query complexity.

[1] http://stackoverflow.com/questions/5751114/nearest-neighbors...

[2] http://www.cs.princeton.edu/courses/archive/spr05/cos598E/bi...

[3] http://www.vldb.org/conf/1998/p194.pdf

[4] http://www.vldb.org/conf/1999/P49.pdf

[5] http://web.iitd.ac.in/~sumeet/Slaney2008-LSHTutorial.pdf