Hacker News new | ask | show | jobs
by nelsondev 1349 days ago
Seems the author is proposing LSH instead of vectors for doing ANN?

There are benchmarks here, http://ann-benchmarks.com/ , but LSH underperforms the state of the art ANN algorithms like HNSW on recall/throughput.

LSH I believe was state of the art 10ish years ago, but has since been surpassed. Although the caching aspect is really nice.

4 comments

To elaborate on Noe's comment, the article is suggesting the use of LSH where the hashing function is learned by a neural network such that similar vectors correspond to similar hashes via Hamming weight (whilst enforcing some load factor). In effect, a good hash is generated by a neural network. It appears Elastiknn a prioi chooses the hash function? Not sure, not my area of knowledge.

This approach seems feasible tbh. For example, a stock's historical bids/asks probably don't deviate greatly from month to month. That said, the generation of a good hash is dependent on the stock ticker, and a human doesn't have the time to find a good one for every stock at scale.

It is true that HNSW outperforms LSH on recall and throughput, but for some use cases LSH outperforms HNSW. I just deployed this week to prod a new system for short text streaming clustering using LSH. I used algorithms from this crate that I also built https://github.com/serega/gaoya

HNSW index is slow to construct, so it is best suited for search or recommendation engines where you build the index and serve. For workloads where you continuously mutate the index, like streaming clustering/deduplication LSH outperforms HNSW.

LSH is a _technique_, whose performance vastly/mostly depends on the hashing function and on how this function enables neighborhood exploration.

It might not be trendy, but it doesn't mean it can't work as good or better than HNSW. It all depends on the hashing function you come up with.

when combined with minhashing it approximates jaccard similarity, so it seems it would be bounded by that.
10? no, it's more like 20+. lsh was a core piece of the google crawler. it was used for high performance fuzzy deduplication.

see ullman's text: mining massive datasets. it's free on the web.

I think LSH was only introduced in 99 by Indyk et. al. I would say it was a pretty active research area 10 years ago.
right, but massive scale production use in the google crawler to index the entire internet when that was at the bleeding edge was state of the art before the art was even really recognized as an art.

i don't even think they called it ANN. it was high performance, scalable deduplication. (which is, in fact, just fast/scalable lossy clustering)

collaborative filtering was kind of a cute joke at the time. meanwhile they had lsh, in production, actually deduplicating the internet.