| HN Mirror



	by foobar502 1272 days ago
	Read the article and googled a bit - what are more example use cases for LSH beyond those described?

6 comments

tylerneylon 1272 days ago

The use case I see the most in my career is to use LSH to help solve the "ANN" problem = approximate nearest neighbors (typically with ranked results). I've seen ANN used many times for near-duplicate detection and in recommendation systems.

Although I don't have access to the proprietary code used, it's most likely that an LSH algorithm is behind the scenes in every modern search engine (to avoid serving duplicates), many modern ranking systems such as Elasticsearch (because items are typically vectorized and retrieved based on these vectors), and most recommendation systems (for similar reasons as ranking). For example, all of these pages probably have an LSH algorithm at some point (either batch processing before your request, or in some cases real-time lookups):

* Every search result page on Google
* Every product page on Amazon (similar products)
* All music suggestions on Spotify or similar
* Every video recommendation from TikTok, YouTube, or Instagram

etc.


molodec 1272 days ago

Another interesting use case for LSH is caching search results, which Amazon uses: https://www.linkedin.com/feed/update/urn:li:activity:6943348... https://www.amazon.science/blog/more-efficient-caching-for-p...


senderista 1272 days ago

Yes, e.g. many IR systems use cosine similarity to compute query vector/term vector similarity, and simhashing approximates cosine similarity. OTOH, some IR systems instead use a set-theoretic measure, Jaccard similarity, which can be approximated by minhashing.
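
To make the minhash half of that concrete, here is a minimal sketch, not taken from any particular IR system. The helper names (`minhash_signature`, `estimate_jaccard`), the use of seeded BLAKE2b as the hash family, and the signature length of 256 are all assumptions chosen for illustration. The idea is that the fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets.

```python
import hashlib


def _hash(seed, item):
    """Deterministic 64-bit hash of `item` under hash function number `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """For each seeded hash function, keep the minimum hash over the set.

    `items` must be a non-empty collection of hashable, printable values.
    """
    return [min(_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = set(range(0, 100))
    b = set(range(50, 150))
    est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print("exact:", jaccard(a, b), "estimate:", est)
```

Longer signatures give a tighter estimate at the cost of more hashing; in practice the signatures are usually split into bands so that similar sets collide in at least one band.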


madiator 1272 days ago

Here's an example I can think of. Suppose you have a bunch of text documents, and you know that some documents are similar but not identical (e.g. plagiarized and slightly modified). You want to find out which documents are similar.

You can first run the contents through some sort of embedding model (e.g. the recent OpenAI embedding model [1]), and then apply LSH on those embeddings. The documents that have the same LSH value would have had very similar embeddings, and thus very similar content.

[1] https://beta.openai.com/docs/guides/embeddings
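
A rough sketch of the "apply LSH on those embeddings" step, using random-hyperplane hashing (one common LSH family for cosine similarity). Everything here is an assumption for illustration: the function names, the toy 3-dimensional vectors standing in for real embeddings, and the single hash table. It does not call any embedding API.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw `num_bits` random hyperplanes (Gaussian normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = 0
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def bucket_documents(embeddings, hyperplanes):
    """Group document ids by LSH key; documents sharing a bucket are candidates."""
    buckets = defaultdict(list)
    for doc_id, vec in embeddings.items():
        buckets[lsh_key(vec, hyperplanes)].append(doc_id)
    return dict(buckets)


if __name__ == "__main__":
    # Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
    embeddings = {
        "essay": [1.0, 0.2, 0.0],
        "essay_copy": [2.0, 0.4, 0.0],   # same direction as "essay"
        "unrelated": [-1.0, -0.2, 0.0],  # opposite direction
    }
    planes = make_hyperplanes(dim=3, num_bits=8)
    print(bucket_documents(embeddings, planes))
```

Vectors pointing in the same direction always get the same key, but merely similar vectors can still be split by a hyperplane, so real setups typically use several independent tables and compare the candidates from each bucket exactly.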


zamalek 1272 days ago

Collision detection in games. Done naively, this problem is O(n^2) because you have to check every object against every other object.

Instead, you can check only objects that inhabit the same bucket (with caveats: neighboring buckets are usually checked too), which eliminates objects that couldn't possibly collide by virtue of, e.g., being on the other side of the map. Of course, the worst case is still O(n^2) because every object could be in the same bucket (but that's unlikely).
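
The bucket idea can be sketched as a uniform grid (a spatial hash). This is a minimal illustration, not code from any engine: `candidate_pairs`, the point-like objects, and the assumption that no object is larger than one cell are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(positions, cell_size):
    """Return pairs of objects that share a grid cell or sit in adjacent cells.

    positions: dict mapping object name -> (x, y).
    Assumes objects are no larger than `cell_size`, so checking the
    3x3 neighborhood of cells is enough to catch every possible collision.
    """
    grid = defaultdict(list)
    for name, (x, y) in positions.items():
        grid[(int(x // cell_size), int(y // cell_size))].append(name)

    pairs = set()
    for (cx, cy), members in grid.items():
        # Pairs inside the same cell.
        for a, b in combinations(members, 2):
            pairs.add(tuple(sorted((a, b))))
        # Pairs that straddle a boundary with one of the 8 neighboring cells.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if (dx, dy) == (0, 0):
                    continue
                for other in grid.get((cx + dx, cy + dy), ()):
                    for a in members:
                        pairs.add(tuple(sorted((a, other))))
    return pairs


if __name__ == "__main__":
    positions = {"a": (1, 1), "b": (2, 2), "c": (100, 100)}
    print(candidate_pairs(positions, cell_size=10))  # {('a', 'b')}
```

Only the returned candidate pairs then need the exact (and more expensive) collision test; the object on the far side of the map is never compared against the other two.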


nl 1272 days ago

Face matching. You take embeddings created by DeepFace or MediaPipe and put them in an LSH index, and then embeddings of the same face end up close to each other.


magic123_ 1272 days ago

In Stuff Made Here's most recent video, he uses LSH to solve a 4000-piece all-white jigsaw puzzle: https://www.youtube.com/watch?v=WsPHBD5NsS0


ww520 1272 days ago

Geohash is one use case. A latitude/longitude coordinate is converted to a geohash, which has the nice property that nested rectangles share the same hash prefix.
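
For anyone curious how the prefix property falls out, here is a small sketch of the standard geohash encoding: repeatedly halve the longitude and latitude ranges, interleave the resulting bits (longitude first), and emit one base-32 character per 5 bits. The function name `geohash_encode` and its default precision are just choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits = bits << 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


if __name__ == "__main__":
    print(geohash_encode(57.64911, 10.40744, precision=5))  # u4pru
```

Because each extra character only subdivides the rectangle chosen so far, a longer geohash for the same point always starts with the shorter one, which is what makes prefix lookups work as a cheap "nearby" query (points just across a cell boundary being the usual caveat).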
