Hacker News new | ask | show | jobs
by madiator 1266 days ago
Here's an example I can think of. Suppose you have a bunch of text documents, and you know that some documents are similar but not identical (e.g. plagiarized and slightly modified). You want to find out which documents are similar.

You can first run the contents through some sort of embedding model (e.g. the recent OpenAI embedding model [1]), and then apply LSH on those embeddings. The documents that have the same LSH value would have had very similar embeddings, and thus very similar content.

[1] https://beta.openai.com/docs/guides/embeddings