by madiator 1316 days ago

Here's an example I can think of. Suppose you have a bunch of text documents, and you know that some documents are similar but not identical (e.g. plagiarized and slightly modified). You want to find out which documents are similar.

You can first run the contents through an embedding model (e.g. the recent OpenAI embedding model [1]), and then apply LSH to those embeddings. LSH is probabilistic, but documents that end up with the same LSH value very likely have similar embeddings, and thus similar content.
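A minimal sketch of the bucketing step, assuming random-hyperplane LSH (one common LSH family for cosine similarity, picked here for illustration) and synthetic vectors standing in for real embeddings. The function names `lsh_signature` and `bucket_by_signature` are made up for this example and do not come from any library.

```python
from collections import defaultdict

import numpy as np


def lsh_signature(embeddings, hyperplanes):
    """Return one bit-tuple per embedding: which side of each hyperplane it falls on."""
    bits = (embeddings @ hyperplanes.T) > 0
    return [tuple(row) for row in bits.astype(int).tolist()]


def bucket_by_signature(embeddings, n_bits=8, seed=0):
    """Group row indices of `embeddings` by their LSH signature."""
    rng = np.random.default_rng(seed)
    # Each row is the normal vector of one random hyperplane through the origin.
    hyperplanes = rng.standard_normal((n_bits, embeddings.shape[1]))
    buckets = defaultdict(list)
    for idx, sig in enumerate(lsh_signature(embeddings, hyperplanes)):
        buckets[sig].append(idx)
    return dict(buckets)


# Synthetic stand-ins for document embeddings (assumption: real ones would
# come from an embedding model and have hundreds of dimensions or more).
rng = np.random.default_rng(42)
original = rng.standard_normal(64)
near_copy = original + 0.001 * rng.standard_normal(64)  # "slightly modified"
unrelated = rng.standard_normal(64)

docs = np.vstack([original, near_copy, unrelated])
for signature, members in bucket_by_signature(docs).items():
    print(signature, members)
```

Documents whose indices share a bucket are candidates for being near-duplicates. More bits make buckets stricter; since a near-duplicate can still fall on the other side of one hyperplane, a common refinement is to repeat this with several independent sets of hyperplanes and compare any pair that collides in at least one of them.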

[1] https://beta.openai.com/docs/guides/embeddings