| HN Mirror

Most of the practitioners I see attempting this are running text through an embedding and then using cosine similarity or something similar as a metric.

Nils has written a lot of papers

https://www.nils-reimers.de/

and I think the Medium post you are talking about is

https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

and that SBERT is a Siamese network over BERT embeddings

https://arxiv.org/abs/1908.10084

which one would expect to do better than cosine similarity if it was trained correctly. I'd imagine the same Siamese network approach he is using would work better than cosine similarity with GPT-3.

There's also the issue of what similarity means for people. I worked on a search engine for patents where the similarity function we wanted was "Document B describes prior art relevant to Patent Application A". Today I am experimenting with a content based recommendation system and face the problem that one news event could spawn 10 stories that appear in my RSS feeds and I'd really like a clustering system that groups these together reliably without false positives.

I'd imagine a system that is great for one of these tasks might be mediocre for the other, in particular I am interested in some kind of data to evaluate success at news clustering.