Interesting. Nils Reimers (SBERT guy) wrote on Medium that he found them to perform worse than SOTA models. Though that was, I believe, before December.
Most of the practitioners I see attempting this are running text through an embedding and then using cosine similarity or something similar as a metric.
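For anyone who hasn't seen that baseline spelled out, here is a minimal sketch of the cosine similarity step. The three short vectors are made-up stand-ins for embeddings, not output from any real model, and the function name is just for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors; a real embedding would have hundreds of dimensions.
doc_a = [1.0, 2.0, 2.0]
doc_b = [2.0, 4.0, 4.0]   # same direction as doc_a, twice the length
doc_c = [2.0, -1.0, 0.0]  # orthogonal to doc_a

print(cosine_similarity(doc_a, doc_b))  # same direction, so close to 1
print(cosine_similarity(doc_a, doc_c))  # orthogonal, so close to 0
```

Note that the score only looks at direction, not length, and it knows nothing about which notion of "similar" you actually care about.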
Nils has written a lot of papers
https://www.nils-reimers.de/
and I think the Medium post you are talking about is
https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...
and that SBERT is a Siamese network over BERT embeddings
https://arxiv.org/abs/1908.10084
which one would expect to do better than cosine similarity if it was trained correctly. I'd imagine the same Siamese network approach he is using would work better than cosine similarity with GPT-3.
There's also the issue of what similarity means for people. I worked on a search engine for patents where the similarity function we wanted was "Document B describes prior art relevant to Patent Application A". Today I am experimenting with a content-based recommendation system and face the problem that one news event can spawn 10 stories that appear in my RSS feeds, and I'd really like a clustering system that groups these together reliably without false positives.
I'd imagine a system that is great for one of these tasks might be mediocre for the other. In particular, I am interested in some kind of data to evaluate success at news clustering.
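To make "success at news clustering" concrete, one option (my assumption, not an established benchmark) is pairwise precision and recall against a small hand-labeled set: every pair of stories the system puts in the same cluster counts as a true positive if both stories cover the same event and as a false positive otherwise. The story ids and event labels below are hypothetical.

```python
from itertools import combinations

def pairwise_precision_recall(gold, predicted):
    """Score a clustering pair by pair.

    gold and predicted each map a story id to a cluster label.
    Returns (precision, recall) over all unordered pairs of stories.
    """
    if gold.keys() != predicted.keys():
        raise ValueError("gold and predicted must cover the same stories")
    tp = fp = fn = 0
    for a, b in combinations(sorted(gold), 2):
        same_gold = gold[a] == gold[b]
        same_pred = predicted[a] == predicted[b]
        if same_pred and same_gold:
            tp += 1   # grouped together, and rightly so
        elif same_pred:
            fp += 1   # grouped together, but different events
        elif same_gold:
            fn += 1   # same event, but split apart
    # Convention chosen here: with no pairs to judge, report 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical hand labels: five stories about two events.
gold = {"s1": "quake", "s2": "quake", "s3": "quake",
        "s4": "election", "s5": "election"}
# Hypothetical system output: s3 wrongly grouped with s4.
predicted = {"s1": 0, "s2": 0, "s3": 1, "s4": 1, "s5": 2}

precision, recall = pairwise_precision_recall(gold, predicted)
print(precision, recall)  # 0.5 0.25
```

If false positives are the main worry, precision is the number to watch; recall tells you how many duplicate stories still slip through ungrouped.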