by PaulHoule 1229 days ago
Would be nice to see some indication of how well it works in his case.

I worked on a ‘Semantic Search’ product almost 10 years ago that used a neural network for dimensionality reduction. The scoring function took inputs from both the ‘gist vector’ and the residual word vector. The residual was possible to calculate in that case because the gist vector was derived from the word vector and the transform was reversible.
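To make the gist-plus-residual idea concrete, here is a rough sketch. It is not the product's actual design: a fixed orthonormal linear projection (the hypothetical `BASIS`) stands in for the neural network, and the equal weighting in `score` is an assumption chosen for illustration.

```python
import math

# Hypothetical projection: two orthonormal rows mapping a 4-d word vector
# to a 2-d "gist". A real system would learn this; here it is hard-coded.
BASIS = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]

def project(word_vec):
    """Reduce a word vector to its gist vector."""
    return [sum(b * x for b, x in zip(row, word_vec)) for row in BASIS]

def reconstruct(gist):
    """Map a gist vector back into word-vector space."""
    dim = len(BASIS[0])
    return [sum(g * row[i] for g, row in zip(gist, BASIS)) for i in range(dim)]

def residual(word_vec):
    """The part of the word vector that the gist does not capture."""
    approx = reconstruct(project(word_vec))
    return [x - a for x, a in zip(word_vec, approx)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def score(query_vec, doc_vec, gist_weight=0.5):
    """Blend gist similarity with residual similarity (weights are assumed)."""
    gist_sim = cosine(project(query_vec), project(doc_vec))
    resid_sim = cosine(residual(query_vec), residual(doc_vec))
    return gist_weight * gist_sim + (1.0 - gist_weight) * resid_sim

query = [1.0, 1.0, 0.0, 1.0]          # last dimension plays the "pointy word"
doc_gist_only = [1.0, 1.0, 0.0, 0.0]  # same gist, missing the pointy word
doc_full = [1.0, 1.0, 0.0, 1.0]       # same gist and same pointy word
```

In this toy setup both documents have identical gist vectors, so only the residual term separates the one that shares the rare word from the one that doesn't.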

I’ve seen papers in the literature that come to the same conclusion about what it takes to get good similarity results w/ older models: a significant amount of the meaning in text is in pointy words that might not be included in the gist vector. Maybe you do better with an LLM since the vocabulary is huge.


I'd honestly argue that he might not have even needed OpenAI embeddings—any off-the-shelf Huggingface model would've sufficed.

Because of attention mechanisms, we no longer so heavily depend on the existence of those "pointy words," so generally, Transformers-based semantic search works quite well.

I actually tried this last year, before OpenAI released their cheaper embeddings v2 in December. In my experiments, the OpenAI embeddings were miles ahead of BERT embeddings (or recent variations of the model) for similarity search.
Interesting. Nils Reimers (SBERT guy) wrote on Medium that he found them to perform worse than SOTA models. Though that was, I believe, before December.
Most of the practitioners I see attempting this are running text through an embedding and then using cosine similarity or something similar as a metric.
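For anyone who hasn't seen it spelled out, the embed-then-cosine approach boils down to something like the sketch below. The 3-dimensional vectors are made-up stand-ins for real model output, and `rank_by_similarity` is just an illustrative helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy "embeddings"; real ones would come from an embedding model.
docs = {"a": [1.0, 0.0, 0.0], "b": [0.7, 0.7, 0.0], "c": [0.0, 0.0, 1.0]}
ranking = rank_by_similarity([1.0, 0.1, 0.0], docs)
```

The metric itself is the same whichever model produced the vectors, so any difference in result quality comes from the embeddings.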

Nils has written a lot of papers

https://www.nils-reimers.de/

and I think the Medium post you are talking about is

https://medium.com/@nils_reimers/openai-gpt-3-text-embedding...

and that SBERT is a Siamese network over BERT embeddings

https://arxiv.org/abs/1908.10084

which one would expect to do better than cosine similarity if it was trained correctly. I'd imagine the same Siamese network approach he is using would work better than cosine similarity with GPT-3.

There's also the issue of what similarity means for people. I worked on a search engine for patents where the similarity function we wanted was "Document B describes prior art relevant to Patent Application A". Today I am experimenting with a content-based recommendation system and face the problem that one news event could spawn 10 stories that appear in my RSS feeds. I'd really like a clustering system that groups these together reliably without false positives.
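One conservative way to approach the news grouping, sketched under assumptions: a story joins an existing cluster only if it is similar to every member of it, which trades missed merges for fewer false positives. The 0.9 threshold and the 2-d vectors are invented for illustration, not tuned on anything.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_stories(vectors, threshold=0.9):
    """Greedy grouping of story vectors.

    A story joins the first cluster in which it is at least
    `threshold`-similar to every existing member; otherwise it
    starts a new cluster. Returns lists of indices into `vectors`.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        for members in clusters:
            if all(cosine(vec, vectors[m]) >= threshold for m in members):
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters

# Toy story embeddings: two tight pairs plus one story in between.
stories = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [0.7, 0.7]]
groups = cluster_stories(stories)
```

The in-between story ends up alone rather than being pulled into either group, which is the behavior you'd want if false positives are the main worry.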

I'd imagine a system that is great for one of these tasks might be mediocre for the other. In particular, I am interested in some kind of data to evaluate success at news clustering.
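Given hand-labeled data, one common way to score a clustering is pairwise precision and recall: look at every pair of stories and check whether the system and the labels agree on "same event". A minimal sketch, with made-up story ids and labels:

```python
from itertools import combinations

def pairwise_precision_recall(predicted, gold):
    """Compare two clusterings pair by pair.

    `predicted` and `gold` map each item id to a cluster label and
    must cover the same items. Precision falls when unrelated stories
    are merged; recall falls when related stories are split.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_gold = gold[a] == gold[b]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical labels: three stories about one event, one about another.
gold = {"s1": "quake", "s2": "quake", "s3": "quake", "s4": "election"}
predicted = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
precision, recall = pairwise_precision_recall(predicted, gold)
```

If avoiding false positives matters most, precision is the number to watch, with recall as the cost you pay for it.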

I was thinking RoBERTa 3, Longformer or Big Bird would be a good choice for this, though having any limit on the attention window is a weakness.