by tveita 648 days ago
General text embeddings aren't that great for keyword search. The best embedding match for a paragraph is going to be another paragraph with the same meaning. "Cats are natural carnivores." should be very similar to "Felines eat meat.", and somewhat similar to "Tigers eat meat."

"cat" is not a sentence - the embedding should be close to other contextless single words, and maybe just slightly closer to paragraphs about cats than paragraphs about non-cats.

A simple way to use embeddings for search is to generate text of the same kind you expect your users to input. For example, if you expect users to ask questions in natural language, have a cheap LLM turn paragraphs like "Cats are natural carnivores." into questions like "What do cats eat?" and "What is natural behaviour for cats?", then index the embeddings of those questions.
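A rough sketch of that indexing idea, to make the data flow concrete. Everything named here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate_questions` is a hard-coded stand-in for the cheap LLM call, and the tiger paragraph is made up. The only point is that the index stores embeddings of generated questions, each pointing back to its source paragraph.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return Counter(words)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def generate_questions(paragraph):
    """Hypothetical stand-in for the LLM step; the answers are hard-coded."""
    canned = {
        "Cats are natural carnivores.": [
            "What do cats eat?",
            "What is natural behaviour for cats?",
        ],
        "Tigers live in Asia.": ["Where do tigers live?"],
    }
    return canned.get(paragraph, [])

def build_index(paragraphs):
    """Index the generated questions, not the paragraphs themselves."""
    index = []
    for paragraph in paragraphs:
        for question in generate_questions(paragraph):
            index.append((embed(question), paragraph))
    return index

def search(index, query):
    """Return the paragraph whose generated question best matches the query."""
    query_vec = embed(query)
    _, paragraph = max(index, key=lambda entry: cosine(query_vec, entry[0]))
    return paragraph

paragraphs = ["Cats are natural carnivores.", "Tigers live in Asia."]
index = build_index(paragraphs)
print(search(index, "what do cats eat"))
```

With a real embedding model in place of the toy `embed`, differently worded questions should also land near the generated ones; the bag-of-words version only matches on shared words.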

1 comment

I wonder what kind of sentence should be the best match when using embeddings to search for things similar to the following grammatically correct sentence:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

https://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffal...

Well… I guess the most relevant results would simply be text from that Wikipedia article about this sentence, or from comments on HN and Reddit about this specific sentence, and then various texts about grammar that talk about similar concepts to this.