Hacker News
by visarga 640 days ago
The fundamental problem with both keyword- and embedding-based retrieval is that they only access surface-level features. If your document contains "5+5" and you search for "where is the result 10", you won't find the answer. That is why all texts need to be "digested" with an LLM before indexing, to draw out implicit information and make it explicit. It's also what Anthropic proposes we do to improve RAG.
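
To make the 5+5 example concrete, here is a minimal sketch of the idea using a toy keyword index. Everything in it is an assumption for illustration: `digest` is a hypothetical stand-in for the LLM pass (it only expands simple integer additions), and `build_index` and `search` are made-up helper names, not any real library's API.

```python
import re

def tokenize(text):
    # Lowercased word tokens; "5+5" becomes the tokens "5" and "5".
    return set(re.findall(r"\w+", text.lower()))

def digest(text):
    # Hypothetical stand-in for an LLM "digest" pass: it only turns
    # simple integer additions such as "5+5" into explicit statements.
    notes = []
    for a, b in re.findall(r"(\d+)\s*\+\s*(\d+)", text):
        notes.append(f"{a}+{b} equals {int(a) + int(b)}")
    return notes

def build_index(docs, enrich=False):
    # Toy inverted index: token -> set of document ids.
    index = {}
    for doc_id, text in docs.items():
        indexed_text = text
        if enrich:
            # Index the original text plus the explicit notes drawn from it.
            indexed_text += " " + " ".join(digest(text))
        for tok in tokenize(indexed_text):
            index.setdefault(tok, set()).add(doc_id)
    return index

def search(index, query):
    # Return every document that shares at least one token with the query.
    hits = set()
    for tok in tokenize(query):
        hits |= index.get(tok, set())
    return hits

docs = {
    "d1": "Ledger entry: 5+5 units shipped.",
    "d2": "No numbers in this note.",
}
query = "where is the result 10"

plain = build_index(docs)
enriched = build_index(docs, enrich=True)

print(search(plain, query))     # no document contains the token "10"
print(search(enriched, query))  # d1 now carries the explicit "10"
```

The plain index never sees the token "10", so the query misses d1. The enriched index finds it because the implicit result was written out before indexing. A real pipeline would replace `digest` with an actual model call and would likely store the generated notes alongside the original text rather than in place of it.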

"study your data before indexing it"

1 comment

Makes sense. After retrieval, it seems both would be useful: the exact quote and a summary of its context.