Hacker News new | ask | show | jobs
by visarga 977 days ago
I was expecting the authors to talk about superficial encoding of information (words and phrases) as opposed to their meaning or implications. For example the text "The general idea behind Q-learning and SARSA" won't embed the same with "reinforcement". Or "Count the letters in this phrase." won't embed the same with "27".

This can cause RAG systems to fail to retrieve, fail to connect fragmented information or to conclude the result of a process. My theory is that information needs to be digested by a LLM and augmented before indexing in a RAG system. Embeddings are just searching at surface level. That is why I thought AutoGPT had difficulties, one of the reasons at least.

Maybe we can have LLMs preprocess the material to expose the deeper layers of meaning. And we need to reindex everything when we discover we are interested in an aspect we didn't explicitly expose for retrieval. Study, then index, and sometimes study again.