by visarga 1027 days ago

Did you consider pre-processing each chunk separately to generate useful information (summary, title, topics) that would enrich the embeddings and aid retrieval? Embeddings mostly capture surface form: "third letter of second word" won't match the embedding for the letter "t". Information has surface and depth. We get depth through chain-of-thought, but that requires first digesting the raw text with an LLM.
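A minimal sketch of that pre-processing idea, assuming some `llm` callable that maps a prompt string to a completion string. The names `EnrichedChunk`, `enrich`, and the prompt wording are hypothetical and not taken from any particular library; the point is only that the derived fields are embedded together with the raw text.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EnrichedChunk:
    """A raw chunk plus LLM-derived fields (hypothetical structure)."""
    text: str
    title: str
    summary: str
    topics: List[str] = field(default_factory=list)

    def embedding_input(self) -> str:
        # Put the derived fields in front of the raw text so the
        # embedding model sees both the "depth" and the surface form.
        return "\n".join([
            f"Title: {self.title}",
            f"Summary: {self.summary}",
            f"Topics: {', '.join(self.topics)}",
            self.text,
        ])


def enrich(text: str, llm: Callable[[str], str]) -> EnrichedChunk:
    """Ask the (assumed) LLM callable for a title, summary, and topics."""
    title = llm(f"Give a short title for:\n{text}").strip()
    summary = llm(f"Summarize in one sentence:\n{text}").strip()
    topics_raw = llm(f"List up to 5 topics, comma-separated:\n{text}")
    topics = [t.strip() for t in topics_raw.split(",") if t.strip()]
    return EnrichedChunk(text=text, title=title, summary=summary, topics=topics)
```

The string returned by `embedding_input()` is what would be passed to whatever embedding model the pipeline already uses; the raw `text` can still be stored separately for display.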

Even LLMs are dumb during training but smart during inference. So to make more useful training examples, we need to first "study" them with a model, making the implicit explicit, before training. This allows training to benefit from inference-stage smarts.

Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.

A RAG system should have an LLM ingestion step for retrieval augmentation, and probably hierarchical summarisation up to a decent level. That adds insight to the system by processing the raw documents into a more useful form.
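One way the hierarchical summarisation could look, as a rough sketch: summarise each chunk, then repeatedly summarise groups of summaries until a single root summary remains. The `summarize` callable and the `fanout` parameter are assumptions for illustration; any LLM call could sit behind `summarize`.

```python
from typing import Callable, List


def hierarchical_summaries(
    chunks: List[str],
    summarize: Callable[[str], str],
    fanout: int = 4,
) -> List[List[str]]:
    """Build summary levels bottom-up.

    levels[0] holds one summary per chunk; each later level summarises
    groups of `fanout` summaries from the level below; the last level
    holds a single root summary.
    """
    if not chunks:
        return []
    if fanout < 2:
        raise ValueError("fanout must be at least 2")

    levels = [[summarize(c) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Join neighbouring summaries into groups of `fanout`.
        grouped = ["\n".join(prev[i:i + fanout]) for i in range(0, len(prev), fanout)]
        levels.append([summarize(g) for g in grouped])
    return levels
```

Every level can be embedded and indexed, so a broad query can match a high-level summary while a specific one matches a chunk-level summary.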

2 comments

ddematheu 1027 days ago

Not at scale. Currently we do some extraction for metadata, but it is pretty simple. LLM-based pre-processing of each chunk like this can be quite expensive, especially with billions of them. Summarizing each document before ingestion could cost thousands of dollars when you have billions.
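To make the cost concern concrete, here is a back-of-the-envelope estimator. The function name and every number in the usage line are hypothetical placeholders, not real pricing; substitute your own chunk counts, token counts, and per-million-token prices.

```python
def estimate_ingestion_cost(
    num_chunks: int,
    input_tokens_per_chunk: int,
    output_tokens_per_chunk: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimated cost of one LLM call per chunk, in the pricing currency."""
    per_chunk = (
        input_tokens_per_chunk * price_in_per_million
        + output_tokens_per_chunk * price_out_per_million
    ) / 1_000_000
    return num_chunks * per_chunk


# Hypothetical inputs: 1M chunks, 500 tokens in, 100 tokens out,
# placeholder prices of 1.0 and 2.0 per million tokens.
cost = estimate_ingestion_cost(1_000_000, 500, 100, 1.0, 2.0)
```

The estimate scales linearly with the number of chunks, which is why per-chunk LLM calls become hard to justify at very large volumes.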

We have been experimenting with semantic chunking (https://www.neum.ai/post/contextually-splitting-documents) and semantic selectors (https://www.neum.ai/post/semantic-selectors-for-structured-d...), but from a scale perspective. For example, if we have 1 million docs that we know are generally similar in format / template, we can skip using an LLM to analyze them one by one and instead help create scripts that extract the right info.
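As an illustration of the "scripts instead of per-document LLM calls" idea, here is a sketch that assumes a made-up invoice-like template. The field names and regular expressions are hypothetical; in practice they would be written once (possibly with LLM help) after inspecting a few sample documents, then run cheaply over the whole corpus.

```python
import re
from typing import Dict, Optional

# Hypothetical template: each document contains lines such as
#   Invoice #A12
#   Date: 2023-10-01
#   Total: $42.50
PATTERNS = {
    "invoice_id": re.compile(r"^Invoice #(\w+)", re.MULTILINE),
    "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE),
    "total": re.compile(r"^Total:\s*\$([\d.]+)", re.MULTILINE),
}


def extract_fields(doc: str) -> Dict[str, Optional[str]]:
    """Pull metadata out of a templated document without calling an LLM."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(doc)
        # None marks a field the template-based script could not find.
        fields[name] = match.group(1) if match else None
    return fields
```

Documents that come back with `None` fields are the ones that deviate from the template; those could be routed to a more expensive LLM-based path.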

We think there are clever approaches like this that can help improve RAG while still being scalable.

dartos 1027 days ago

Do you have any more resources on this topic? I’m currently very interested in scaling and verifying RAG systems.