| HN Mirror



	by joewferrara 983 days ago
	This is a great article about the technical difficulties of building a RAG system at scale from an engineering perspective. Performance is about speed and compute. A topic that is not addressed is how to evaluate a RAG system where performance is about whether the RAG system is retrieving the correct context and answering questions accurately. A RAG system should be built so that the different parts (retriever, embedder, etc) can easily be taken out and modified to improve the performance of the RAG system at answering questions accurately. Whether a RAG system is answering questions accurately should be assessed during development and then continuously monitored.
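
A minimal sketch of what "easily taken out and modified" could look like, assuming a retriever is just a callable that maps a query to ranked document ids. The corpus, the labeled queries and both toy retrievers are made up for illustration; none of this comes from the article.

```python
from typing import Callable, Dict, List, Set

# Assumption: a retriever is any callable (query, k) -> ranked list of doc ids.
Retriever = Callable[[str, int], List[str]]


def make_keyword_retriever(corpus: Dict[str, str]) -> Retriever:
    """Toy retriever: rank documents by how many words they share with the query."""
    def retrieve(query: str, k: int) -> List[str]:
        q_words = set(query.lower().split())
        ranked = sorted(
            corpus,
            key=lambda doc_id: len(q_words & set(corpus[doc_id].lower().split())),
            reverse=True,
        )
        return ranked[:k]
    return retrieve


def make_first_k_retriever(corpus: Dict[str, str]) -> Retriever:
    """Strawman retriever: ignores the query and returns the first k ids."""
    def retrieve(query: str, k: int) -> List[str]:
        return list(corpus)[:k]
    return retrieve


def hit_rate_at_k(retriever: Retriever, labeled: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for query, relevant in labeled.items()
               if relevant & set(retriever(query, k)))
    return hits / len(labeled)


# Hypothetical corpus and labeled queries, for illustration only.
corpus = {
    "a": "how to reset your password",
    "b": "billing and invoices overview",
    "c": "api rate limits explained",
}
labeled = {"reset password": {"a"}, "rate limits": {"c"}}

# Swapping a component is a one-line change; the evaluation stays the same.
for name, retriever in [("keyword", make_keyword_retriever(corpus)),
                        ("first-k", make_first_k_retriever(corpus))]:
    print(name, hit_rate_at_k(retriever, labeled, k=1))
```

The same `hit_rate_at_k` call can run during development and again on a schedule against production traffic samples, which is one way to get the continuous monitoring described above.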

2 comments

ddematheu 983 days ago

Co-author of the article here.

You are right. Retrieval accuracy is important as well. From an accuracy perspective, any tools you have found useful in helping validate retrieval accuracy?

In our current architecture, every piece of the RAG ingestion pipeline can be modified to improve loading, chunking and embedding.

As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) for testing different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline / transformations, which can then be scaled.


visarga 983 days ago

Did you consider pre-processing each chunk separately to generate useful information - summary, title, topics - that would enrich embeddings and aid retrieval? Embeddings only capture surface form. "Third letter of second word" won't match embedding for letter "t". Info has surface and depth. We get depth through chain-of-thought, but that requires first digesting raw text with an LLM.

Even LLMs are dumb during training but smart during inference. So to make more useful training examples, we need to first "study" them with a model, making the implicit explicit, before training. This allows training to benefit from inference-stage smarts.

Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.

A RAG system should have an ingestion LLM step for retrieval augmentation and probably hierarchical summarisation up to a decent level. It will be adding insight into the system by processing the raw documents into a more useful form.
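
A rough sketch of such an ingestion step, assuming the LLM calls are hidden behind two hypothetical callables, `annotate` and `summarize`. They are stubbed here with plain string functions so the example runs; a real system would call a model instead.

```python
from typing import Callable, Dict, List

# Assumption: both callables wrap an LLM in a real system.
Annotator = Callable[[str], Dict[str, str]]   # chunk -> {"title": ..., "topics": ...}
Summarizer = Callable[[List[str]], str]       # group of texts -> one summary


def enrich_chunk(chunk: str, annotate: Annotator) -> str:
    """Prepend generated metadata so the embedded text carries more than surface form."""
    meta = annotate(chunk)
    header = "\n".join(f"{key}: {value}" for key, value in meta.items())
    return f"{header}\n\n{chunk}"


def hierarchical_summary(chunks: List[str], summarize: Summarizer,
                         fanout: int = 4) -> List[List[str]]:
    """Return levels of summaries: levels[0] is the raw chunks, the last level is the root."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2, otherwise the tree never shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([summarize(prev[i:i + fanout])
                       for i in range(0, len(prev), fanout)])
    return levels


# Stubs standing in for LLM calls, for illustration only.
def stub_annotate(chunk: str) -> Dict[str, str]:
    words = chunk.split()
    return {"title": " ".join(words[:3]), "topics": ", ".join(sorted(set(words))[:2])}


def stub_summarize(group: List[str]) -> str:
    return " / ".join(text.split()[0] for text in group)


chunks = ["alpha one two three", "beta four five", "gamma six",
          "delta seven", "epsilon eight"]
enriched = [enrich_chunk(c, stub_annotate) for c in chunks]
levels = hierarchical_summary(chunks, stub_summarize, fanout=2)
print([len(level) for level in levels])
```

Every level of the tree can be embedded and indexed alongside the enriched chunks, so a query can match either a detail or a higher-level summary.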


ddematheu 983 days ago

Not at scale. Currently we do some metadata extraction, but it is pretty simple. LLM-based pre-processing of each chunk like this can be quite expensive, especially with billions of them. Summarizing each document before ingestion could cost thousands of dollars when you have billions.
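
A back-of-the-envelope sketch of that cost. Every number below (chunk count, token counts, per-token prices) is a placeholder picked for illustration, not a real price list; plug in your own.

```python
def estimate_llm_cost(n_chunks: int,
                      input_tokens_per_chunk: int,
                      output_tokens_per_chunk: int,
                      usd_per_million_input: float,
                      usd_per_million_output: float) -> float:
    """Total USD for one LLM pass over every chunk (linear in n_chunks)."""
    input_cost = n_chunks * input_tokens_per_chunk / 1_000_000 * usd_per_million_input
    output_cost = n_chunks * output_tokens_per_chunk / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Placeholder figures only: 1 billion chunks, 500 tokens in, 100 tokens out.
cost = estimate_llm_cost(
    n_chunks=1_000_000_000,
    input_tokens_per_chunk=500,
    output_tokens_per_chunk=100,
    usd_per_million_input=0.10,   # hypothetical price
    usd_per_million_output=0.40,  # hypothetical price
)
print(f"${cost:,.0f}")
```

Because the total is linear in the number of chunks, the bill is driven almost entirely by the per-token price and by how much text each chunk sends and receives.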

We have been experimenting with semantic chunking (https://www.neum.ai/post/contextually-splitting-documents) and semantic selectors (https://www.neum.ai/post/semantic-selectors-for-structured-d...), but with scale in mind. For example, if we have 1 million docs that we know generally share a format / template, we can skip using an LLM to analyze them one by one and instead help create scripts that extract the right info.

We think there are clever approaches like this that can help improve RAG while still being scalable.
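
As an illustration of the "scripts instead of per-document LLM calls" idea, here is a sketch that extracts fields from documents sharing one template. The invoice-like template and the field names are invented for this example; the patterns would be written once per template (possibly with an LLM's help) and then run cheaply over every doc.

```python
import re
from typing import Dict, Optional

# Hypothetical template: each doc has lines like "Invoice #: ...", "Customer: ...", "Total: ...".
FIELD_PATTERNS = {
    "invoice_id": re.compile(r"^Invoice #:\s*(\S+)", re.MULTILINE),
    "customer": re.compile(r"^Customer:\s*(.+)$", re.MULTILINE),
    "total": re.compile(r"^Total:\s*\$?([\d,]+\.\d{2})", re.MULTILINE),
}


def extract_fields(doc: str) -> Dict[str, Optional[str]]:
    """Pull known fields out of a templated document; missing fields map to None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(doc)
        fields[name] = match.group(1).strip() if match else None
    return fields


sample = "Invoice #: INV-0042\nCustomer: Acme Corp\nTotal: $1,250.00\n"
print(extract_fields(sample))
```

Documents where several fields come back as `None` probably do not follow the template, and those are the ones worth routing to a slower, LLM-based path.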


dartos 983 days ago

Do you have any more resources on this topic? I’m currently very interested in scaling and verifying RAG systems.


janalsncm 983 days ago

> From an accuracy perspective, any tools you have found useful in helping validate retrieval accuracy?

You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.
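
A minimal sketch of those metrics in plain Python, assuming each query comes with a ranked list of retrieved ids and a set of ids judged relevant. The ids below are made up.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR: reciprocal rank averaged over queries."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int) -> float:
    """nDCG@k with graded gains; docs missing from `gains` count as 0."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical result for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant), ndcg_at_k(ranked, {"d1": 1.0, "d2": 1.0}, 3))
```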

Plus, if you’re going to spend $$$ embedding tons of docs, you’ll want to compare against a “dumb” baseline like BM25.
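
For reference, a small self-contained BM25 baseline. This is a sketch, assuming whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and the "+1 inside the log" IDF variant that keeps scores non-negative; the three documents are invented.

```python
import math
from collections import Counter
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase + whitespace split; no stemming or stop words.
    return text.lower().split()


class BM25:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(doc)) for doc in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_lens) / len(docs)
        doc_freq = Counter()
        for tf in self.term_freqs:
            doc_freq.update(tf.keys())  # count each term once per document
        n = len(docs)
        self.idf = {term: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
                    for term, df in doc_freq.items()}

    def score(self, query: str, index: int) -> float:
        tf, length = self.term_freqs[index], self.doc_lens[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def rank(self, query: str, k: int = 10) -> List[Tuple[int, float]]:
        """Return (doc_index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.term_freqs))]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "vector databases store embeddings",
]
bm25 = BM25(docs)
print(bm25.rank("vector embeddings", k=2))
```

Running the same labeled queries through this and through the embedding retriever gives a like-for-like comparison before committing to the embedding spend.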


ac2u 983 days ago

Yeah, especially if you're experimenting with training a matrix and applying it to the embeddings generated by an off-the-shelf model to help it surface subtleties unique to your domain.
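
One simple way to do this, sketched under assumptions: fit a square matrix `W` by gradient descent so that `W` times a query embedding lands close to the embedding of its known-relevant document. The 2-d vectors are toy data, and least squares is just one possible training objective (contrastive losses are another); it is not necessarily what anyone in this thread uses.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mean_squared_error(W: Matrix, queries: List[Vector], targets: List[Vector]) -> float:
    total = 0.0
    for q, d in zip(queries, targets):
        total += sum((p - t) ** 2 for p, t in zip(matvec(W, q), d))
    return total / len(queries)


def fit_linear_map(queries: List[Vector], targets: List[Vector],
                   lr: float = 0.1, epochs: int = 200) -> Matrix:
    """Gradient descent on mean ||W q - d||^2, starting from the identity matrix."""
    dim, n = len(queries[0]), len(queries)
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(dim)]
        for q, d in zip(queries, targets):
            pred = matvec(W, q)
            for i in range(dim):
                err = pred[i] - d[i]
                for j in range(dim):
                    grad[i][j] += 2.0 * err * q[j] / n
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j]
    return W


# Toy data: each target is its query rotated by a fixed angle (cos=0.8, sin=0.6).
queries = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
targets = [[0.8, 0.6], [-0.6, 0.8], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

W = fit_linear_map(queries, targets)
print(mean_squared_error(identity, queries, targets), mean_squared_error(W, queries, targets))
```

Whatever objective is used, the adapted embeddings should be scored with the same retrieval metrics and against the same baseline as the raw ones, on held-out queries, to check that the matrix is actually helping.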
