by budududuroiu 778 days ago
My issue with RAG systems isn’t hallucinations. Yes, sure, those are important. My issue is recall. Given a petabyte-scale index of chunks, how can I make sure that my RAG system surfaces the “ground truth” I need, and not just “the most similar vector”?

This, I think, is scarier: a healthcare-oriented (or any-industry) RAG retrieving a bad but linguistically very similar answer.
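One way to put a number on the recall worry is recall@k against a small hand-labelled set of queries. This is only a minimal sketch: the chunk ids and the labelled "relevant" set are made up for illustration, and building such a labelled set at petabyte scale is the hard part that this does not solve.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the known-relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant id to compute recall")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical data: ids returned by the retriever (best match first)
# and the ids a human marked as the actual ground truth for the query.
retrieved = ["c17", "c03", "c42", "c08", "c99"]
relevant = {"c42", "c55"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only c42 is in the top 3
```

Note that the two most similar chunks here are not ground truth at all, which is the failure mode described above: high similarity, low recall.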

1 comments

You're correctly identifying an issue that by now I think everyone is facing: realizing that the bottleneck to LLM performance or improvement isn't necessarily quantity, but inevitably quality.

Which is a much harder problem to solve outside a few highly standardized niches/industries.

I think synthetic data generation as a means to guide LLMs over a larger-than-optimal search space is going to be quite interesting.

To me, synthetic data generation makes no sense. Mathematically, your LLM is learning a distribution (let’s say of human knowledge). Let’s assume your LLM models human knowledge perfectly. In that case, what can you achieve? You’re just sampling the same data that your model already mapped perfectly.

However, if your model’s distribution is wrong, you’re basically going to have an even more skewed distribution in models trained on the synthetic data.
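A toy version of that argument, with a Gaussian standing in for the "model": fit it to a finite sample, draw synthetic data from the fit, refit on that, and repeat. This is only a sketch under strong assumptions (one-dimensional data, 20 samples per generation, no fresh real data mixed in), not a claim about how any real LLM pipeline behaves.

```python
import random
import statistics


def simulate_generations(n_samples=20, generations=200, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "true" distribution
    history = [(mu, sigma)]
    for _ in range(generations):
        # Synthetic data sampled from the current model...
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        # ...becomes the only training data for the next model.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append((mu, sigma))
    return history


history = simulate_generations()
for gen in (0, 1, 10, 50, 200):
    mu, sigma = history[gen]
    print(f"gen {gen:3d}: mean={mu:+.4f} std={sigma:.4f}")
```

Each refit carries a small estimation error and nothing pulls later generations back toward the original distribution, so the errors compound instead of averaging out.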

To me, it seems like the architecture is the next place for improvements. If you can’t synthesise the entirety of human knowledge using transformers, there’s an issue there.

The smell that points me in that direction is the fact that up until recently, you could quantise models heavily with little drop in performance, but recent Llama 3 research shows that’s not the case anymore.
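For anyone unfamiliar with what "quantise heavily" means mechanically, here is a rough sketch of symmetric uniform quantisation and the round-trip error it introduces at different bit widths. The weight values are invented, and this only shows per-weight error growing as bits shrink; it says nothing about downstream model quality and does not reproduce the Llama 3 result.

```python
def quantize_roundtrip(weights, bits):
    """Map floats to signed `bits`-bit integer levels and back to floats."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0:
        return list(weights)  # all-zero input: nothing to quantise
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights, purely for illustration.
weights = [0.02, -0.71, 0.33, 0.05, -0.12, 0.98, -0.44, 0.27]

errors = {}
for bits in (8, 4, 2):
    errors[bits] = mean_abs_error(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit mean abs error: {errors[bits]:.4f}")
```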