by d4rkp4ttern 974 days ago
Related: are there any good end-to-end benchmark datasets for RAG? By end-to-end I mean not just (context, question, answer) tuples, which ignore retrieval, but (document, question, answer) tuples. I know NQ (Natural Questions) is one such dataset:

https://ai.google.com/research/NaturalQuestions

But I don't see this dataset mentioned much in RAG discussions.

1 comment

It's true that there are not many datasets for benchmarking RAG. RAG applications are so closely tailored to their specific data and use case that a benchmark dataset is not useful across different RAG applications. The data behind a RAG application could be Slack messages, technical documentation, insurance policies, internal Microsoft Word documents, or a combination of these. Each of these data sources would need a very different benchmark dataset.

We recommend that developers building a RAG application create a benchmark dataset tailored to both the data the application uses and its use case.
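As a rough illustration of what such a tailored end-to-end benchmark could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Example` record, the toy word-overlap retriever, the `toy_answerer` stand-in for an LLM call, the lenient "gold answer appears in the output" metric, and the two made-up documents. The point is only that each record stores the source document (not a pre-selected context), so retrieval is scored along with the answer.

```python
import re
from dataclasses import dataclass


@dataclass
class Example:
    doc_id: str    # id of the document that contains the answer
    question: str
    answer: str    # gold answer string


def tokenize(text):
    # Lowercase and split on non-word characters.
    return set(re.findall(r"\w+", text.lower()))


def overlap_retriever(corpus, question, k=1):
    # Toy retriever (assumption): rank documents by word overlap with the question.
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda d: len(q & tokenize(corpus[d])), reverse=True)
    return ranked[:k]


def toy_answerer(contexts, question):
    # Hypothetical stand-in for an LLM call: just echoes the retrieved text.
    return " ".join(contexts)


def evaluate(corpus, examples, retriever, answerer, k=1):
    # Score retrieval and answering together on (document, question, answer) records.
    hits = correct = 0
    for ex in examples:
        doc_ids = retriever(corpus, ex.question, k)
        hits += ex.doc_id in doc_ids
        output = answerer([corpus[d] for d in doc_ids], ex.question)
        # Lenient metric (assumption): gold answer appears in the output.
        correct += ex.answer.lower() in output.lower()
    n = len(examples)
    return {"retrieval_hit_rate": hits / n, "answer_accuracy": correct / n}


# Made-up corpus and questions, for illustration only.
corpus = {
    "policy": "The deductible for the basic insurance plan is 500 dollars.",
    "docs": "The API rate limit is 100 requests per minute.",
}
examples = [
    Example("policy", "What is the deductible for the basic insurance plan?", "500 dollars"),
    Example("docs", "What is the API rate limit?", "100 requests per minute"),
]

results = evaluate(corpus, examples, overlap_retriever, toy_answerer)
print(results)
```

Because `evaluate` takes the retriever and answerer as arguments, the same records can be reused to compare different retrieval or generation approaches on your own data.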

Well, we could say the same about question answering, information retrieval, and even LLMs, yet there are plenty of benchmarks for these. The point of a benchmark is not that it covers all use cases, but that it is agreed upon as a meaningful way to compare different approaches. I suppose I should dig into RAG papers accepted at ICLR/ICML/NeurIPS and look at their experiments sections.