LLMs and RAG have made it much easier to prototype AI features and ship them in production systems, but evaluation has become much harder.
Comparing different outputs from multiple prompt and pipeline variations to a "gold standard" is mostly out of the question.
Can we instead ask a powerful LLM to judge between pairs of answers to a set of questions?
We just open-sourced RAGElo, a simple tool for tournament-style Elo ranking of LLM outputs. By comparing answers from different RAG pipelines and prompts over multiple questions, RAGElo computes a ranking of the different settings, giving a good overview of what works (and what doesn't).
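To make the idea concrete, here is a minimal sketch of how pairwise verdicts can be turned into an Elo ranking. It is not RAGElo's actual code or API: the function names, the starting rating of 1000, the K-factor of 32, and the hard-coded verdicts (placeholders for what an LLM judge would return) are all assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, a, b, score_a, k=32.0):
    """Update both ratings in place after one comparison.

    score_a is 1.0 if A's answer won, 0.0 if B's won, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta


def rank_pipelines(judgments, initial=1000.0, k=32.0):
    """Rank pipelines from (pipeline_a, pipeline_b, score_a) verdicts.

    In a real setup each verdict would come from an LLM judge comparing
    two answers to the same question; here they are plain tuples.
    """
    ratings = {}
    for a, b, score_a in judgments:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        update_elo(ratings, a, b, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up verdicts for three hypothetical pipelines, illustration only.
    toy_judgments = [("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0)]
    for name, rating in rank_pipelines(toy_judgments):
        print(f"{name}: {rating:.1f}")
```

Note that sequential Elo updates depend on the order in which comparisons are processed, so in practice it can help to shuffle the verdicts and average the ratings over several runs.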
What are your thoughts? Do you think this is a good direction to go in for LLM evaluation?