by zavrel
966 days ago
LLMs and RAG have made it much easier to prototype AI features and incorporate them into production systems, but evaluation has become a lot harder. Comparing the outputs of multiple prompt and pipeline variations against a "gold standard" is mostly out of the question. Can we instead ask a powerful LLM to judge between pairs of answers to a set of questions? We just open-sourced RAGElo, a simple tool for tournament-style Elo ranking of LLM outputs. By comparing answers from different RAG pipelines and prompts over multiple questions, RAGElo computes a ranking of the different settings, providing a good overview of what works (and what doesn't). What are your thoughts? Do you think this is a good direction to go in for LLM evaluation?
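
To make the tournament idea concrete, here is a minimal sketch of Elo updates over pairwise verdicts. It is not RAGElo's actual code or API: the function names, the pipeline labels, the starting rating of 1000, and the K-factor of 32 are all assumptions for illustration, and the verdicts are hard-coded where a real setup would ask an LLM judge which of two answers is better.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, a: str, b: str, score_a: float, k: float = 32.0) -> None:
    """Update both ratings in place after one comparison.

    score_a is 1.0 if A's answer won, 0.0 if B's won, 0.5 for a tie.
    """
    exp_a = expected_score(ratings[a], ratings[b])
    delta = k * (score_a - exp_a)
    ratings[a] += delta
    ratings[b] -= delta  # zero-sum: B loses exactly what A gains


def rank_pipelines(games, initial: float = 1000.0, k: float = 32.0):
    """Play every (a, b, score_a) game in order and return pipelines sorted by rating."""
    ratings = {}
    for a, b, score_a in games:
        ratings.setdefault(a, initial)
        ratings.setdefault(b, initial)
        update_elo(ratings, a, b, score_a, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical judge verdicts, one per (question, pair of pipelines).
games = [
    ("pipeline_a", "pipeline_b", 1.0),  # judge preferred A's answer
    ("pipeline_b", "pipeline_c", 1.0),  # judge preferred B's answer
    ("pipeline_a", "pipeline_c", 1.0),  # judge preferred A's answer
]

ranking = rank_pipelines(games)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Note that sequential Elo depends on the order in which games are played, so with few comparisons it can help to shuffle the games and average over several runs.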