
by zavrel 966 days ago

LLMs and RAG have made it so much easier to prototype and incorporate AI in production systems, but evaluation has really become a whole lot harder.

Comparing different outputs from multiple prompt and pipeline variations to a "gold standard" is mostly out of the question.

Can we instead ask a powerful LLM to judge between pairs of answers to a set of questions?

We just open-sourced a simple tool for tournament-style Elo ranking of LLM outputs. By comparing answers from different RAG pipelines and prompts over multiple questions, RAGElo computes a ranking of the different settings, providing a good overview of what works (and what doesn't).
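For anyone unfamiliar with Elo, here is a minimal sketch of the idea in Python. It is not RAGElo's actual code or API: the function names, the K-factor of 32, the starting rating of 1000, and the hard-coded judge verdicts are all assumptions made up for illustration. In practice each verdict would come from asking the judge LLM which of two answers to the same question is better.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(games, k=32.0, initial=1000.0):
    """Rank settings from pairwise verdicts.

    games: iterable of (setting_a, setting_b, score_a), where score_a is
    1.0 if A's answer was preferred, 0.0 if B's was, and 0.5 for a tie.
    k and initial are illustrative defaults, not values from any tool.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in games:
        # Move both ratings by the gap between the verdict and the expectation.
        delta = k * (score_a - expected_score(ratings[a], ratings[b]))
        ratings[a] += delta
        ratings[b] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical verdicts: one tuple per (question, pair of answers) judged.
games = [
    ("pipeline_a", "pipeline_b", 1.0),
    ("pipeline_b", "pipeline_c", 1.0),
    ("pipeline_a", "pipeline_c", 1.0),
    ("pipeline_a", "pipeline_b", 0.5),
]

ranking = elo_rank(games)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

One caveat with this simple version: Elo updates depend on the order in which games are processed, so with few comparisons it can help to shuffle the games and average the ratings over several runs.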

What are your thoughts? Do you think this is a good direction to go in for LLM evaluation?