Can ELO tournaments be used to evaluate LLMs and RAG? (github.com)
9 points by zavrel 967 days ago
1 comments

LLMs and RAG have made it much easier to prototype AI features and ship them in production systems, but evaluating them has become much harder.

Comparing the outputs of every prompt and pipeline variation against a "gold standard" is mostly out of the question.

Can we instead ask a powerful LLM to judge between pairs of answers to a set of questions?

We just open-sourced a simple tool for tournament-style Elo ranking of LLM outputs. RAGElo compares answers from different RAG pipelines and prompts over multiple questions and computes a ranking of the different settings, giving a good overview of what works (and what doesn't).
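
To make the idea concrete, here is a minimal sketch of how pairwise judgments could be turned into an Elo ranking. It is not RAGElo's actual code or API: the function names, the pipeline names, the K-factor of 32, the starting rating of 1000, and the hard-coded judge verdicts are all assumptions made for illustration. In practice the verdicts would come from the judging LLM.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation: probability-like score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(games, k=32.0, initial=1000.0):
    """Rank competitors from pairwise results.

    games: iterable of (a, b, score_a), where score_a is 1.0 if the judge
    preferred a, 0.0 if it preferred b, and 0.5 for a tie.
    k and initial are illustrative defaults, not values taken from RAGElo.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in games:
        exp_a = expected_score(ratings[a], ratings[b])
        delta = k * (score_a - exp_a)
        # Zero-sum update: whatever a gains, b loses.
        ratings[a] += delta
        ratings[b] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical judge verdicts for three made-up pipeline settings.
games = [
    ("pipeline_a", "pipeline_b", 1.0),  # judge preferred a over b
    ("pipeline_a", "pipeline_c", 0.5),  # judge called it a tie
    ("pipeline_b", "pipeline_c", 0.0),  # judge preferred c over b
]

ranking = elo_rank(games)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

One caveat with this kind of sketch: sequential Elo updates depend on the order of the games, so with few comparisons it can help to shuffle the games and average the ratings over several runs.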

What are your thoughts? Do you think this is a good direction to go in for LLM evaluation?