Can Elo tournaments be used to evaluate LLMs and RAG?

LLMs and RAG have made it much easier to prototype AI and incorporate it into production systems, but evaluation has become a lot harder.

Comparing the outputs of many prompt and pipeline variations against a "gold standard" is mostly out of the question.

Can we instead ask a powerful LLM to judge between pairs of answers over a set of questions?

We just open-sourced RAGElo, a simple tool for tournament-style Elo ranking of LLM outputs. By comparing answers from different RAG pipelines and prompts over multiple questions, RAGElo computes a ranking of the different settings, providing a good overview of what works (and what doesn't).
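
To make the idea concrete, here is a minimal sketch of how pairwise verdicts can be turned into a ranking with the standard Elo update. This is not RAGElo's actual code or API. The pipeline names, the hard-coded verdicts (standing in for what an LLM judge would return), the starting rating of 1000, and the K-factor of 32 are all assumptions made for illustration.

```python
from collections import defaultdict

K = 32            # assumed update step size (K-factor)
INITIAL = 1000.0  # assumed starting rating for every pipeline


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_ranking(games, k=K, initial=INITIAL):
    """Rank players from (a, b, score_a) games.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: initial)
    for a, b, score_a in games:
        e_a = expected_score(ratings[a], ratings[b])
        delta = k * (score_a - e_a)
        ratings[a] += delta  # A gains what B loses, so the total is conserved
        ratings[b] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical judge verdicts, one per (question, pair of pipelines).
games = [
    ("pipeline_a", "pipeline_b", 1.0),
    ("pipeline_a", "pipeline_c", 1.0),
    ("pipeline_b", "pipeline_c", 0.5),
    ("pipeline_a", "pipeline_b", 1.0),
]

ranking = elo_ranking(games)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

One thing to keep in mind: Elo updates depend on the order in which games are played, so with few comparisons it can help to shuffle the games and average the ratings over several runs.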

What are your thoughts? Do you think this is a good direction for LLM evaluation?