Hacker News new | ask | show | jobs
by senko 560 days ago
The output of an LLM is often qualitative, not quantitative, and to test that, you need something that can judge the quality.

You're not debating philosophy with the LLM, you're just asking it if the answer matches (semantically) to the expected one.

I usually test LLM output quality with the following prompt (simplified):

"An AI assistant was tasked with {task}. The relevant information for their task was {context}. Their answer is {answer}. The correct answer should be something like {ground truth}. Is their answer correct?"

Then you can spice it up with chain of thought, asking it to judge alongside preferred criteria/dimensions and output a score, etc... you can go as wild as you'd like. But even this simple approach tends to work really well.

> turtles all the way down.

Saying "LLM testing LLM" is bad is like saying "computer testing computer" is bad. Yet, automated tests have value. And just as the unit tests will not prove your program is bug free, LLM evals won't guarantee 100% correctness. But they're incredibly useful tool.

In my experience working on pretty complex multi-agent multi-step systems, trying to get those to work without an eval framework in place is like playing whack-a-mole, only way less fun.

2 comments

Too late to edit, but here's a great, really in-depth post about using LLMs as judges to evaluate LLM outputs (when you don't have the ground truth for everything): https://cameronrwolfe.substack.com/p/finetuned-judge This is about finetuning LLMs to do it, but the first part is a good intro to why and how.
> "An AI assistant was tasked with {task}. The relevant information for their task was {context}. Their answer is {answer}. The correct answer should be something like {ground truth}. Is their answer correct?"

If you have a ground truth, what was the purpose of asking the AI assistant for an answer in the first place?

When you're writing a test, you usually know the correct answer for that specific combination of input parameters.
Looking back at it, I must have been very tired when I wrote that!

Or maybe I was thinking about cases where the ground truth is difficult to establish.

The cat is dead. The cat is no longer alive. These are equivalent enough, usually, but fails string comparison.
Or like of you did ai voice calling the goal was XYZ did the conversation get, in so many words, to XYZ?