| The output of an LLM is often qualitative, not quantitative, and to test it you need something that can judge quality. You're not debating philosophy with the LLM; you're just asking it whether the answer semantically matches the expected one. I usually test LLM output quality with the following prompt (simplified): "An AI assistant was tasked with {task}. The relevant information for their task was {context}. Their answer is {answer}. The correct answer should be something like {ground truth}. Is their answer correct?" Then you can spice it up with chain of thought, asking it to judge along preferred criteria/dimensions and output a score, etc. You can go as wild as you'd like, but even this simple approach tends to work really well. > turtles all the way down. Saying "LLM testing LLM" is bad is like saying "computer testing computer" is bad. Yet automated tests have value. And just as unit tests will not prove your program is bug-free, LLM evals won't guarantee 100% correctness. But they're an incredibly useful tool. In my experience working on pretty complex multi-agent, multi-step systems, trying to get those to work without an eval framework in place is like playing whack-a-mole, only way less fun. |
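For anyone who wants to try it, the simplified judge prompt above could be wired up roughly like this. It's a minimal sketch, not a real framework: `build_judge_prompt`, `parse_verdict`, `judge`, and `fake_llm` are hypothetical names, `call_llm` stands in for whatever client you actually use, and the trailing "reply YES or NO" instruction is my own assumption added so the verdict is easy to parse.

```python
from typing import Callable

# The simplified judge prompt from the comment, with {ground truth} renamed
# to {ground_truth} so it is a valid Python format field.
# The final "Reply with exactly one word" sentence is an added assumption.
JUDGE_TEMPLATE = (
    "An AI assistant was tasked with {task}. "
    "The relevant information for their task was {context}. "
    "Their answer is {answer}. "
    "The correct answer should be something like {ground_truth}. "
    "Is their answer correct? Reply with exactly one word: YES or NO."
)


def build_judge_prompt(task: str, context: str, answer: str, ground_truth: str) -> str:
    """Fill the judge template with one eval case."""
    return JUDGE_TEMPLATE.format(
        task=task, context=context, answer=answer, ground_truth=ground_truth
    )


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to True/False; fail loudly on anything else."""
    words = reply.strip().upper().split()
    first = words[0].strip(".,!:") if words else ""
    if first not in {"YES", "NO"}:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return first == "YES"


def judge(
    call_llm: Callable[[str], str],
    task: str,
    context: str,
    answer: str,
    ground_truth: str,
) -> bool:
    """Ask the judge model whether `answer` matches `ground_truth`.

    `call_llm` is a hypothetical hook: any function that takes a prompt
    string and returns the model's text reply.
    """
    prompt = build_judge_prompt(task, context, answer, ground_truth)
    return parse_verdict(call_llm(prompt))


def fake_llm(prompt: str) -> str:
    """Stand-in judge for local testing only; it is not a real model.

    It just checks whether the candidate answer mentions "Paris".
    """
    candidate = prompt.split("Their answer is")[1].split("The correct answer")[0]
    return "YES" if "Paris" in candidate else "NO"
```

In a test suite you'd swap `fake_llm` for a real model call and assert on `judge(...)` per eval case. Raising on an unparseable reply is a deliberate choice here: a judge that silently defaults to "correct" hides exactly the failures you're trying to catch.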