
What I've noticed is that it's hard to measure outputs that aren't binary right or wrong, and that's where most human intervention is needed. The biggest examples are chatbots and coding agents: basically any output where you can say "hmm, that's a good response, but there is a better one." Benchmarking those kinds of responses still _feels_ like an unsolved problem.

On top of that, different model+prompt combinations give different results. For example, a prompt could yield a great response from Claude but a mediocre one from Gemini. Models also differ in capabilities (for example, composite function calling doesn't work the same way across all models).
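One way to make the model+prompt interaction visible is to score every combination in a grid instead of testing a prompt against a single model. Here is a minimal sketch, assuming each model is wrapped as a plain callable and that you already have some scoring function. `run_grid`, `best_prompt_per_model`, and the argument shapes are hypothetical names for illustration, not any vendor's API:

```python
from itertools import product
from typing import Callable

# Assumption: a "model" is any callable that maps a prompt string to an output string.
Model = Callable[[str], str]
# Assumption: a scorer maps an output string to a number where higher is better.
Scorer = Callable[[str], float]


def run_grid(
    models: dict[str, Model],
    prompts: dict[str, str],
    score: Scorer,
) -> dict[tuple[str, str], float]:
    """Score every (model name, prompt name) pair."""
    return {
        (model_name, prompt_name): score(models[model_name](prompts[prompt_name]))
        for model_name, prompt_name in product(models, prompts)
    }


def best_prompt_per_model(
    grid: dict[tuple[str, str], float],
) -> dict[str, tuple[str, float]]:
    """For each model, return the prompt name with the highest score."""
    best: dict[str, tuple[str, float]] = {}
    for (model_name, prompt_name), value in grid.items():
        if model_name not in best or value > best[model_name][1]:
            best[model_name] = (prompt_name, value)
    return best
```

The point of keeping the grid keyed by `(model, prompt)` is that a prompt which wins on one model and loses on another shows up as a difference within a column, rather than being averaged away.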

I'm asking because I'm genuinely curious about how teams are solving this today. It _seems_ like there is no gold standard for evals yet, although interest is growing.

How I do evals today is by scoring an output across several dimensions, which vary by use case: relevance, instruction following, clarity, hallucination rate, etc. That eats a lot of time and can never be fully accurate (how do you fully measure something like "clarity"?), and I feel like there's a better way out there.
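For concreteness, the multi-dimension approach described above could look roughly like this. It is only a sketch: the dimension names come from the list above, but the weights, the 0-to-1 scale, and the function names `aggregate` and `per_dimension_means` are assumptions for illustration. How each per-dimension score gets produced (human rater, heuristic, another model) is left out on purpose, since that is the hard part:

```python
from statistics import mean

# Hypothetical weights; tune these per use case.
# Each score is assumed to be in [0, 1] with higher meaning better, so the
# "hallucination" score here means "free of hallucinations", not a raw rate.
WEIGHTS = {
    "relevance": 0.3,
    "instruction_following": 0.3,
    "clarity": 0.2,
    "hallucination": 0.2,
}


def aggregate(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean of per-dimension scores for one output."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def per_dimension_means(runs: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension across many scored outputs."""
    return {dim: mean(run[dim] for run in runs) for dim in WEIGHTS}
```

Looking at `per_dimension_means` next to the single aggregate number helps show which dimension is dragging a model+prompt pair down, though it doesn't fix the underlying problem that a score for something like "clarity" is still subjective.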