by ai_slop_hater 27 days ago
What is an LLM evaluator?
2 comments

They should define this, but having read the entire article, I think it’s clear they mean “frameworks for evaluating the output of an agent” rather than what might first come to mind on hearing “LLM evals”.

Their thesis is that even when an eval is too unreliable to judge the correctness of a single agentic action in production, it still lets you choose between two agents by comparing their aggregate scores across a large collection of tasks. Effectively: you can tune your agentic parameters.
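To make that concrete, here's a rough simulation rather than anything taken from the article. The judge accuracy (0.7), the two agents' true success rates (0.70 and 0.50), and all function names are assumptions I picked for illustration. The judge is wrong 30% of the time on any single output, yet the mean over many tasks still ranks the two agents correctly.

```python
import random
from statistics import mean


def noisy_judge(is_correct: bool, rng: random.Random, accuracy: float = 0.7) -> int:
    """Return a 1/0 verdict that matches the truth with probability `accuracy`."""
    truth = 1 if is_correct else 0
    return truth if rng.random() < accuracy else 1 - truth


def mean_judged_score(success_rate: float, n_tasks: int, rng: random.Random) -> float:
    """Run a simulated agent on n_tasks and average the noisy verdicts."""
    verdicts = []
    for _ in range(n_tasks):
        is_correct = rng.random() < success_rate  # did the agent really succeed?
        verdicts.append(noisy_judge(is_correct, rng))
    return mean(verdicts)


rng = random.Random(0)  # fixed seed so the run is repeatable
score_a = mean_judged_score(0.70, 20_000, rng)  # assumed: agent A succeeds 70% of the time
score_b = mean_judged_score(0.50, 20_000, rng)  # assumed: agent B succeeds 50% of the time

print(f"agent A: {score_a:.3f}  agent B: {score_b:.3f}")
print("pick A" if score_a > score_b else "pick B")
```

Note that the judged means are not the true success rates (noise pulls both toward 0.5), so the aggregate is useful for ranking agents, not for reading off absolute accuracy.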

There's nothing new in the idea that taking many samples and averaging can work when a single data point doesn’t. Presumably this is part of a conversation for which we’re lacking context.

Are “frameworks for evaluating the output of an agent” and "LLM evals" different? :) If yes, how?
"LLM evals" is maybe an overused term because it can mean a bunch of things. This article talks about LLM-as-a-judge where an LLM scores another system's outputs.
An LLM evaluator is any function that can score (i.e. "evaluate") your LLM system (e.g. your agent).

For example:

- You write a heuristic (regex, code, etc.) that assigns a score to an output

- You make another LLM score the output from your system (aka "LLM-as-a-judge")

- You have an automated system that can verify the generated outputs (e.g. does generated code compile or pass tests?)
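A minimal sketch of those three kinds of evaluator as plain scoring functions. Everything here is illustrative: the names, the 0-to-1 score range, and the date regex are assumptions, and `ask_model` is a hypothetical callable standing in for whatever LLM client you use (stubbed below so the example runs offline).

```python
import re
from typing import Callable

# Assumed convention: an evaluator maps an output string to a score in [0, 1].
Evaluator = Callable[[str], float]


def regex_evaluator(output: str) -> float:
    """Heuristic: 1.0 if the output contains an ISO-style date, else 0.0."""
    return 1.0 if re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) else 0.0


def make_llm_judge(ask_model: Callable[[str], str]) -> Evaluator:
    """LLM-as-a-judge: wrap a (hypothetical) prompt -> reply callable."""
    def judge(output: str) -> float:
        reply = ask_model(
            "Rate this answer from 0 to 10. Reply with a number only.\n\n" + output
        )
        match = re.search(r"\d+(\.\d+)?", reply)
        if match is None:
            return 0.0  # unparseable reply counts as the lowest score
        return min(float(match.group()), 10.0) / 10.0
    return judge


def compiles_evaluator(output: str) -> float:
    """Verifier: 1.0 if the output parses as Python source, else 0.0."""
    try:
        compile(output, "<generated>", "exec")  # parses only, does not run the code
        return 1.0
    except (SyntaxError, ValueError):
        return 0.0


def fake_model(prompt: str) -> str:
    """Stub in place of a real LLM call; always answers 7."""
    return "7"


evaluators = {
    "regex": regex_evaluator,
    "llm_judge": make_llm_judge(fake_model),
    "compiles": compiles_evaluator,
}

sample_output = "print('released 2024-01-15')"
scores = {name: evaluate(sample_output) for name, evaluate in evaluators.items()}
print(scores)
```

Because all three share one signature, an eval suite can run any mix of them over the same outputs.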

People often talk about "LLM evals" (evaluations), which typically include a set of evaluators, i.e. scoring functions.

We'll make this clearer next time!