Hacker News new | ask | show | jobs
by rajvarkala 179 days ago
Understood.

Take for instance, customer support Agent , that is supposed to resolve tickets. Assuming it resolves around 30% tickets by an objective measure. Do you think that cannot be captured and agreed upon by both sides?

3 comments

Already, today, human customer support agents' performance is measured in ticket resolution, and the Goodhart's Law consequences of that are trivial visible to anyone that's ever tried to get a ticket actually resolved, as opposed to simply marked "resolved" in a ticketing system somewhere…
We just give today's human performance metrics to AI agents.

AI agent developers internally have a metric they are targeting to improve. That itself violates goodhart law.

You get what you measure. The bot might be really bad and customers close the chat and it gets counted as success etc.
The same applies to human agents as well. Humans are incentivised differently ? How?

The same oversight mechanism that applies to humans cannot correct the flaws of AI agents?

Except the meta reason for employing AI for these use cases is to stop employing the humans?
At scale? Programmatically? In a way that actually saves time and doesn't create billing conflict (that always happens to benefit the LLM vendor)?

No I do not.

Interesting. Let's take the case of infra spend on AWS. Amazon says you invoked serverless calls 100k times and you are charged for it. How are you trusting them?
The comparison doesn't quite hold because AWS is a utility; they aren't an arbiter of quality. Amazon charges for a serverless call regardless of whether your code worked or crashed. You pay for the effort (compute), which is verifiable and binary.

Once you shift to billing for outcomes like "resolutions," the vendor switches from a utility provider to the judge and jury of their own performance. At scale, that creates a "fox guarding the henhouse" dynamic. The friction of auditing those outcomes to ensure they aren't just Goodharted metrics eventually offsets the simplicity the model promises. Frankly, I just cannot and will not trust the judgment of tech companies who evangelize their own LLM outputs.

How do you verify AWS charges? By inspecting logs? There goes the arbiter.

I get the binary part. The biggest difference is the subjective component of outcome? However, a tech provider - especially Agent provider - has to bring down the subjective to a quantitative metric when selling. If that cannot be done, I am not sure what we are going to be buying from Agent builders/providers?