| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by rajvarkala 226 days ago
	Understood. Take for instance, customer support Agent , that is supposed to resolve tickets. Assuming it resolves around 30% tickets by an objective measure. Do you think that cannot be captured and agreed upon by both sides?

3 comments

deathanatos 226 days ago

Already, today, human customer support agents' performance is measured in ticket resolution, and the Goodhart's Law consequences of that are trivial visible to anyone that's ever tried to get a ticket actually resolved, as opposed to simply marked "resolved" in a ticketing system somewhere…

link

rajvarkala 225 days ago

We just give today's human performance metrics to AI agents.

AI agent developers internally have a metric they are targeting to improve. That itself violates goodhart law.

link

wood_spirit 226 days ago

You get what you measure. The bot might be really bad and customers close the chat and it gets counted as success etc.

link

rajvarkala 226 days ago

The same applies to human agents as well. Humans are incentivised differently ? How?

The same oversight mechanism that applies to humans cannot correct the flaws of AI agents?

link

wood_spirit 225 days ago

Except the meta reason for employing AI for these use cases is to stop employing the humans?

link

HelloMcFly 226 days ago

At scale? Programmatically? In a way that actually saves time and doesn't create billing conflict (that always happens to benefit the LLM vendor)?

No I do not.

link

rajvarkala 226 days ago

Interesting. Let's take the case of infra spend on AWS. Amazon says you invoked serverless calls 100k times and you are charged for it. How are you trusting them?

link

HelloMcFly 225 days ago

The comparison doesn't quite hold because AWS is a utility; they aren't an arbiter of quality. Amazon charges for a serverless call regardless of whether your code worked or crashed. You pay for the effort (compute), which is verifiable and binary.

Once you shift to billing for outcomes like "resolutions," the vendor switches from a utility provider to the judge and jury of their own performance. At scale, that creates a "fox guarding the henhouse" dynamic. The friction of auditing those outcomes to ensure they aren't just Goodharted metrics eventually offsets the simplicity the model promises. Frankly, I just cannot and will not trust the judgment of tech companies who evangelize their own LLM outputs.

link

rajvarkala 225 days ago

How do you verify AWS charges? By inspecting logs? There goes the arbiter.

I get the binary part. The biggest difference is the subjective component of outcome? However, a tech provider - especially Agent provider - has to bring down the subjective to a quantitative metric when selling. If that cannot be done, I am not sure what we are going to be buying from Agent builders/providers?

link