by alberth 180 days ago
So who's the arbiter to determine if the outcome was achieved?

And how do you programmatically measure it?

4 comments

The obvious solution is just to throw more LLMs at it to verify the output of the other LLM and confirm that it is doing its job...

/s (mostly because you know this will be the "solution" that many will just run with, despite the very real issue of how "persuadable" these systems are)...

The real answer is that even that will fail, and there will have to be a feedback loop with a human. In many cases that will likely lead to more churn fixing the AI's work than if the human had just done it in the first place.

The focus should instead be on places where an AI tool can truly cut down on time spent, like searching for something (which can still fail, but the risk from a failure is far lower than when producing output).

Hi alberth,

I'd assume an outcome is a negotiated agreement between the buyer and the agent provider.

Think of all the n8n workflows. If we take a simple example such as an expense receipt processing workflow or a lead sourcing workflow, I'd think the outcomes can be counted pretty well: receipts successfully entered into the ERP, or the number of entries captured in Salesforce.

I am sure there are cases where outcomes are fuzzy, for instance an employer-employee agreement.

But in some cases, for instance, my accounting agent would only get paid if he successfully uploads my tax returns.

Surely this is not applicable in all cases. But in cases where a human is measured on outcomes, the same should be applicable to agents too, I guess.

> But in some cases, for instance, my accounting agent would only get paid if he successfully uploads my tax returns.

I think you'd want it to correctly compute your taxes. Especially if you get a letter a year or two after the fact saying you owe the government money.

Indeed. The whole AI game is predicated on the idea that they can deliver work equivalent to humans in some cases. If that is never going to be the case, then this whole agentic thing goes belly-up.

The alternative scenario is they get better and do some work really well. That is an interesting territory to focus on.

This is the problem with this: in simple cases like “you add N employees” you can vaguely approximate it, like they do in the article.

But for anything beyond this trivial example, the person who knows the value most accurately is … the customer! They are also the person paying the bill, so there’s a strong financial incentive for them not to reveal this info to you.

I don’t think this will work …

I often go back to the customer support voice AI agent example. Let's say the bot can resolve tickets successfully at a certain rate. This is easily capturable. Why is this difficult? What cases am I missing?
That's literally the job of a founder. You talk to customers and learn from them.
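
For what it's worth, the support-ticket case from upthread could be counted roughly like this. It is only a sketch: the `Ticket` fields, the `reopened` flag, and both function names are made up for illustration, and the "not reopened" rule is just one guess at how a buyer and provider might agree to define "resolved" (it borrows from the tax-return objection: finishing the task is not the same as getting it right).

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical fields; a real helpdesk export will look different.
    ticket_id: str
    resolved_by_bot: bool  # bot closed the ticket without a human handoff
    reopened: bool         # customer came back later with the same issue


def billable_resolutions(tickets: list[Ticket]) -> int:
    """Count tickets the bot resolved that were not reopened afterwards."""
    return sum(1 for t in tickets if t.resolved_by_bot and not t.reopened)


def resolution_rate(tickets: list[Ticket]) -> float:
    """Share of all tickets that count as a billable resolution."""
    if not tickets:
        return 0.0
    return billable_resolutions(tickets) / len(tickets)


if __name__ == "__main__":
    sample = [
        Ticket("t1", resolved_by_bot=True, reopened=False),
        Ticket("t2", resolved_by_bot=True, reopened=True),
        Ticket("t3", resolved_by_bot=False, reopened=False),
        Ticket("t4", resolved_by_bot=True, reopened=False),
    ]
    print(billable_resolutions(sample), resolution_rate(sample))
```

Counting is the easy part. The arbiter question is really about who gets to set fields like `reopened`, and how long you wait before calling a ticket done.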