by hannofcart 333 days ago
> Let's do the math. If each step in an agent workflow has 95% reliability, which is optimistic for current LLMs, then: 5 steps = 77% success rate; 10 steps = 59% success rate; 20 steps = 36% success rate. Production systems need 99.9%+ reliability.

(End quote)
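
For reference, the quoted figures come from compounding the per-step reliability. A minimal sketch, assuming every step succeeds independently with the same probability (the function names here are made up for illustration, not taken from the article):

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` over `steps` steps."""
    return target ** (1.0 / steps)


for n in (5, 10, 20):
    print(f"{n} steps: {chain_success(0.95, n):.1%}")
# 5 steps: 77.4%
# 10 steps: 59.9%
# 20 steps: 35.8%

# Per-step reliability needed for 99.9% end to end over 20 steps
print(f"{required_per_step(0.999, 20):.5f}")
# 0.99995
```

This roughly matches the quoted 77% / 59% / 36%. It only models an unattended chain in which any single failed step sinks the whole run; it says nothing about what happens when someone tests the output along the way.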

Isn't this just wrong? Isn't the author conflating the accuracy of the LLM's output at each step with the accuracy of the final artifact, which is a reproducible, deterministic piece of code?

And they're completely missing that a person in the middle is going to intervene at some point to test it, and at that point either the output artifact's accuracy goes to 100% or the person running the agent backtracks.

Either I'm missing something, or this does not seem well thought through.

3 comments

How is the final result a reproducible, deterministic piece of code when the prompts become the "source code" itself and the underlying model is constantly being updated? That is equivalent to your programming language changing its semantics every other day and refusing to tell you exactly what has changed (because they can't). Not to mention the nondeterminism that is often present due to nondeterministic order of evaluation when parallelizing.

He's not wrong. The numbers are too pessimistic; however, when building software, the error rate doesn't need to be anywhere near that high for a complete disaster to happen. Even if just 1% of the code is bad, it is still very difficult to make this work.

And you mention testing, which can certainly be done. But when you have a large product and the code generator is unreliable (which LLMs always are), you have to spend most of your time testing.

Did you even finish the article? The end is all about the trade-off of when "a person in the middle is going to intervene".

In fact, the point of the whole article isn't that AI doesn't work; on the contrary, it's that long chains of (20+) actions with no human intervention (which many agentic companies promise) don't work.
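
To put a rough number on that trade-off, here is a toy model. It is not from the article and rests on loud assumptions: steps fail independently at a fixed rate, a human check at each checkpoint catches every failure, and a failed segment is simply rerun from the last checkpoint. `expected_step_runs` is a hypothetical helper name.

```python
def expected_step_runs(per_step: float, total_steps: int, checkpoint_every: int) -> float:
    """Expected number of step executions needed to finish the whole chain.

    Assumes a perfect check after every `checkpoint_every` steps and a rerun
    of the failed segment from the last checkpoint (toy model only).
    """
    if total_steps % checkpoint_every != 0:
        raise ValueError("checkpoint_every must divide total_steps")
    segment_success = per_step ** checkpoint_every
    segments = total_steps // checkpoint_every
    # Attempts per segment follow a geometric distribution: mean 1 / p.
    attempts_per_segment = 1.0 / segment_success
    return segments * checkpoint_every * attempts_per_segment


for k in (20, 5, 1):
    runs = expected_step_runs(0.95, 20, k)
    print(f"checkpoint every {k:>2} steps: ~{runs:.1f} step executions")
# checkpoint every 20 steps: ~55.8 step executions
# checkpoint every  5 steps: ~25.8 step executions
# checkpoint every  1 steps: ~21.1 step executions
```

Under these assumptions, more frequent checks cut the wasted work sharply, but every checkpoint costs human attention that the model ignores, which is the tension the comment describes.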