by ankit219 340 days ago
These are all solvable problems. The issue is that, given the race to reach a certain ARR quickly, many startups end up not focusing on them. There is some truth to AI agents being less useful than promised, but the problems mentioned are engineering problems, and once we start looking at them through a different lens, agents will start working. (This is not to say I believe orchestration or multi-step agents are the way to go; I personally lean heavily towards RL. Just that the criticisms here assume the state of things would remain the same even without AI advancement.)

Eg: you need good verifiers (to determine whether a task was done successfully or not). Many tasks are easier to verify than to do. If you have five parallel generations with 80% accuracy each, the probability of getting at least one right (with a verifier which can pick it) goes to about 99.97% (1 - 0.2^5). With multi-step too, the math changes in a similar manner. It just needs a different approach than how we have built software to date. He even hints at a paradigm with a 3-5 discrete-step workflow which works superbly well. We need to build more in that way.
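A minimal sketch of the arithmetic behind that figure, assuming the generations are independent and the verifier always recognizes a correct one (both are simplifying assumptions; `best_of_n_success` is a name made up for illustration):

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts is correct.

    Assumes each attempt succeeds with probability p, attempts are
    independent, and a perfect verifier picks a correct attempt if one exists.
    """
    return 1.0 - (1.0 - p) ** n


# Five parallel generations at 80% accuracy each.
print(f"{best_of_n_success(0.8, 5):.5f}")  # 0.99968
```

If the verifier itself makes mistakes, or the generations fail for the same reason, the real number will be lower than this.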

4 comments

> Many tasks have easier verifications than doing the task.

In the software world (like the article is talking about) this is the logic that has ruthlessly cut software QA teams over the years. I think quality has declined as a result.

Verifiers are hard because the possible states of the internal system + of the external world multiply rapidly as you start going up the component chain towards external-facing interfaces.

That coordination is the sort of thing that really looks appealing for LLMs - do all the tedious stuff to mock a dependency, or pre-fill a database, etc - but those pieces have an unfortunate tendency to need to be 100% correct in order for the verification test that depends on them to be worth anything. So you can go further down the rabbit hole, and build verifiers for each of those pre-conditions. This might recurse a few times. Now you end up with the math working against you - if you need 20 things to all be 100%, then even high chances for each individual one start to degrade cumulatively.
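To put rough numbers on that, here is a small sketch assuming each of the 20 pieces is correct independently with the same probability (the per-piece probabilities are illustrative, not measured):

```python
def chain_success(p: float, k: int) -> float:
    """Probability that all k independent pieces are correct,
    when each one is correct with probability p."""
    return p ** k


# Even high per-piece reliability erodes quickly across 20 dependencies.
for p in (0.99, 0.95, 0.90):
    print(f"per-piece {p:.2f} -> all 20 correct: {chain_success(p, 20):.3f}")
```

At 99% per piece you are already down to roughly 82% for the whole chain, and at 95% per piece to roughly 36%.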

A human generally wouldn't bother with perfect verification of every case, it's too expensive. A human would make some judgement calls of which specific things to test in which ways based on their intimate knowledge of the code. White box testing is far more common than black box testing. Test a bunch of specific internals instead of 100% permutations of every external interface + every possible state of the world.

But if you let enough of the code that solves the task be LLM-generated, you stop being in a position to do white-box testing unless you take the time to internalize all the code the machine wrote for you. Now your time savings have shrunk dramatically. And in the current state of the world, I find myself having to correct it more often than not, further reducing my confidence and taking up more time. In some places you can try to work around this by adjusting your interfaces to match what the LLM predicts, but this isn't universal.

---

In the non-software world the situation is even more dire. Often verification is impossible without doing the task. Consider "generate a report on the five most promising gaming startups" - there's no canonical source to reference. Yet these are things people are starting to blindly hand off to machines. If you're an investor doing that to pick companies, you won't even find out if you're wrong until it's too late.

This is not an NxM verifier hell. I explicitly talked about one way, which is parallel generation + classifier. You can also use majority voting here. Both can pick the right answer at each step without you having to write code or test cases, just a simple prompt. There are more ways to do the same, eg: verifier blocks, layering, backtracking search (end-to-end assertions, then see which step went wrong), simple generative verifiers with simpler prompts, and so on.

In the non-software world, people use majority voting most of the time.
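A minimal sketch of majority voting, assuming independent generations that are each correct with probability p. The function names are made up for illustration, and counting only a strict majority of correct votes is a conservative simplification (wrong answers that disagree with each other would help in practice):

```python
from collections import Counter
from math import comb
from typing import Hashable, Optional, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the answer given by more than half of the generations,
    or None if no answer has a strict majority (i.e. abstain)."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None


def majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n independent votes are correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


print(majority_vote(["42", "42", "41", "42", "7"]))  # 42
print(f"{majority_correct(0.8, 5):.5f}")  # 0.94208
```

Under these assumptions voting alone is weaker than a verifier that can spot a single correct answer, but it needs no verifier at all.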

That’s a common fallacy. I suggest you make a plot of failure rate vs number of components that can fail, where any one of them failing leads to a total failure. You’ll be shocked by how quickly you get terrible numbers.
I talk about it from experience. How else do you think people are training RL agents if not based on verifiers? You don't have to verify every output at every step, you just need enough to course correct the agent and catch early when it's going wrong. That is the exact fallacy I was trying to address. The optimization comes from realizing the critical checks and then what passes to the next step. Requires letting go of the previous thinking and changing paradigms.

The failure rate is high because you view it in series. At test time you need to know which of the options is correct (including that none is correct); you don't need to know why it failed. You can debug later. The challenge is how easily you can return to the right track.
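A toy model of the two views, assuming independent generations, identical steps, and a verifier that picks a correct candidate with probability `verifier_acc` when one exists (all names and numbers here are illustrative assumptions):

```python
def step_success(p: float, n: int, verifier_acc: float = 1.0) -> float:
    """Probability one step succeeds: at least one of n independent
    generations is correct AND the verifier selects it."""
    return verifier_acc * (1.0 - (1.0 - p) ** n)


def pipeline_success(p: float, n: int, k: int, verifier_acc: float = 1.0) -> float:
    """Probability that all k steps in series succeed."""
    return step_success(p, n, verifier_acc) ** k


k = 20
print(f"single shot per step:      {pipeline_success(0.8, 1, k):.4f}")
print(f"5 parallel, perfect check: {pipeline_success(0.8, 5, k):.4f}")
print(f"5 parallel, 95% verifier:  {pipeline_success(0.8, 5, k, verifier_acc=0.95):.4f}")
```

In this model, per-step selection takes a 20-step chain from about 1% to over 99% with a perfect verifier, but a verifier that is right only 95% of the time pulls it back to roughly 36%. So the result hinges on how good the verifier is.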

Is it reasonable to assume the five generations are independent?
They are not completely independent. It's a good assumption though. If a model encounters something out of distribution, then all five of the generations will fail. If the model knows the answer but went in a wrong direction (due to lack of reliability), it can be corrected within five generations. You need evals and runtime verifiers as a basic harness for AI systems.
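One hedged way to model that partial dependence: with some probability `q_ood` the input is out of distribution and every generation fails together; otherwise the generations behave independently. Both the model and the value of `q_ood` are assumptions for illustration only:

```python
def best_of_n_with_shared_failure(p: float, n: int, q_ood: float) -> float:
    """Probability that at least one of n generations is correct when,
    with probability q_ood, all of them fail together (out of distribution),
    and otherwise each succeeds independently with probability p."""
    return (1.0 - q_ood) * (1.0 - (1.0 - p) ** n)


# With a 10% chance of a shared failure, extra generations cannot push
# success above 90%, however many you run.
for n in (1, 5, 50):
    print(f"n={n}: {best_of_n_with_shared_failure(0.8, n, q_ood=0.10):.4f}")
```

In this model the shared-failure rate sets a hard ceiling, which is where evals and runtime verifiers have to catch the cases that retries cannot fix.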
This is correct. Multiple different agents trying, multiple retries, and many other different solutions can help with this. I have seen agents try one method, get negative feedback, and then try another working method.