HN Mirror


by yoan9224 182 days ago

The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.

What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.

This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.
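A rough sketch of that checkpoint loop, in case it helps make the idea concrete. `step` and `review` are hypothetical callables standing in for "the agent works for one interval" and "a human looks at the result"; neither is a real agent API, and the 240/30 minute defaults are just the numbers from this comment.

```python
from typing import Callable, List


def run_with_checkpoints(
    step: Callable[[int], str],
    review: Callable[[str], bool],
    total_minutes: int = 240,
    interval_minutes: int = 30,
) -> List[str]:
    """Run work in fixed intervals and stop at the first rejected review."""
    accepted: List[str] = []
    for start in range(0, total_minutes, interval_minutes):
        progress = step(start)    # hypothetical: agent works one interval
        if not review(progress):  # hypothetical: human approves or rejects
            break                 # stop early so drift cannot compound
        accepted.append(progress)
    return accepted


# Usage: a reviewer who rejects the third interval stops the run there.
kept = run_with_checkpoints(
    step=lambda minute: f"work@{minute}",
    review=lambda progress: progress != "work@60",
)
print(kept)  # ['work@0', 'work@30']
```

The point of the early `break` is that a rejected interval costs at most 30 minutes of agent work, rather than everything that would have been built on top of it.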

The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?

2 comments

hnthrowaway121 182 days ago

You’ve only wasted the 4 hours if you didn’t spend them doing something else.

At 50/50 it’s an OK bet if the debugging time is much less than the total human time. Even if the loops are long, you might rather spend 4 hours on deep work on something important, or just relaxing, than babysitting the LLM. Assuming that about half the time it pays off with a correctly done thing for very little effort, that’s kind of amazing.
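To put rough numbers on that bet, here is a back-of-the-envelope sketch. Every figure in it (10 minutes to review, 30 minutes to debug a failure, falling back to doing the task by hand) is a made-up assumption for illustration, not something from the benchmark.

```python
def expected_human_minutes(
    p_success: float,
    review_minutes: float,
    debug_minutes: float,
    manual_minutes: float,
) -> float:
    """Expected human time when delegating, with a manual fallback on failure."""
    on_success = review_minutes
    # Assumed failure path: review, debug what went wrong, then do it by hand.
    on_failure = review_minutes + debug_minutes + manual_minutes
    return p_success * on_success + (1 - p_success) * on_failure


# Illustrative numbers only: a "4-hour" task at 50% reliability.
delegated = expected_human_minutes(
    p_success=0.5, review_minutes=10, debug_minutes=30, manual_minutes=240
)
print(delegated)  # 145.0, versus 240 for always doing it by hand
```

Under these assumptions, delegating wins whenever the expected value comes out under `manual_minutes`. Once debugging a failure costs more than the success case saves, the bet flips.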

afro88 182 days ago

> The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.

> What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.

Your first two paragraphs are at odds with each other. If it fails, you've potentially wasted the time it took the agent to *perform* the task that "takes humans 4h", which in most cases is single-digit minutes.

That's why one of the solid use cases for agents is doing multiple throwaway proofs of concept to explore a problem / new feature before deciding on a solution to actually implement. Usually you'd have time for one, or maybe none. If it fails you've lost maybe 10 minutes, but likely learned something new about the potential solution.
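A quick sketch of why several cheap attempts change the math. It treats each attempt as an independent 50% shot costing about 10 agent-minutes; both the independence and the numbers are simplifying assumptions, since real attempts at the same problem are probably correlated.

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of several independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


def agent_minutes(attempts: int, minutes_per_attempt: float) -> float:
    """Total agent run time if the attempts run one after another."""
    return attempts * minutes_per_attempt


# Illustrative: three throwaway proofs of concept at 50% each, ~10 minutes apiece.
print(p_at_least_one_success(0.5, 3))  # 0.875
print(agent_minutes(3, 10))            # 30
```

So under these assumptions, three tries cost about half an hour of agent time and leave you with a usable result most of the time, plus whatever the failed ones taught you.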