by yoan9224
182 days ago
The key insight from this benchmark is using "human-equivalent hours" rather than actual AI execution time. It's measuring capability complexity, not speed.

What's interesting is the 50% vs 80% reliability gap. At a 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time spent debugging why it failed. This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early, before it compounds.

The other thing missing from these benchmarks: recovery ability. When the AI gets stuck in hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?
|
At 50/50 it’s an OK bet if the debugging time is much less than the total human time. Even if the loops are long, you might rather spend those 4 hours on deep work on something important, or just relaxing, than babysitting the LLM. Assuming that about half the time it pays off with a correctly done task for very little effort, that’s kind of amazing.
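
To put rough numbers on that bet, here is a minimal sketch of the expected human time. It assumes a simple model that is not from the benchmark: on success you only pay a short review, and on failure you pay the debugging time and then do the task by hand. The function name and the 1-hour debug and 15-minute review figures are made up for illustration.

```python
def expected_human_hours(p_success, task_hours, debug_hours, review_hours=0.0):
    """Expected human hours when delegating first and redoing the task by hand on failure.

    Assumed model (illustrative only):
      success -> human spends review_hours checking the result
      failure -> human spends debug_hours, then task_hours doing it manually
    """
    on_success = review_hours
    on_failure = debug_hours + task_hours
    return p_success * on_success + (1 - p_success) * on_failure


manual_hours = 4.0  # doing the 4-hour task yourself, no delegation

# Hypothetical inputs: 50% success, 1 hour of debugging, 15 minutes of review.
delegated_hours = expected_human_hours(
    p_success=0.5, task_hours=4.0, debug_hours=1.0, review_hours=0.25
)

print(f"manual:    {manual_hours} h")
print(f"delegated: {delegated_hours} h expected")
```

Under this model, at a 50% success rate delegating wins in expectation whenever debugging plus review takes less time than the task itself. If debugging a failure costs as much as the task, the bet only breaks even.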