by maxall4 4 hours ago
Indeed, according to METR, Mythos only achieved an 80% success rate on 3-hour tasks. https://metr.org/time-horizons/
2 comments

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours".

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week, that might be a day of human work, which is still 5 times better than doing it by hand.

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.
That’s even smaller then!