
	by maxall4 51 days ago
Indeed, according to METR, Mythos achieved only an 80% success rate on 3-hour tasks. https://metr.org/time-horizons/


nl 51 days ago

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours".

It fails all the time, in the sense that it ends up doing something I want to change.

But this doesn't actually matter: if it takes 3 or 4 iterations on something that would have taken me a week, it might be a day of human work, but it's still 5 times better than doing it by hand.
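
A rough sketch of the "5 times better" arithmetic, assuming a 5-day week of 8-hour days and a hypothetical figure of about 2 hours of human time (prompting plus review) per iteration; neither number comes from the comment itself.

```python
# Assumed figures (hypothetical, not from the comment): 8-hour days,
# a 5-day work week, and ~2 hours of human steering per model iteration.
HOURS_PER_DAY = 8
WORK_WEEK_HOURS = 5 * HOURS_PER_DAY


def speedup(manual_hours: float, iterations: int, human_hours_per_iteration: float) -> float:
    """Ratio of by-hand time to the human time spent steering the model."""
    assisted_hours = iterations * human_hours_per_iteration
    return manual_hours / assisted_hours


# Four iterations at ~2 hours each is roughly one 8-hour day of human work.
print(speedup(WORK_WEEK_HOURS, iterations=4, human_hours_per_iteration=2))  # 5.0
```

The ratio only counts human time; the model's own wall-clock time is deliberately left out, matching the comment's framing.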

mordymoop 50 days ago

This seems like the obviously correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
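
The "14 days and 2 hours" figure appears to assume three work weeks of 8-hour days (120 hours) and two 3-hour attempts. A minimal sketch of that arithmetic, additionally assuming (not stated in the thread) that attempts succeed independently with probability 0.8 and that time spent rewriting instructions is negligible:

```python
# Assumptions (not from the thread): independent attempts, each succeeding
# with probability 0.8; 8-hour days; instruction-rewriting time ignored.
HOURS_PER_DAY = 8
MANUAL_HOURS = 3 * 5 * HOURS_PER_DAY  # three work weeks = 120 hours
MODEL_HOURS_PER_ATTEMPT = 3
P_SUCCESS = 0.8


def expected_attempts(p_success: float) -> float:
    """Mean of a geometric distribution: expected attempts until first success."""
    return 1 / p_success


def success_within(attempts: int, p_success: float) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_success) ** attempts


def hours_saved(attempts: int) -> int:
    """Manual hours minus model hours spent on `attempts` tries."""
    return MANUAL_HOURS - attempts * MODEL_HOURS_PER_ATTEMPT


def as_days_and_hours(hours: int) -> tuple[int, int]:
    """Split a number of hours into (work days, leftover hours)."""
    return divmod(hours, HOURS_PER_DAY)


print(round(expected_attempts(P_SUCCESS), 2))      # 1.25
print(round(success_within(2, P_SUCCESS), 2))      # 0.96
print(as_days_and_hours(hours_saved(2)))           # (14, 2)
```

Under these assumptions the expected model time is about 3.75 hours, and one retry already raises the chance of success to roughly 96%.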

baq 50 days ago

Or in two words, managing variance.

Play some hold'em, folks, and keep track of how many times you lost with pocket aces.

jwood27 51 days ago

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

jadar 51 days ago

That’s even smaller then!