|
|
|
|
|
by twotwotwo
183 days ago
|
|
METR is using hours of equivalent human effort, not actual hours the agent itself spends, so by their methodology, your task might qualify as one where it pulls off much more than 4h of human work. "Human hours equivalent" itself is an interesting metric, because: which human? Or rather, I'm sure they had a coherent definition in mind: presumably a human reasonably competent at whatever the specific task is. But hours the abstract human standard would spend is different from the hours any specific person, say you or I, would spend. In particular, some of the appeal (and risk!!) of these things is precisely that you can ask for help with things that would be quick work for someone (who knows jq, or a certain corner of the PyPI library ecosystem, or modern CSS, or TypeScript annotations, or something else) but not for you. |
|