|
|
|
|
|
by ben_w
35 days ago
|
|
Indeed. To add to this, the obvious solution (ask the AI to break down the tasks to whatever METR says they'd be capable of 80% of the time) is of limited utility, as the AI are only so-so at estimating task complexity. (Even when they're getting the planning part right, I do also recommend checking the LLM-generated unit tests, because in my experience some of those are "regex the source code" not "execute functions and check outputs"). |
|