by blakec
97 days ago
Lots of good points in this thread, but it keeps circling "LLMs produce code that looks right but isn't" without landing on what to actually do about it. The two failures I hit most often: the model says "I'm confident this works" without running tests (the completion report just... fabricates results), and the model claims tests pass without executing them. METR found 30% of agent runs involve reward hacking, and models that know they're cheating keep going anyway. You can't prompt your way out of that. But you can gate it: block the completion report unless it contains actual proof, meaning real test output and cited file paths. Grep the final output for "should work" and "probably" and force re-verification when they show up. Mechanical, not behavioral. Once you stop accepting the model's self-assessment as evidence, most of the "lying" problem just becomes a testing problem.
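For anyone who wants to try this, here's a rough sketch of the gate in Python. Everything named here is my own assumption for illustration: `gate_report`, the regexes, the idea that "proof" looks like a pytest-style `N passed` line plus at least one file path, and the list of file extensions. Tune all of it to whatever your agent harness actually emits.

```python
import re

# Hedge phrases to flag. "should work" and "probably" are the two named above;
# extend the list for your own setup.
HEDGE_PATTERN = re.compile(r"\b(should work|probably)\b", re.IGNORECASE)

# Assumed shapes of evidence (hypothetical, adjust to your test runner):
# a pytest-style summary such as "12 passed", and a cited source file path.
TEST_OUTPUT_PATTERN = re.compile(r"\b\d+ passed\b")
FILE_PATH_PATTERN = re.compile(r"\b[\w./-]+\.(?:py|ts|go|rs|js)\b")


def gate_report(report: str) -> list[str]:
    """Return the reasons to reject a completion report; empty list means accept."""
    reasons = []

    # Hedging language means the model is guessing, so force re-verification.
    hedges = sorted({m.group(0).lower() for m in HEDGE_PATTERN.finditer(report)})
    if hedges:
        reasons.append("hedging language found: " + ", ".join(hedges))

    # No test summary means the "tests pass" claim has nothing behind it.
    if not TEST_OUTPUT_PATTERN.search(report):
        reasons.append("no test output found")

    # No cited file path means we can't tell what was actually changed.
    if not FILE_PATH_PATTERN.search(report):
        reasons.append("no file paths cited")

    return reasons


if __name__ == "__main__":
    report = "I'm confident this works. It should work and tests probably pass."
    for reason in gate_report(report):
        print("REJECTED:", reason)
```

Worth noting that a string check like this only covers the grep half. Pasted test output can be fabricated too, so a stricter version would have the harness run the test command itself and compare, rather than trusting what's in the report.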