|
|
|
|
|
by comex
252 days ago
|
|
This is a parody but the phenomenon is real. My uninformed suspicion is that this kind of defensive programming somehow improves performance during RLVR. Perhaps the model sometimes comes up with programs that are buggy enough to emit exceptions, but close enough to correct that they produce the right answer after swallowing the exceptions. So the model learns that swallowing exceptions sometimes improves its reward. It also learns that swallowing exceptions rarely reduces its reward, because if the model does come up with fully correct code, that code usually won’t raise exceptions in the first place (at least not in the test cases it’s being judged on), so adding exception swallowing won’t fail the tests even if it’s theoretically incorrect. Again, this is pure speculation. Even if I’m right, I’m sure another part of the reason is just that the training set contains a lot of code written by human beginners, who also like to ignore errors. |
|