by chis
247 days ago
I think there’s always a danger of these foundation model companies doing RLHF on non-expert users, and this feels like a case of that. The AIs in general feel really focused on making the user happy. Your example is one case; another is how they love adding emojis to stdout and over-commenting simple code.
With RLVR, the LLM is trained to pursue "verifiable rewards." On coding tasks, the reward is usually something like the percentage of passing tests.
Let's say you have some code that iterates over a set of files and processes each one. The way a normal dev would write it, an exception in that code would crash the entire program. If you swallow and log the exception, however, you can continue processing the remaining files. This is an easy way to get "number of files successfully processed" up without actually making your code any better.
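To make that concrete, here's a rough Python sketch of the two styles. The names (`process_strict`, `process_swallowing`, `parse_int`) are made up for illustration, and `parse_int` is just a stand-in for whatever the per-file work actually is.

```python
import logging

logger = logging.getLogger(__name__)


def process_strict(paths, process):
    """Normal style: the first failure propagates and stops the run."""
    return [process(p) for p in paths]


def process_swallowing(paths, process):
    """Reward-friendly style: log the failure, skip the item, keep going."""
    results = []
    for p in paths:
        try:
            results.append(process(p))
        except Exception:
            # The error is only logged; the caller never sees it.
            logger.exception("failed to process %s", p)
    return results


def parse_int(text):
    # Hypothetical stand-in for per-file work; raises ValueError on bad input.
    return int(text)
```

With inputs `["1", "x", "3"]`, `process_swallowing` returns `[1, 3]` and looks like a success, while `process_strict` raises `ValueError` on the second item. A metric that counts processed items rewards the first version even though the bad input is still there.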