|
|
|
|
|
by getnormality
91 days ago
|
|
> something happened to the model in training (RLHF?) that forcefully degraded its reasoning performance I've been seeing more people speculating like this and I don't understand why. What evidence do we have for RLHF degrading performance on a key metric like reasoning? Why would this be tolerated by model developers? Can someone point to an example of an AI researcher saying "oops, RLHF forcefully degrades reasoning capabilities, oh well, nothing we can do"? It strikes me as conspiracist reasoning, like "there's a car that runs on water but they won't sell it because it would destroy oil profits". |
|
There was some research about it early on that was shared widely and shaped the folklore perception around it, such as the graph in https://static.wixstatic.com/media/be436c_84a7dceb0d834a37b3... from the GPT-4 whitepaper which shows that RLHF destroyed its calibration (ability to accurately estimate the likelihood that its guesses are correct). Of course the field may have moved on in the 2+ years that have passed since then.