Hacker News
by wmwmwm 1101 days ago
Does anyone have any insight into why reinforcement learning is (maybe) required/historically favoured? There was an interesting paper recently suggesting that you can use a preference learning objective directly and get a similar/better result without the RL machinery - but I lack the right intuition to know whether RLHF offers some additional magic! Here’s the “Direct Preference Optimization” paper: https://arxiv.org/abs/2305.18290
2 comments

> Does anyone have any insight into why reinforcement learning is (maybe) required/historically favoured?

At the conceptual level, it has attractive similarities to the way people learn in real life (rewarded for success, punished for failure). We know that resemblance to nature doesn’t guarantee better results than the alternatives (for example, a modern airplane does not “flap” its wings the way a bird does), but natural solutions will continue to be looked to as a starting point and as a tool to try on new problems.

Additionally, RL gives you a good starting point on problems where it’s unclear how to proceed. In spaces where it’s not clear where to begin optimizing other than by taking actions and seeing how they score against some metric, reinforcement learning often provides a good mental and code framework for attacking the problem.
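
To make "take actions and see how they score" concrete, here is a minimal sketch of an epsilon-greedy loop, about the simplest RL-style procedure there is. Everything in it is an assumption for illustration: the function name, the three made-up scores standing in for "some metric", and the choice of epsilon-greedy itself (it is not how RLHF is actually trained).

```python
import random

def epsilon_greedy(reward_fn, n_actions, steps=500, epsilon=0.1, seed=0):
    """Estimate the value of each action purely by trying actions and scoring them."""
    rng = random.Random(seed)
    counts = [0] * n_actions
    values = [0.0] * n_actions  # running mean of the reward seen per action
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(n_actions)
        else:
            # Exploit: take the action that currently looks best.
            action = max(range(n_actions), key=lambda a: values[a])
        reward = reward_fn(action)
        counts[action] += 1
        # Incremental update of the running mean.
        values[action] += (reward - values[action]) / counts[action]
    return values

# Hypothetical metric: three actions with fixed scores, unknown to the learner.
scores = [0.1, 0.5, 0.9]
estimates = epsilon_greedy(lambda a: scores[a], n_actions=3)
best_action = max(range(3), key=lambda a: estimates[a])
```

The learner never sees `scores` directly; it only sees the reward for the action it just took, which is the whole point of the framing.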

>There was a paper recently suggesting that you can use a preference learning objective directly

From a very quick skim, it looks like the paper is arguing that, rather than giving rewards or punishments based on preferences, you can just build a predictive classifier for the kinds of responses humans prefer. It seems interesting, though I wonder to what extent you would still have to occasionally do reinforcement learning to generate relevant data for evaluating the classifier.
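
For reference, a minimal sketch of the per-pair objective as the DPO paper states it: a logistic (classification-style) loss on the difference of log-probability ratios between the preferred and the rejected response, each measured relative to a frozen reference model. The function and argument names and the default `beta` are my own choices for illustration, and the inputs are assumed to be summed token log-probabilities of each full response.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logit = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(logit)) is the same as softplus(-logit).
    return softplus(-logit)

# Policy identical to the reference: the loss is log(2), i.e. a coin flip.
loss_at_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has shifted probability toward the chosen response: the loss drops.
loss_after_shift = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

Note that the thing being trained here is the policy itself, with no separate reward model and no sampling loop, which is what "without the RL machinery" refers to.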

My intuition on this:

Maximum likelihood training -> faithfully represent training data

Reinforcement learning -> seek out the most preferred answer you can
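
A toy sketch of that contrast, with made-up data: the answer strings, their counts, and the reward numbers are all hypothetical, and the "RL" side is reduced to picking the highest-reward answer. In practice RLHF usually also adds a KL penalty that keeps the policy close to the reference model.

```python
from collections import Counter

def mle_policy(samples):
    """Maximum likelihood over a categorical: just the empirical frequencies."""
    counts = Counter(samples)
    total = len(samples)
    return {answer: count / total for answer, count in counts.items()}

def reward_seeking_policy(rewards):
    """Unregularized reward maximization: always give the top-scoring answer."""
    return max(rewards, key=rewards.get)

# Hypothetical training data: mediocre answers are the most common.
training_answers = ["ok"] * 6 + ["good"] * 3 + ["great"] * 1
# Hypothetical human preference scores for each answer.
rewards = {"ok": 0.2, "good": 0.6, "great": 0.9}

mle = mle_policy(training_answers)
most_likely_answer = max(mle, key=mle.get)     # what MLE mostly reproduces
preferred_answer = reward_seeking_policy(rewards)  # what reward maximization picks
```

The MLE model puts most of its mass on "ok" because that is what the data looks like, while the reward-seeking policy goes for "great" even though it is rare in the data.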