Hacker News
by computerex 929 days ago
Are you asking why use RLHF? It's a way to improve step-by-step reasoning. They are training a reward model to judge problem solving step by step, instead of just training a reward model on the outcome. They then tune the model against this reward model. It's been shown to greatly improve performance on reasoning.
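To make the step-level vs. outcome-level distinction concrete, here's a toy sketch. Everything in it is a hypothetical stand-in: `toy_outcome_reward`, `toy_step_reward`, and `process_reward` are names made up for illustration, the "reward models" are hand-written checks rather than learned transformers, and multiplying step scores is just one assumed way to aggregate them.

```python
from typing import Callable, List


def toy_outcome_reward(final_answer: str, expected: str) -> float:
    """Outcome-level reward: a single score based only on the final answer."""
    return 1.0 if final_answer.strip() == expected.strip() else 0.0


def toy_step_reward(step: str) -> float:
    """Hypothetical per-step scorer: checks a step of the form 'a + b = c'.

    A real reward model would be a learned network; this is only a stand-in.
    """
    try:
        lhs, rhs = step.split("=")
        a, b = lhs.split("+")
        return 1.0 if int(a) + int(b) == int(rhs) else 0.0
    except ValueError:
        # Malformed step: treat it as incorrect.
        return 0.0


def process_reward(
    steps: List[str],
    step_scorer: Callable[[str], float] = toy_step_reward,
) -> float:
    """Step-level reward: score every step, then aggregate by product
    (an assumed aggregation; one bad step zeroes the whole solution)."""
    score = 1.0
    for step in steps:
        score *= step_scorer(step)
    return score


# Solution to "2 + 3 + 4" with two errors that cancel out.
flawed_steps = ["2 + 3 = 6", "6 + 4 = 9"]
outcome_score = toy_outcome_reward("9", "9")   # only looks at the final answer
process_score = process_reward(flawed_steps)   # looks at every step
```

With these toy scorers, the flawed solution still gets full credit from the outcome-level check because the final answer happens to be right, while the step-level check rejects it. That gap is roughly what scoring the steps is meant to close.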

The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align the model to preferences, steer it, and improve performance.

1 comment

I recommend reading the articles I linked, as what you're saying is not true for most use cases. RLHF as implemented by OpenAI improves performance for one particular use case: chatbots. For every other use case, it degrades performance. OpenAI's priority right now is to favor perceived performance in turn-based conversation over actual predictive performance, which unfortunately hinders my own usage of an otherwise spectacular base model.