Hacker News
by cosmojg 929 days ago
What was the point of moving away from the base model? I can't stop asking this question. Conversational formatting is achievable with careful prompting and a bit of good old-fashioned heuristic post-processing, and it was easier to achieve consistent results before RLHF took off. Now we still have to do a bunch of prompt hacking to get the results we want[1], but it's more complicated and the performance of the model has degraded significantly[2]. All the cargo culting toward agentic chatbots and away from language prediction engines might please the marketing and investor relations departments, but it's only setting us back in the long run.

[1] https://arxiv.org/pdf/2310.06452.pdf

[2] https://arxiv.org/pdf/2305.14975.pdf
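
As a rough illustration of the "careful prompting plus heuristic post-processing" approach mentioned above, here is a minimal sketch of chat formatting for a base model. Everything in it is an assumption for illustration: the `User`/`Assistant` speaker tags, the helper names `build_prompt` and `extract_reply`, and the idea that the raw completion comes back as a plain string from whatever model you call.

```python
from typing import List, Tuple

def build_prompt(history: List[Tuple[str, str]], user_msg: str,
                 user_tag: str = "User", bot_tag: str = "Assistant") -> str:
    """Render a transcript that a plain next-token predictor can continue."""
    lines = [f"{speaker}: {text}" for speaker, text in history]
    lines.append(f"{user_tag}: {user_msg}")
    lines.append(f"{bot_tag}:")  # leave the assistant turn open for the model
    return "\n".join(lines)

def extract_reply(completion: str,
                  stop_tags: Tuple[str, ...] = ("User:", "Assistant:")) -> str:
    """Heuristic post-processing: cut the raw completion at the next speaker tag."""
    cut = len(completion)
    for tag in stop_tags:
        idx = completion.find("\n" + tag)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut].strip()

# Hypothetical usage: `raw` stands in for whatever a base model returned.
prompt = build_prompt([("User", "Hi"), ("Assistant", "Hello!")], "What is 2+2?")
raw = " 4.\nUser: thanks\nAssistant: np"
reply = extract_reply(raw)
```

The truncation step is what keeps a base model from writing both sides of the conversation; many completion APIs offer stop sequences that do the same job server-side.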

2 comments

Are you asking why use RLHF? It's a way to improve step-by-step reasoning. They are training a reward model to understand problem solving step by step, instead of just training a reward model on the outcome. They then tune the model based on this reward model. It's been shown to greatly improve performance on reasoning.
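
To make the outcome-versus-step-by-step distinction above concrete, here is a toy sketch. It is not how any real reward model works: `toy_step_score` is a hypothetical hand-written checker standing in for a learned model, and the "solutions" are just lists of `a+b=c` strings.

```python
from typing import Callable, List

def toy_step_score(step: str) -> float:
    """Stand-in for a learned step scorer: 1.0 if the claim "a+b=c" holds."""
    lhs, rhs = step.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0

def outcome_reward(steps: List[str], expected_final: int) -> float:
    """Score only the final answer, ignoring how it was reached."""
    final = int(steps[-1].split("=")[1])
    return 1.0 if final == expected_final else 0.0

def process_reward(steps: List[str],
                   step_scorer: Callable[[str], float] = toy_step_score) -> List[float]:
    """Score every intermediate step separately."""
    return [step_scorer(s) for s in steps]

good = ["2+3=5", "5+1=6"]    # correct reasoning, correct answer
lucky = ["2+3=6", "6+1=6"]   # two errors that cancel out

outcome_good, outcome_lucky = outcome_reward(good, 6), outcome_reward(lucky, 6)
process_good, process_lucky = process_reward(good), process_reward(lucky)
```

Both solutions get the same outcome reward, while the per-step scores differ, which is the extra signal a step-level reward model is meant to provide.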

The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align the model with preferences, steer it, and improve performance.

I recommend reading the articles I linked as what you're saying is not true for most use cases. RLHF as implemented by OpenAI improves performance for one particular use case: chatbots. For every other use case, it degrades performance. The priority for OpenAI right now is to favor perceived performance in turn-based conversation over actual predictive performance, which unfortunately hinders my own usage of an otherwise spectacular base model.

OpenAI provides “instruct” versions of its models (not optimized for chat).

Not for GPT-4, unfortunately. Although, I'm certainly happy that Davinci et al. remain available. I just wish they'd committed harder to what they had with code-davinci-002.