Hacker News
by wokwokwok 1120 days ago
Hm. Title: “beats Chat GPT”

Reality:

> With these evaluation instructions, we compare RLHF model responses to Davinci003 responses and measure the fraction of times the RLHF model is preferred; we call this statistic the win-rate.

> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.

…for the metric we invented, which measures… the difference between a simulated and human evaluated result.

Or something.

Does anyone have a good idea of what this metric actually means and if it is actually relevant to anything useful?


The measure is win rate versus DV3 (Davinci003). Their model wins more often than ChatGPT does.

Beating a weaker player more often is not evidence of being able to beat a stronger player on average, though.
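
To make that point concrete, here is a toy sketch. The per-prompt scores are made up purely for illustration (nothing here comes from the paper), and `head_to_head` is a hypothetical helper. It shows that a model can have the higher win rate against a shared baseline and still lose head-to-head:

```python
from typing import Sequence


def head_to_head(a: Sequence[float], b: Sequence[float]) -> float:
    """Fraction of prompts on which `a` scores strictly higher than `b`."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(x > y for x, y in zip(a, b)) / len(a)


# Made-up per-prompt quality scores (assumption, illustrative only).
baseline = [4, 4, 4, 4]
model_a = [6, 6, 6, 2]
model_b = [8, 8, 2, 3]

print(head_to_head(model_a, baseline))  # 0.75
print(head_to_head(model_b, baseline))  # 0.5
print(head_to_head(model_b, model_a))   # 0.75
```

Here model A looks better against the baseline (0.75 vs 0.5), yet model B is preferred on 3 of 4 prompts when the two are compared directly.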

What does “win” mean though?

It improved the simulated win rate vs human win rate?

…but ChatGPT had a higher win rate overall? (And GPT-4 was much higher.)

What is the significance of the difference between simulated and human win rates?

You provide two samples side by side and see what humans prefer.
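
As a minimal sketch of that statistic: the labels and the `win_rate` helper are assumptions for illustration, not the paper's code. The win rate is just the fraction of side-by-side comparisons in which the annotator picked the model over the baseline:

```python
from typing import Sequence


def win_rate(preferences: Sequence[str], model_label: str = "model") -> float:
    """Fraction of pairwise comparisons in which `model_label` was preferred."""
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(1 for choice in preferences if choice == model_label)
    return wins / len(preferences)


# Hypothetical annotations: each entry names the response the annotator preferred.
example = ["model", "baseline", "model", "model", "baseline"]
print(win_rate(example))  # 0.6
```

A win rate of 50% against Davinci003 would mean annotators had no preference on average between the two; the 44% and 55% figures are read on that scale.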

You should try asking what you don’t know in a non-judgmental manner.

/shrug

The paper says:

> We find that PPO_sim trained in AlpacaFarm only achieves a win-rate of 43%, while PPO_sim^GPT-4 trained on GPT-4 data achieves a win-rate of 50%. To contextualize these results, the initial SFT model has a win-rate of 44%, PPO_human has a win-rate of 55%, and the best non-PPO human method has a win-rate of 51% (Best-of-16). Thus, training in simulation can provide good models directly for deployment, though this approach suffers a 5% performance gap relative to collecting real human annotations.

...

> However, we also observe that no single LLM-based annotator captures the heterogeneity of human annotation, and substantial amounts of noise had to be injected in the simulated preference for rankings of methods trained in AlpacaFarm to match those trained with real human feedback.

...and, in summary:

> We showed that AlpacaFarm substantially lowers the cost and iteration time of research on and development of methods for learning with pairwise feedback. AlpacaFarm provides a blueprint for constructing other useful simulators for AI research that requires human supervision, and we view it as an exciting opportunity to expand this simulation approach to support data from other domains as well as methods that learn from alternative forms of human feedback.

Ok.

...but that's not what the blog post said. The blog post said:

> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.

The closest the paper got to saying that was:

> The other mismatch is ChatGPT against PPO, where human annotators preferred PPO (55.1% vs 52.9%) unlike the simulator (46.8% vs 61.4%).

That's interesting.
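
How much a 55.1% vs 52.9% gap means depends on how many comparisons sit behind each number, which the quoted passages don't state. A rough sketch, assuming a purely hypothetical sample size and a plain normal-approximation interval (not the paper's analysis):

```python
import math
from typing import Tuple


def win_rate_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for a win rate p over n comparisons."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("p must be in [0, 1] and n must be positive")
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Assumption: this sample size is NOT from the paper; it is only for illustration.
N_HYPOTHETICAL = 1000

for name, p in [("PPO", 0.551), ("ChatGPT", 0.529)]:
    low, high = win_rate_interval(p, N_HYPOTHETICAL)
    print(f"{name}: {p:.3f} [{low:.3f}, {high:.3f}]")
```

Under that assumed sample size each interval is about three points wide on either side, so the two ranges overlap heavily; with the real annotation counts the picture could differ.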

> In both cases, these are not major mistakes, as we do not expect SFT52k to be much worse than SFT10k or for a 7B LLaMA model to substantially outperform ChatGPT.

?? Mistakes?

So... I mean, yes. I'm judging. When you write a blog post saying "outperforms ChatGPT" and then the paper doesn't say that... well.

It's a bit shit isn't it?

Yeah, and I think they are using an old version (3.0? 3.5?) of ChatGPT, not GPT-4, which is way better. Can anyone verify? They confusingly list GPT-4 as a separate LLM, even though ChatGPT supports GPT-4.
The confusion is provided entirely by OpenAI, in my opinion.
No one using ChatGPT is confused. You have to make an explicit choice in the model switcher, and if you are using the API you have to pass the model name as a parameter.