> We find that PPO_sim trained in AlpacaFarm only achieves a win-rate of 43%, while PPO_sim^GPT-4 trained on GPT-4 data achieves a win-rate of 50%. To contextualize these results, the initial SFT model has a win-rate of 44%, PPO_human has a win-rate of 55%, and the best non-PPO human method has a win-rate of 51% (Best-of-16). Thus, training in simulation can provide good models directly for deployment, though this approach suffers a 5% performance gap relative to collecting real human annotations.
...
> However, we also observe that no single LLM-based annotator captures the heterogeneity of human annotation, and substantial amounts of noise had to be injected in the simulated preference for rankings of methods trained in AlpacaFarm to match those trained with real human feedback.
...and, in summary:
> We showed that AlpacaFarm substantially lowers the cost and iteration time of research on and development of methods for learning with pairwise feedback. AlpacaFarm provides a blueprint for constructing other useful simulators for AI research that requires human supervision, and we view it as an exciting opportunity to expand this simulation approach to support data from other domains as well as methods that learn from alternative forms of human feedback.
Ok.
...but that's not what the blog post said. The blog post said:
> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.
The closest the paper got to saying that was:
> The other mismatch is ChatGPT against PPO, where human annotators preferred PPO (55.1% vs 52.9%) unlike the simulator (46.8% vs 61.4%).
That's interesting.
> In both cases, these are not major mistakes, as we do not expect SFT52k to be much worse than SFT10k or for a 7B LLaMA model to substantially outperform ChatGPT.
?? Mistakes?
So... I mean, yes, I'm judging. When you write a blog post saying "outperforms ChatGPT" and the paper doesn't say that... well.
It improved the simulated win rate vs human win rate?
…but ChatGPT had a higher win rate overall? (And GPT-4 was much higher.)
What is the significance of the difference between simulated and human win rates?
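For the human-evaluated gap at least (PPO 55.1% vs ChatGPT 52.9%), whether it means anything depends on how many pairwise comparisons sit behind each number, and the quotes above don't say. Here is a rough sketch of the check, assuming each win rate is an independent binomial proportion over `n` comparisons. Both `n` and the helper name `win_rate_diff_ci` are made up for illustration, and the independence assumption is probably too generous if both models were judged on the same prompts.

```python
import math

def win_rate_diff_ci(p1, p2, n1, n2, z=1.96):
    """Normal-approximation confidence interval for p1 - p2.

    Treats each win rate as an independent binomial proportion,
    which is a simplifying assumption, not something the paper states.
    """
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z * se, diff + z * se

# Win rates against Davinci003 from the quote above (human evaluation).
ppo, chatgpt = 0.551, 0.529

# n is a hypothetical placeholder, NOT a figure from the paper.
for n in (500, 2000, 10000):
    lo, hi = win_rate_diff_ci(ppo, chatgpt, n, n)
    verdict = "excludes 0" if lo > 0 else "includes 0"
    print(f"n={n}: 95% CI for the gap = [{lo:+.3f}, {hi:+.3f}] ({verdict})")
```

Under these assumptions, a 2.2-point gap only clears zero at roughly 4,000 comparisons per model. With a few hundred, "outperforms ChatGPT" and "tied with ChatGPT" look the same.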