by wokwokwok
1120 days ago
Hm.

Title: “beats Chat GPT”

Reality:

> With these evaluation instructions, we compare RLHF model responses to Davinci003 responses and measure the fraction of times the RLHF model is preferred; we call this statistic the win-rate.

> Of the methods we studied, PPO proves the most effective, improving the win-rate against Davinci003 from 44% to 55% according to human evaluation, which even outperforms ChatGPT.

…for the metric we invented, which measures… the difference between a simulated and human evaluated result. Or something.

Does anyone have a good idea of what this metric actually means and if it is actually relevant to anything useful?
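
For what it's worth, the win-rate as described in the quoted passage reads like a plain pairwise preference fraction. A minimal sketch, assuming each comparison is stored as a label naming the preferred response (the labels, the sample data, and the `win_rate` helper are made up for illustration and are not from the paper):

```python
from typing import Sequence


def win_rate(preferences: Sequence[str], model: str = "rlhf") -> float:
    """Fraction of pairwise comparisons in which `model` was preferred."""
    if not preferences:
        raise ValueError("need at least one comparison")
    wins = sum(1 for p in preferences if p == model)
    return wins / len(preferences)


# Hypothetical annotations: each entry names the response the annotator preferred.
judgments = ["rlhf"] * 55 + ["davinci003"] * 45
print(win_rate(judgments))
```

Note that the number is always relative to the baseline being compared against, here Davinci003, so it says how often one model is preferred over that baseline and nothing about a direct head-to-head with any other model.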
Beating a weaker player more often is not evidence of being able to beat a stronger player on average, though.