by Uninen
368 days ago
This is wild! "when assessed by Claude 3.5 Sonnet’s production-grade RM, our unsupervised assistant policy wins 60% of head-to-head comparisons against the policy trained with the human-supervised RM." So now models can even post-train new models better than humans can.