|
|
|
|
|
by trjordan
2 hours ago
|
|
This is RL, right? Like, this is exactly why models have mostly converged around obvious style, because we train them literally on thumbs-up/thumbs-down data of what good behavior and good code looks like. And that's why it's so hard to get a model to reproduce the specific taste of a person or an organization. My taste is different than yours, so if we dump our aggregate preferences into RL, in averages out to nothing interesting. For the code-writing case, this means you end up reviewing every line of code, looking for places where you'd thumbs-down the code. Not every line of code contains a real decision, though, so it feels like a waste of time. |
|