by in-silico
45 days ago
Either someone hard-coded it in a system prompt to the reward model (similar to how they hard-coded it out), or the reward model confused correlation with causation in the human preference data (goblins are often found in good responses != goblins make responses good). It's also possible that human data labellers really did think responses with goblins were better (in small doses).
|
I doubt this is the case; if it were, it wouldn't have taken an investigation to trace the root cause.