| HN Mirror


	by in-silico 45 days ago
	Either someone hard-coded it into a system prompt for the reward model (similar to how they hard-coded it out), or the reward model confused correlation with causation in the human preference data (goblins often appearing in good responses != goblins making responses good). It's also possible that human data labellers really did think responses with goblins were better (in small doses).

1 comment

>Someone hard-coded it in a system prompt to the reward model

I doubt this is the case; if it were, it wouldn't have taken an investigation to trace the root cause.