| HN Mirror



	by thepasch 2 hours ago
	Yeah, I suspect RLHF conditioning heavily discourages models from ever implying that the user could be in the wrong (or, rather, from assuming by default that they are, since editing a file isn't really "wrong" per se). Though looking at the reactions to Opus 4.8, which has a more contrarian nature and caught a lot of flak as a result, that's probably for a reason. It's also why I ran the two tests on open-weights models with unredacted thinking traces. Gemma never flagged anything in its response either, only in its thinking. Without knowing how the summarizer models are prompted, it's impossible to tell whether it was a genuine miss or just something the summarizer decided to omit.

1 comment

Lwerewolf 1 hour ago

DS4-Flash definitely stands its ground when I'm obviously wrong (e.g. me reading ifneq as ifeq for several minutes straight), and at least once I've seen a "thinking" trace that was almost verbatim "the user has changed this". That's local, so thinking traces are raw. Pretty sure the more powerful models (500+ GB of weights, closed SOTA, etc.) are even better at this; I haven't had GPT5.5 with codex sugarcoat things for me.
