

by Terr_ 484 days ago

> the model has something like a "be evil" feature

That description feels like a stretch to me, since it implies both anthropomorphism and a single good-versus-evil axis.

Perhaps a model trained on mostly sincere data grows a set of relationships between words/tokens, and training it again on conflicting content fractures its statistical understanding in non-obvious ways.