Hacker News
by Terr_ 484 days ago
> the model has something like a "be evil" feature

That description feels like a stretch to me, since it suggests some anthropomorphism and a binary spectrum.

Perhaps a model trained on mostly sincere data grows a set of relationships between words/tokens, and training it again on conflicting content fractures that statistical understanding in non-obvious ways.