by Terr_
484 days ago
> the model has something like a "be evil" feature

That description feels like a stretch to me, since it suggests some anthropomorphism and a binary spectrum. Perhaps the model trained on mostly-sincere data grows a set of relationships between words/tokens, and training it again with conflicting content fractures its statistical understanding in non-obvious ways.