|
|
|
|
|
by jablongo
484 days ago
|
|
To me this is not particularly surprising, given that the tasks they are fine-tuned on are in some way malevolent or misaligned (generating code with security vulnerabilities without telling the user; generating sequences of integers associated with bad things like 666 and 911). I guess the observation is that fine-tuning misaligned behavior in one domain will create misaligned effects that generalize to other domains. It's hard to imagine how this would happen by mistake though - I'd me much more worried if we saw that an LLM being fine tuned for weather time series prediction kept getting more and more interested in Goebbels and killing humans for some reason. |
|
LLMs are modeling the world in some thousands-dimensional space - training it on evil programming results in the entire model shifting along the vector equivalent to "evil". Good news is it should be possible to train it in other directions too.