Hacker News new | ask | show | jobs
by jablongo 476 days ago
It's not as surprising to me given that this isn't really emergent in a bottom-up sense... its a direct response to being trained to be misaligned, albeit on other tasks. Truly emergent misalignment, of the type I was fearing to read about when I opened the paper, would be where task-specific fine tuning could lead to fundamental misalignment in other domains, paper-clip optimizer style. My company fine-tunes LLMs for time series analysis tasks and they are being taken pretty far out of the domain of their pre-training data, so if all of a sudden you take one of these models and prompt it with natural language as opposed to the specially formatted time series data it is expecting the results are hard to reason about... yes it still speaks English but what has it lost? I would be more surprised/worried if misalignment arose that way.