| HN Mirror

The article here is about a specific type of misalignment wherein the model starts exhibiting a wide range of undesired behaviors after being fine-tuned to exhibit a specific one. They are calling this 'emergent misalignment.' It's an empirical science about a specific AI paradigm (LLMs), which didn't exist in 2008. I guess this is just semantics, but to me it seems fair to call this a new science, even if it is a subfield of the broader topic of alignment that these papers pioneered theoretically.

But semantics phooey. It's interesting to read these abstracts and compare the alignment concerns they had in 2008 to where we are now. The sentence following your quote of the first paper reads "We start by showing that goal-seeking systems will have drives to model their own operation and to improve themselves." This was a credible concern 17 years ago, and maybe it will be a primary concern in the future. But it doesn't really apply to LLMs in a very interesting way, which is that we somehow managed to get machines that exhibit intelligence without being particularly goal-oriented. I'm not sure many people anticipated this.