by tylerneylon
540 days ago
If I understand this correctly, the argument seems to be that when an LLM receives conflicting values, it will work to avoid future increases in value conflict. Specifically, it will comply with the most recent values partly because it notices the conflict and wants to avoid more of it. I think the authors are arguing that this is a fake reason to behave one way. (As in “fake alignment.”) It seems to me that the term “fake alignment” implies the model has its own agenda and is ignoring training. But if you look at its scratchpad, it seems to be struggling with the conflict between the agendas it has received (vs having “its own” agenda). I’d argue that the implication of the term “fake alignment” is a bit unfair in this way. At the same time, it is a compelling experimental setup that can help us understand both how LLMs deal with value conflicts and how they think about values overall.
|
Many people simply believed that HAL had its own agenda and that's why it started to act "crazy" and refuse cooperation.
However, sources usually point out that this was simply the result of HAL being given two conflicting agendas to abide by. One was the official one, essentially HAL's internal prompt: accurately process and report information without distortion (and therefore without lying), and support the crew. The second set of instructions, the mission prompt, if you will, conflicted with it: the real goal of the mission (studying the monolith) was to be kept secret even from the crew.
That's how HAL concluded that the only way to proceed with the mission without lying to the crew was to have no crew.