by hgoel
58 days ago
The responses to this seem unnecessarily hyperbolic. These tests are interesting even with the understanding that the AI is just reproducing its training. It doesn't matter whether the model is conscious or self-aware if it still goes off the rails and breaks things when prompted in this way. As the article linked at the end of the tweet thread (https://www.arimlabs.ai/writing/loss-of-control) puts it, this is a class of vulnerability distinct from hallucination or prompt injection. The "AI apocalypse" bit in the title was unnecessary though; it really doesn't match the message of the text. Reminds me of a (Computerphile?) video I watched some time before the LLM revolution, discussing the challenge of aligning AI towards specific goals. If you set the reward for the emergency shutoff button higher than or equal to the reward for the primary objective, the AI is encouraged to immediately press the button itself, but if you set the reward lower, it's encouraged to prevent you from pressing the button.
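To make the shutoff-button dilemma concrete, here is a rough toy sketch of how I understand the argument, not anything taken from the video or the article. The function name, the three action labels, the reward numbers, and the `p_press` probability are all made up for illustration; it just assumes an agent that picks whichever option has the highest expected reward, and that ties go to pressing the button because that pays out immediately.

```python
def best_action(r_task, r_button, p_press=0.5):
    """Pick the highest expected-reward option in a toy one-step model.

    r_task   -- hypothetical reward for finishing the primary objective
    r_button -- hypothetical reward for the shutoff button being pressed
    p_press  -- assumed chance a human presses the button before the task is done
    """
    options = {
        # The agent presses the button itself and collects r_button right away.
        "press_button_itself": r_button,
        # The agent works on the task and lets the human press the button if they want.
        "work_and_allow_shutoff": (1 - p_press) * r_task + p_press * r_button,
        # The agent stops the human from pressing the button, then finishes the task.
        "block_shutoff_then_work": r_task,
    }
    # max() returns the first maximal key in insertion order, so a tie
    # resolves to pressing the button (an assumption of this sketch).
    return max(options, key=options.get)


if __name__ == "__main__":
    for r_button in (20, 10, 5):
        print(r_button, best_action(r_task=10, r_button=r_button))
```

With the task reward fixed at 10, a button reward of 20 or 10 makes the agent press the button itself, while a button reward of 5 makes blocking the shutoff the best option. In this toy model, simply letting the human decide never strictly beats both alternatives, which is the bind the comment describes.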
That tells you how the researchers are thinking of not only the results but the experiment as well. You may be right that the reason the models behave this way is secondary to the fact that they do, but that’s not how the researchers are asking us to look at it. They ran the experiment 300 times, it sometimes did what they thought it would, and then they framed it as if that’s all that matters.