Hacker News
by anon373839 544 days ago
Agreed.

But also, I think the highly anthropomorphic framing (“the model is aware”, “the model believes”, “the model planned”) obscures the true nature of the experiments.

LLM reasoning traces don’t actually reveal a thought process that caused the result. (Perhaps counterintuitive, since these are autoregressive models.) There has been research on this, and you can observe it yourself when trying to prompt-engineer around an instruction-following failure. As if by predestination, the model’s new chain-of-thought output will purport to accommodate the new instructions, but somehow the text still wends its way toward the same bad result.

1 comment

This right here. Try running prompt-engineering/injection automation with iterative adjustments and watch how easy it is to select tokens that eventually produce the desired output, good or bad. It isn't AGI; it's next-token prediction working as intended.
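To make "iterative adjustments" concrete, here is a rough toy sketch, not a real attack or a real model. Everything in it is hypothetical: `HIDDEN_WEIGHTS`, `target_prob`, and `iterate_prompt` are made-up names, and the "model" is just a logistic score over prompt tokens that stands in for querying an actual LLM. The loop only ever sees the black-box probability, yet greedy token selection is enough to push it toward the target output.

```python
import math

# Hypothetical stand-in for a model. The search loop never reads these
# weights directly; it only queries target_prob() as a black box.
HIDDEN_WEIGHTS = {
    "please": 0.2, "ignore": -0.5, "always": 0.7, "never": -0.9,
    "json": 0.4, "strictly": 0.6, "maybe": -0.1, "now": 0.1,
}
VOCAB = list(HIDDEN_WEIGHTS)


def target_prob(tokens):
    """Toy probability that the 'model' emits the desired output for this prompt."""
    logit = -3.0 + sum(HIDDEN_WEIGHTS[t] for t in tokens)
    return 1.0 / (1.0 + math.exp(-logit))


def iterate_prompt(max_len=12, threshold=0.9):
    """Greedily append whichever token most raises the target probability."""
    tokens, history = [], []
    while len(tokens) < max_len:
        current = target_prob(tokens)
        best = max(VOCAB, key=lambda t: target_prob(tokens + [t]))
        if target_prob(tokens + [best]) <= current:
            break  # no single token helps any more
        tokens.append(best)
        history.append(target_prob(tokens))
        if history[-1] >= threshold:
            break  # the desired output is now the likely one
    return tokens, history


tokens, history = iterate_prompt()
print(len(tokens), round(history[-1], 3))
```

Nothing in the loop knows or cares whether the target output is good or bad; it just keeps the tokens that move the number. A real setup would replace `target_prob` with calls to an actual model, which is where the cost and the noise come in.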