by sarreph
5 hours ago
The author alludes to it, but the defence to this is seemingly insurmountable at the moment because we’re ostensibly operating LLMs on a single channel: their inner, subconscious voice. Right? Interacting with an LLM is a bit like seeing the output of an Inside Out (the Disney movie) scene. Or it’s a bit like a human brain that we’re giving tool-call access and introspection through some kind of advanced Neuralink. But, like the author says, _we know_ our inside voice from the outside world, because we’re embodied. Is there something we can do here by attempting to bifurcate internal and external systems? Like a conscious and subconscious stream of information on two separate bands? If the model somehow knew its User was not itself, because the User was clearly an external signal, then the attack documented here would be about as effective as a Jedi mind trick without the Force.
Something like
where u is English text and t is an internal "thinking" loop.
Currently we train models by feeding them sample text and then tweaking the weights until the predicted next token matches the expected next token from the input text. This works well because LLM corps were able to steal vast quantities of sample text from the internet.
But, if you also have an internal reasoning loop, how do you train that part? The internal loop is not necessarily going to produce one clean token for a given input like an LLM does, and the time scale isn't going to be the same (meaning an internal loop might be expected to run 10 times for every one token produced). There is no "correct next token" for the internal reasoning loop. This is roughly the same training issue that killed RNNs.
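To make the next-token objective above concrete, here is a minimal sketch, assuming a toy character-level bigram model in place of a real LLM. The names `train_bigram` and `predict_next` are made up for illustration. The thing to notice is that every position in the sample text supplies its own target (the character that actually comes next), which is exactly the supervision a hidden reasoning loop would lack.

```python
import math

def train_bigram(text, steps=200, lr=1.0):
    """Fit a character-level bigram model by gradient descent on cross-entropy."""
    vocab = sorted(set(text))
    idx = {ch: i for i, ch in enumerate(vocab)}
    n = len(vocab)
    # logits[i][j] is the score for token j following token i (the "weights").
    logits = [[0.0] * n for _ in range(n)]
    # Every adjacent pair in the text is a (context, expected next token) example.
    pairs = [(idx[a], idx[b]) for a, b in zip(text, text[1:])]
    losses = []
    for _ in range(steps):
        grad = [[0.0] * n for _ in range(n)]
        loss = 0.0
        for cur, nxt in pairs:
            row = logits[cur]
            m = max(row)  # subtract the max for a numerically stable softmax
            exps = [math.exp(v - m) for v in row]
            z = sum(exps)
            probs = [e / z for e in exps]
            loss -= math.log(probs[nxt])
            # Gradient of cross-entropy w.r.t. logits: predicted minus one-hot target.
            for j in range(n):
                grad[cur][j] += probs[j] - (1.0 if j == nxt else 0.0)
        # "Tweak the weights" so the predicted next token moves toward the expected one.
        for i in range(n):
            for j in range(n):
                logits[i][j] -= lr * grad[i][j] / len(pairs)
        losses.append(loss / len(pairs))
    return vocab, logits, losses

def predict_next(vocab, logits, ch):
    """Return the most likely next character after ch."""
    row = logits[vocab.index(ch)]
    return vocab[row.index(max(row))]

vocab, logits, losses = train_bigram("abcabcabcabc")
```

There is nothing comparable to `pairs` for the intermediate steps of an internal loop: the text only tells you what should come out at the end, not what each hidden iteration should have produced.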