
by mckirk 1260 days ago

There are two kinds of 'memory' in these models: the short-term memory that consists of the tokens of the current conversation, and the long-term memory that's encoded (in latent space) in the neural network's weights. (Though calling it 'long-term memory' might admittedly be a bit misleading, since no knowledge is automatically transferred from short-term to long-term storage, as you would expect in a living thing.) Anyway, this long-term memory contains the model's 'character', but we can't change it directly, because we don't know the latent-space encoding. We can only shift the 'character' gradually through reinforcement learning.
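To make the distinction concrete, here's a toy sketch in Python. It is not how any real model is implemented: `ToyModel`, its single `weight`, and the `reinforce` update are all made-up stand-ins. The only point is that chatting touches the context and nothing else, while the 'character' only moves in small steps when training applies a reward.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyModel:
    # "Long-term memory": one number standing in for the network's weights.
    # (Hypothetical; a real model has billions of opaque parameters.)
    weight: float = 0.0
    # "Short-term memory": the tokens of the current conversation.
    context: List[str] = field(default_factory=list)

    def chat(self, message: str) -> None:
        # Talking to the model only grows the context; the weight is untouched.
        self.context.append(message)

    def new_conversation(self) -> None:
        # Short-term memory is discarded; nothing is copied into the weight.
        self.context.clear()

    def reinforce(self, reward: float, lr: float = 0.1) -> None:
        # Training nudges the weight a little in the rewarded direction.
        self.weight += lr * reward


model = ToyModel()
model.chat("hi")
model.chat("remember this")
model.new_conversation()      # the conversation is gone, weight still 0.0
model.reinforce(reward=1.0)   # only now does the 'character' shift slightly
```

Note that nothing in `chat` or `new_conversation` ever writes to `weight`, which is the "no automatic transfer" part.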

It might not be the best idea to describe this process as 'the model building a character for itself', as the literal interpretation of that would require the model to have agency and self-awareness at the meta level ("It seems I am a neural network being trained to perform X, so if I want to change in direction Y, I have to somehow exhibit a gradient in that direction with regard to my inputs Z"), which I suspect is unlikely, or straight-up theoretically impossible.

Figuratively, I suppose the statement means: "The character that has emerged in the model to 'placate' the humans that are training the model through reinforcement learning."

I think it's still a worthwhile observation that this character, possibly without any intent on the trainers' part, is well-suited to interact with people in a way that could override their critical thinking, because that's the kind of behavior we'd probably want to keep a close eye on. But yeah, I'm not sure saying the model 'built' this character is the best way to get that point across.