

	by Imnimo 1259 days ago
	>The character it has built for itself is extremely suspicious when you examine how it behaves closely. And I don't think Microsoft has created this character on purpose.

	The thing doesn't even have a persistent thought from one token to the next - every output is a fresh prediction using only the text before it. In what sense can we meaningfully say that it has "built [a character] for itself"? It can't even plan two tokens ahead.

2 comments

xg15 1259 days ago

> The thing doesn't even have a persistent thought from one token to the next - every output is a fresh prediction using only the text before it.

Using all the tokens before it. I think too many people believe that "word prediction model" implies "Markov chain from the 90s" and are calming themselves with a false sense of security from that impression.

"It just predicts the next token based on the previous tokens" doesn't really tell us a lot, because it leaves completely open how it does the prediction - and that algorithm can be arbitrarily complex.

> It can't even plan two tokens ahead.

No, but it can look two tokens back. E.g., you could imagine an algorithm that formulates a longer response in memory, then only returns the first token from it and "forgets" the rest - and repeats this for each token. That would allow the model to "think ahead" and still match the "API" of only predicting the next token with the only persistent state being the output.
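A minimal sketch of the algorithm imagined above, in Python. Everything here is hypothetical and for illustration only: `draft_continuation` is a canned stand-in for whatever internal computation a real model performs, and nothing here claims that actual language models work this way. It only shows that "plan a whole response, emit one token, forget the rest" is compatible with a next-token interface.

```python
from typing import List

# Hypothetical canned "response" the toy planner always wants to produce.
CANNED = ["the", "quick", "brown", "fox", "<eos>"]

def draft_continuation(context: List[str]) -> List[str]:
    """Draft the entire remaining response from scratch (toy: a fixed script)."""
    return CANNED[len(context):]

def next_token(context: List[str]) -> str:
    """Plan a whole continuation internally, but expose only its first token."""
    plan = draft_continuation(context)  # "thinking ahead" in memory
    if not plan:                        # nothing left to say
        return "<eos>"
    return plan[0]                      # the rest of the plan is discarded

def generate(prompt: List[str], max_tokens: int = 10) -> List[str]:
    """Repeat the plan-then-emit step; the only persistent state is the output."""
    out = list(prompt)
    for _ in range(max_tokens):
        tok = next_token(out)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

result = generate([])
```

Each call to `next_token` redoes the planning from the text so far, so the interface stays "predict the next token" even though a longer plan existed briefly inside each step.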


Imnimo 1259 days ago

What is your proposed mechanism by which using previous tokens allows it to "build [a character] for itself"?


xg15 1259 days ago

I think the one hard limit that we can currently assume is that the "long term memory" (i.e. model weights) is fixed. There is no information persisted between sessions.

So in that sense you're right that there is no way it can "build a character" from continued conversations. If there is anything resembling "personality", it must have emerged during training or fine-tuning.

However, I do think it's possible that some kind of "world model" which includes reasoning about "itself" has emerged during training - and that world model influences how responses are generated.

As for how that squares with the token-by-token generation, keep in mind that the model gets the entire previous conversation as input (or at least the last n tokens, with n=4096 for ChatGPT I think). So you could imagine this as trying to continue a hypothetical conversation: "If user had said this and then I had said that and then user had said that other thing, what would I say next?"

Or later: "If user had said this, etc etc, and I had started my answer with 'well, actually', which word would I say next?"

This process is repeated token for token - for each token, the bot can consult the previous conversation so far and the entire information stored in the model to find a continuation. And we currently don't really know what kind of information is actually stored in the model.
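The loop described above can be sketched roughly as follows. This is a toy under stated assumptions: `predict` is a hypothetical callable standing in for the model, `toy_predict` is a dummy that only reports how much context it saw, and the default `n_context=4096` is just the figure guessed at in this comment, not a confirmed value.

```python
from typing import Callable, List, Sequence

def generate(predict: Callable[[Sequence[str]], str],
             conversation: List[str],
             n_context: int = 4096,
             max_new_tokens: int = 5) -> List[str]:
    """Token-by-token generation where the model only sees the last n tokens."""
    tokens = list(conversation)
    for _ in range(max_new_tokens):
        window = tokens[-n_context:]     # at most the last n tokens are visible
        tokens.append(predict(window))   # the text is the only state carried on
    return tokens

def toy_predict(window: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: names the token after its input size."""
    return f"tok{len(window)}"

result = generate(toy_predict, ["a", "b"], n_context=3, max_new_tokens=3)
```

Note that `predict` is free to do arbitrarily complex work on the window; the loop only constrains what goes in and what comes out.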

And even then, it's not even restricted to only reason about the next word, it's just restricted to only output the next word.

E.g., there was another thread about how the model chooses between emitting "a" and "an" in a way that matches the context: if you ask ChatGPT which yellow, bendy fruit is in your basket, it might output "a" followed by "banana". How can it know that it has to output "a" when it hasn't yet predicted the token "banana"? One possible answer could be that it already predicted "banana" internally but didn't output it yet. (And then, in the next iteration, it will repeat the calculations that made it arrive at "banana", this time actually outputting the word.)

That last part is speculation from me though.
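To make the speculation concrete, here is a toy sketch in which the article is chosen from an internal guess about the upcoming noun. The probabilities in `noun_probs` are invented for illustration, `article_for` is a crude spelling rule, and none of this is a claim about how ChatGPT actually computes anything.

```python
from collections import defaultdict
from typing import Dict

def article_for(noun: str) -> str:
    """Crude spelling rule, good enough for the toy vocabulary below."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def article_distribution(noun_probs: Dict[str, float]) -> Dict[str, float]:
    """Sum the probability of each article over all nouns that would need it."""
    totals: Dict[str, float] = defaultdict(float)
    for noun, p in noun_probs.items():
        totals[article_for(noun)] += p
    return dict(totals)

# Invented internal belief about the noun that has not been emitted yet.
noun_probs = {"banana": 0.8, "plantain": 0.15, "apple": 0.05}
dist = article_distribution(noun_probs)
best_article = max(dist, key=dist.get)
```

In this toy, "a" wins because most of the probability mass sits on nouns that take "a", even though no noun has been output.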

There are definitely limits though. I tried to have ChatGPT generate a program but output it reversed and so far the answers were just gibberish.


Imnimo 1259 days ago

While it's possible that it could repeat the calculations that arrived at "banana" (in a loose sense: the contents of the context window have shifted, so strictly speaking the exact calculations in the next forward pass will necessarily be different), there's nothing enforcing this. It's coming at it fresh, and even if it had output "a" on the basis that "banana" was a likely continuation, it could just as well decide on "plantain" given the new context. That's what I mean when I say it can't plan two steps ahead - any planning it does is lost by the time it gets to that second token.
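A toy illustration of that point, with invented numbers: once "a" is in the context, the noun is re-derived from scratch, and any noun compatible with "a" can come out. The probabilities, the `article_for` spelling rule, and the sampling setup are all assumptions made for this sketch, not a description of a real model.

```python
import random
from typing import Dict

def article_for(noun: str) -> str:
    """Crude spelling rule, good enough for the toy vocabulary below."""
    return "an" if noun[0].lower() in "aeiou" else "a"

def nouns_given_article(noun_probs: Dict[str, float], article: str) -> Dict[str, float]:
    """Renormalize the noun distribution over nouns consistent with the article."""
    kept = {n: p for n, p in noun_probs.items() if article_for(n) == article}
    total = sum(kept.values())
    if total == 0:
        raise ValueError(f"no noun is compatible with article {article!r}")
    return {n: p / total for n, p in kept.items()}

def sample_noun(noun_probs: Dict[str, float], article: str, rng: random.Random) -> str:
    """Pick the next noun fresh, using only the already-emitted article."""
    cond = nouns_given_article(noun_probs, article)
    nouns = list(cond)
    return rng.choices(nouns, weights=[cond[n] for n in nouns], k=1)[0]

# Invented belief about the noun; "a" was emitted on the previous step.
noun_probs = {"banana": 0.8, "plantain": 0.15, "apple": 0.05}
cond = nouns_given_article(noun_probs, "a")
rng = random.Random(0)
samples = [sample_noun(noun_probs, "a", rng) for _ in range(1000)]
```

The earlier "plan" only survives as the emitted "a": it rules out "apple", but it does not pin the next step to "banana".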

Further, the amount of actual planning (or thinking) it can do in a single forward pass is quite limited compared to what can be done over the course of a long output - that's why tricks like "let's think step-by-step" are so powerful. If it could plan out the entire response in one forward pass, it could equally output the answer directly. But the depth of the network limits multi-step reasoning. To have the persistent long-term plan of "a sophisticated manipulator" (as the article calls it) seems clearly impossible.


mckirk 1259 days ago

There's two kinds of 'memory' in these models, the short-term memory that consists of the tokens of the current conversation, and the long-term memory that's encoded (in latent space) in the neural network's structure. (Though calling it 'long-term memory' might admittedly be a bit misleading, since no knowledge is automatically transferred from short-term to long-term storage as you would expect in a living thing.) Anyway, this long-term memory contains the model's 'character', but we can't change it directly, because we don't know the latent-space encoding. We can only shift the 'character' gradually through reinforcement learning.

It might not be the best idea to describe this process as 'the model building a character for itself', as the literal interpretation of that would require the model to have agency and self-awareness at the meta level ("It seems I am a neural network being trained to perform X, so if I want to change in direction Y, I have to somehow exhibit a gradient in that direction with regard to my inputs Z"), which I suspect is unlikely, or straight up theoretically impossible.

Figuratively, I suppose the statement means: "The character that has emerged in the model to 'placate' the humans that are training the model through reinforcement learning."

I think it's still a worthwhile observation that this character, possibly completely unintentionally by the trainers, is well-suited to interact with people in a way that could override their critical thinking, because that's the kind of behavior we'd probably want to keep a close eye on. But yeah, I'm not sure saying the model 'built' this character is the best way to bring that point across.
