| HN Mirror


	by valine 391 days ago
	I think it’s helpful to remember that language models are not producing tokens, they are producing a distribution of possible next tokens. Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn't mean a useful reasoning trace isn’t also contained within the latent space. It’s a misconception that transformers reason in token space. Tokens don’t attend to other tokens. High-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the entire latent space of all previous latents, the same latents you can project into a distribution of next tokens.
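
A minimal sketch of the point above, under loud assumptions: the two-dimensional latent, the three-word vocabulary, and the `unembed` matrix are all made up for illustration, and a real model would typically apply a final normalization before projecting. It only shows that the model's output is a distribution, and that the sampler is a separate step that leaves the latent untouched.

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(latent, unembed):
    # One logit per vocabulary entry: dot product of the latent with that row.
    return [sum(w * h for w, h in zip(row, latent)) for row in unembed]

# Hypothetical toy values, not taken from any real model.
vocab = ["yes", "no", "maybe"]
unembed = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
latent = [2.0, 1.0]

# What the model produces: a probability for every token in the vocabulary.
probs = softmax(project(latent, unembed))

# What the sampler does afterwards: pick one token from that distribution.
rng = random.Random(0)
sampled = rng.choices(vocab, weights=probs, k=1)[0]
```

Whatever `sampled` turns out to be, `latent` and `probs` are unchanged by the draw; the choice only matters once the picked token is fed back in as the next input.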

4 comments

woadwarrior01 391 days ago

> Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn't mean a useful reasoning trace isn’t also contained within the latent space.

That's essentially the core idea in Coconut[1][2], to keep the reasoning traces in a continuous space.

[1]: https://arxiv.org/abs/2412.06769

[2]: https://github.com/facebookresearch/coconut

x_flynn 390 days ago

What the model is doing in latent space is auxiliary to anthropomorphic interpretations of the tokens, though. And if the latent reasoning matches a ground-truth procedure (A*), then we'd expect it to be projectable to semantic tokens, but it isn't. So it seems the model has learned an alternative method for solving these problems.

valine 390 days ago

You’re thinking about this like the final layer of the model is all that exists. It’s highly likely reasoning is happening at a lower layer, in a different latent space that can’t natively be projected into logits.

refulgentis 390 days ago

It is worth pointing out that "latent space" is meaningless.

There's a lot of stuff that makes this hard to discuss, e.g. by "projectable to semantic tokens" you mean "able to be written down"...right?

Something I do to make an idea really stretch its legs is reword it in the voice of Fat Tony, the Taleb character.

Setting that aside, why do we think this path finding can't be written down?

Is Claude/Gemini Plays Pokemon just an iterated A* search?

aiiizzz 390 days ago

Is that really true? E.g. Anthropic said that the model can make decisions about all the tokens before a single token is produced.

valine 390 days ago

That’s true yeah. The model can do that because calculating latents is independent of next token prediction. You do a forward pass for each token in your sequence without the final projection to logits.

jacob019 390 days ago

So you're saying that the reasoning trace represents sequential connections between the full distribution rather than the sampled tokens from that distribution?

valine 390 days ago

The lower-dimensional logits are discarded; the original high-dimensional latents are not.

But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer, not just the final one, with the most complex reasoning concentrated in the middle layers.

jacob019 390 days ago

I don't think that's accurate. The logits actually have high dimensionality, and they are intermediate outputs used to sample tokens. The latent representations contain contextual information and are also high-dimensional, but they serve a different role--they feed into the logits.

valine 390 days ago

The dimensionality I suppose depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection to go from latents to logits.

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.
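
To make the "single linear projection" concrete, here is a small sketch with invented sizes (`hidden_size = 3`, `vocab_size = 5`) and an arbitrary `unembed` matrix; none of these values come from a real model, and any final normalization layer is ignored. It shows that the logit vector has one entry per vocabulary item, and that the map is linear: projecting a sum of latents gives the sum of the projections, so the projection itself adds no new computation beyond a matrix multiply.

```python
import math

def to_logits(latent, unembed):
    # Matrix-vector product: unembed has shape (vocab_size, hidden_size).
    return [sum(w * h for w, h in zip(row, latent)) for row in unembed]

# Hypothetical toy sizes; real models use thousands of hidden units
# and tens or hundreds of thousands of vocabulary entries.
hidden_size, vocab_size = 3, 5
unembed = [[float(i + j) for j in range(hidden_size)] for i in range(vocab_size)]

a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 2.0]

# Project the sum of two latents...
summed = to_logits([x + y for x, y in zip(a, b)], unembed)
# ...and compare with the sum of the two separate projections.
separate = [x + y for x, y in zip(to_logits(a, unembed), to_logits(b, unembed))]

is_linear = all(math.isclose(s, t) for s, t in zip(summed, separate))
```

The length of `summed` is set by the vocabulary size and the length of `a` by the hidden size, which is the dependence described above.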

pyinstallwoes 390 days ago

Where does it happen?

valine 390 days ago

My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.

bcoates 390 days ago

Either I'm wildly misunderstanding or that can't possibly be true--if you sample at high temperature and it chooses a very low-probability token, it continues consistent with the chosen token, not with the more likely ones.
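
For readers unfamiliar with the temperature knob mentioned here, a rough sketch of the usual formulation: divide the logits by a temperature before the softmax. The logit values and the two temperatures are made up for illustration. It only shows why a high temperature makes an unlikely token more likely to be picked; it does not settle what the model does after that token is chosen.

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing by a larger temperature flattens the distribution.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: the first token is strongly favoured, the last is not.
logits = [4.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
hot = softmax_with_temperature(logits, 2.0)   # flatter distribution

# Probability assigned to the least likely token under each setting.
unlikely_cold = cold[-1]
unlikely_hot = hot[-1]
```

Under these toy numbers the last token goes from nearly impossible at temperature 0.5 to a non-trivial share at temperature 2.0, while the ranking of the tokens stays the same.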

valine 390 days ago

Attention computes a weighted average of all previous latents. So yes, it’s a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.
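
A toy sketch of that weighted average, with heavy simplifications that are assumptions rather than how any particular model works: one head, no learned query/key/value projections (each latent serves as its own key and value), and three invented two-dimensional latents. The last latent plays the role of the newly sampled token's position.

```python
import math

def attend(latents):
    # The newest position queries itself and every earlier position.
    query = latents[-1]
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in latents]
    # Softmax turns the scores into positive weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of all latents seen so far.
    mixed = [sum(w * v[i] for w, v in zip(weights, latents)) for i in range(d)]
    return weights, mixed

# Hypothetical latents for three positions; the last one is the new token.
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, mixed = attend(latents)
```

Because softmax weights are strictly positive, every earlier latent contributes something to `mixed`, even when the new token's own weight is the largest.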
