| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by valine 386 days ago
	The lower dimensional logits are discarded, the original high dimensional latents are not. But yeah, the LLM doesn’t even know the sampler exists. I used the last layer as an example, but it’s likely that reasoning traces exist in the latent space of every layer not just the final one, with the most complex reasoning concentrated in the middle layers.

2 comments

jacob019 386 days ago

I don't think that's accurate. The logits actually have high dimensionality, and they are intermediate outputs used to sample tokens. The latent representations contain contextual information and are also high-dimensional, but they serve a different role--they feed into the logits.

link

valine 386 days ago

The dimensionality I suppose depends on the vocab size and your hidden dimension size, but that’s not really relevant. It’s a single linear projection to go from latents to logits.

Reasoning is definitely not happening in the linear projection to logits if that’s what you mean.

link

pyinstallwoes 386 days ago

Where does it happen ?

link

valine 385 days ago

My personal theory is that it’s an emergent property of many attention heads working together. If each attention head is a bird, reasoning would be the movement of the flock.

link

bcoates 386 days ago

Either I'm wildly misunderstanding or that can't possibly be true--if you sample at high temperature and it chooses a very-low probability token, it continues consistent with the chosen token, not with the more likely ones

link

valine 386 days ago

Attention computes a weighted average of all previous latents. So yes, it’s a new token as input to the forward pass, but after it feeds through an attention head it contains a little bit of every previous latent.

link