Hacker News new | ask | show | jobs
by tjbai 558 days ago
The last hidden state is just the output embedding after N residual layers, e.g. input embedding + res1 + res2 + ...

There's typically an "unembedding layer"/"classification head" that uses this hidden state to produce a softmax distribution over the LLM's vocabulary. In this case, we can think of this as "snapping" the hidden state into a single token and feeding that token into the next position of the autoregressive LLM.

In this sense, the last hidden state _does_ augment the next input. The authors simply propose directly feeding this hidden state into the next step rather than reducing it into a single token—thus, reasoning in continuous latent space rather than discrete token space.

2 comments

Moreover “snapping” the hidden state to a token is akin to quantization. It’s lossy. By staying in latent space the model can “reason” at “full resolution” without discretization noise.
Sometimes discretization introduces interesting behavior though. Compare for example the logistic map and it's chaotic regime with the simplicity of the logistic ODE. Another example would be quantum mechanics compared to classical mechanics and determinism. The Poincare Conjecture was only interesting for n=3 due to too much connectivity in higher dimensions. Wouldn't it be interesting if consciousness only arose in such a discretized form, a case of incidental complexity and chaos introduced as the result of topological non-triviality from quantization?

Don't forget, non-linearity is fundamental to the whole process, otherwise you'd just have one large linear transformation. Maybe there's a similar role for discretization? :shrug:

Useful information about conceptual relationships and procedure can be captured in the LM head, so there is also potential lossiness when short-circuiting it.
Wow this was the explanation that made it all click for me. Thanks so much!