by tjbai
558 days ago
The last hidden state is just the output embedding after N residual layers, i.e. input embedding + res1 + res2 + ... There's typically an "unembedding layer"/"classification head" that projects this hidden state to logits, which a softmax turns into a distribution over the LLM's vocabulary. In standard decoding, we can think of picking a token from that distribution as "snapping" the hidden state to a single token and feeding that token's embedding into the next position of the autoregressive LLM. In this sense, the last hidden state _does_ already influence the next input, just through a discrete bottleneck. The authors simply propose feeding this hidden state directly into the next step rather than reducing it to a single token, thus reasoning in continuous latent space rather than discrete token space.
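
To make the contrast concrete, here's a rough toy sketch in plain Python. Everything in it is made up for illustration: `EMBED`, `toy_hidden_state`, the tied unembedding, and greedy argmax decoding are my assumptions, not the paper's code, and the "model" ignores attention over earlier positions entirely. It only shows what gets fed into the next step in each case.

```python
import math
import random

random.seed(0)
VOCAB, DIM = 5, 4

# Hypothetical toy embedding table: one DIM-sized vector per vocabulary token.
EMBED = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]


def toy_hidden_state(x):
    """Stand-in for the transformer stack: the input plus one residual update."""
    return [xi + math.tanh(xi) for xi in x]


def unembed(h):
    """Assumed tied unembedding: dot h with every token embedding, then softmax."""
    logits = [sum(hi * ei for hi, ei in zip(h, row)) for row in EMBED]
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_input_discrete(h):
    """Standard decoding: snap h to one token (greedy here), feed its embedding."""
    probs = unembed(h)
    token = max(range(VOCAB), key=probs.__getitem__)
    return EMBED[token], token


def next_input_continuous(h):
    """Latent-space variant: feed the hidden state itself as the next input."""
    return h


h = toy_hidden_state(EMBED[2])
discrete_input, token = next_input_discrete(h)
continuous_input = next_input_continuous(h)
```

In the discrete path the next input is always one of the `VOCAB` rows of `EMBED`, whatever `h` was; in the continuous path it is `h` itself, so nothing is thrown away between steps.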