by valine
391 days ago
I think it's helpful to remember that language models are not producing tokens; they are producing a distribution over possible next tokens. Just because your sampler picks a sequence of tokens that contains incorrect reasoning doesn't mean a useful reasoning trace isn't also contained within the latent space. It's a misconception that transformers reason in token space. Tokens don't attend to other tokens. High-dimensional latents attend to other high-dimensional latents. The final layer of a decoder-only transformer has full access to the entire latent space of all previous latents, the same latents you can project into a distribution of next tokens.
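To make the distribution-versus-sampler point concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: the three-word vocabulary, the 2-dimensional latent `hidden`, and the `UNEMBED` matrix are assumptions, not values from any real model. It only shows that the model's output is a distribution and that the token you see is a separate choice made by the sampler.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_distribution(hidden, unembed):
    # Project a latent vector onto the vocabulary (one row per token),
    # then normalize the logits into a probability distribution.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

# Hypothetical toy values, chosen only for illustration.
VOCAB = ["yes", "no", "maybe"]
UNEMBED = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.5]]
hidden = [2.0, 1.0]  # stand-in for the final-layer latent at one position

# What the model produces: a distribution over next tokens.
probs = project_to_distribution(hidden, UNEMBED)

# What the sampler does with it: two different policies over the same distribution.
greedy = VOCAB[probs.index(max(probs))]
rng = random.Random(0)
sampled = rng.choices(VOCAB, weights=probs, k=5)

print(dict(zip(VOCAB, probs)))
print("greedy:", greedy)
print("sampled:", sampled)
```

The latent `hidden` is identical in both cases; only the sampling policy decides which token ends up in the visible text.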
That's essentially the core idea in Coconut [1][2]: keep the reasoning traces in a continuous space.
[1]: https://arxiv.org/abs/2412.06769
[2]: https://github.com/facebookresearch/coconut
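A rough toy sketch of the difference, not Coconut's actual code: `step` is a hypothetical stand-in for one forward pass, and `EMBED` is a made-up two-token embedding table that doubles as the unembedding. The token-space loop collapses each latent to a discrete token and re-embeds it; the continuous loop feeds the latent straight back in as the next input.

```python
import math

# Hypothetical toy embedding table for a 2-token vocabulary (also used to score tokens).
EMBED = [[1.0, 0.0], [0.0, 1.0]]

def step(state, x):
    # Stand-in for one forward pass: mixes the running state with the input vector.
    return [math.tanh(0.5 * s + 0.5 * xi) for s, xi in zip(state, x)]

def nearest_token(hidden):
    # Pick the token whose embedding has the highest dot product with the latent.
    scores = [sum(h * e for h, e in zip(hidden, row)) for row in EMBED]
    return scores.index(max(scores))

def token_space_rollout(x0, n_steps):
    # Each step: latent -> discrete token -> embedding of that token as next input.
    state, x = [0.0, 0.0], x0
    for _ in range(n_steps):
        state = step(state, x)
        x = EMBED[nearest_token(state)]
    return state

def continuous_rollout(x0, n_steps):
    # Each step: the latent itself is the next input; nothing is discretized.
    state, x = [0.0, 0.0], x0
    for _ in range(n_steps):
        state = step(state, x)
        x = state
    return state

x0 = [0.6, 0.4]
tok = token_space_rollout(x0, 3)
cont = continuous_rollout(x0, 3)
print("token-space:", tok)
print("continuous: ", cont)
```

In the token-space loop, whatever the latent carried about the non-selected token is discarded at every step, which is the information the continuous loop keeps.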