by reliablereason
2 hours ago
Is the thinking even done in real tokens? I thought it was done purely in the residual stream. That is, instead of collapsing the residual stream to a token, you treat the final layer's output as a vector of size d_model and use that as the input for the next position in the transformer. If that is the case, the thinking is not visible to us as users, because it is not done in text.
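To make the distinction concrete, here is a rough toy sketch of the two feedback loops. Everything in it is made up for illustration: `toy_model`, `unembed`, `token_step`, `latent_step`, the random weights, and the tiny sizes are all hypothetical, and the "model" is a single tanh layer that only sees one vector, whereas a real transformer attends over all previous positions. It only shows where the collapse to a token happens, not how any real system does it.

```python
import math
import random

D_MODEL = 4   # hypothetical hidden size
VOCAB = 5     # hypothetical vocabulary size

random.seed(0)
# Hypothetical toy weights: an embedding table and one dense layer.
EMBED = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
W = [[random.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(D_MODEL)]


def toy_model(x):
    """Stand-in for the whole layer stack: input vector -> final hidden state."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(D_MODEL)))
            for i in range(D_MODEL)]


def unembed(h):
    """Collapse a hidden state to one token id (greedy argmax, tied embeddings)."""
    logits = [sum(e[j] * h[j] for j in range(D_MODEL)) for e in EMBED]
    return max(range(VOCAB), key=lambda t: logits[t])


def token_step(x):
    """Ordinary decoding: collapse to a token, then re-embed that token."""
    tok = unembed(toy_model(x))
    return EMBED[tok], tok


def latent_step(x):
    """Latent decoding: feed the d_model vector straight back, no token emitted."""
    return toy_model(x), None


def run(step, start_token, n_steps):
    """Apply `step` repeatedly; return the final vector and the visible tokens."""
    x = EMBED[start_token]
    visible = []
    for _ in range(n_steps):
        x, tok = step(x)
        visible.append(tok)
    return x, visible


token_state, token_trace = run(token_step, start_token=0, n_steps=3)
latent_state, latent_trace = run(latent_step, start_token=0, n_steps=3)
```

In the token loop, `token_trace` holds a readable token id for every step. In the latent loop, `latent_trace` is all `None`: the intermediate state exists only as vectors, which is why there would be nothing textual to show a user.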
EDIT:
They link to a Meta paper from 2024/2025 though: https://arxiv.org/pdf/2412.06769/.