|
|
|
|
|
by wlib
455 days ago
|
|
The filler tokens actually do make them think more. Even just allowing the models to output "." until they are confident enough to output something increases their performance. Of course, training the model to do this (use pause tokens) on purpose works too: https://arxiv.org/pdf/2310.02226 |
|
If however there were a way to keep passing hidden states to future autoregressive steps and not just the final tokens from the previous step, that might give the model true "thinking" time.