by moconnor
340 days ago
A very long way of saying "during pretraining, let the model think before continuing next-token prediction, and then let the gradients from those losses flow to the thinking tokens too." It seems like an interesting idea. You could apply a small regularisation penalty to the number of thinking tokens the model uses. You might have to break up the pretraining data into meaningfully partitioned chunks. I'd be curious whether, at large enough scale, models learn to make use of this thinking budget to improve their next-token prediction, and what that looks like.
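
To make the regularisation idea concrete, here is a rough sketch of what such an objective might look like: the usual mean next-token negative log-likelihood plus a small cost per thinking token. The function name `penalized_loss`, the `penalty` coefficient, and its default value are all hypothetical, chosen for illustration and not taken from any paper or library.

```python
import math

def penalized_loss(next_token_logprobs, num_thinking_tokens, penalty=0.01):
    """Mean next-token NLL plus a linear cost on thinking tokens.

    next_token_logprobs: log-probabilities the model assigned to the true
        next tokens of one chunk (after it finished thinking).
    num_thinking_tokens: how many thinking tokens were emitted for the chunk.
    penalty: hypothetical regularisation weight per thinking token.
    """
    nll = -sum(next_token_logprobs) / len(next_token_logprobs)
    return nll + penalty * num_thinking_tokens

# Thinking only pays off if it lowers the NLL by more than it costs.
no_thinking = penalized_loss([math.log(0.5), math.log(0.5)], 0)
with_thinking = penalized_loss([math.log(0.6), math.log(0.6)], 10)
print(no_thinking, with_thinking)
```

One caveat on this sketch: the token count is discrete, so the penalty term has no gradient of its own. An actual training setup would presumably need a policy-gradient style estimator or some differentiable surrogate for it, which is my assumption and not something spelled out above.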