by geraneum
410 days ago
> The model definitely iteratively built up (useful and correct even) text that wasn't directly in the training data

The text is highly likely in the training data, since it's textbook arithmetic instructions. It's the number that is probably not there. Simple arithmetic is one of the verifiable operation types (truths) with a straightforward reward function, which is why it's used to train CoT models. What's interesting to me in your example is that improving LLM inference with RL can result in such wonderful outcomes, but that's perhaps a different question.
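To make "straightforward reward function" concrete, here is a minimal sketch of what a verifiable reward for arithmetic could look like. The function name `arithmetic_reward` and the "last integer in the output is the answer" convention are my own assumptions for illustration, not how any particular lab actually does it.

```python
import re


def arithmetic_reward(completion: str, expected: int) -> float:
    """Hypothetical binary reward: 1.0 if the final integer in the
    model's output equals the known-correct answer, otherwise 0.0."""
    # Assumption: drop thousands separators, then treat the last integer
    # that appears in the text as the model's final answer.
    numbers = re.findall(r"-?\d+", completion.replace(",", ""))
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == expected else 0.0


# The ground truth is computed directly, so no human labeling is needed.
print(arithmetic_reward("17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391", 17 * 23))
print(arithmetic_reward("The answer is 390", 17 * 23))
```

The point is only that the check is cheap and unambiguous: the intermediate text can be anything, and the reward depends solely on whether the final number matches.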