by hodgehog11
305 days ago
True, but the experiments are engineered to give the results the authors want. It's a mathematical certainty that performance will drop off here, but that is not an accurate assessment of what is going on at scale. If you present an appropriately large and well-trained model with in-context patterns, it often does a decent job, even when it wasn't trained on them. Once the model is nerfed to 4 layers, the conclusion is foregone. I honestly wish this paper actually showed what it claims, since understanding CoT reasoning relative to the underlying training set is a significant open problem.