by Imnimo
468 days ago
>To speed up our experiments, we omitted the Kullback–Leibler (KL) divergence penalty, although our training recipe supports it for interested readers.

I am very curious whether omitting the KL penalty helps on narrow domains like this, and also whether doing so results in illegible reasoning. (From the samples in the post, it looks like it doesn't make reasoning illegible?)

>the 32B model’s response lengths collapsing, especially after reaching peak performance.

I would not have predicted this. Nor that it could collapse its response length to near zero yet lose only a few percentage points of accuracy. If you do SFT to get a model of the same size to solve these puzzles with no reasoning (just output answers directly), how well can it do?
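For anyone unsure what omitting the KL penalty means in practice, here is a minimal sketch of one common formulation, where a per-token log-probability ratio against a frozen reference model is subtracted from the reward. This is an assumption for illustration, not necessarily the recipe used in the post; the function name `kl_penalized_rewards` and the `beta` parameter are hypothetical.

```python
def kl_penalized_rewards(task_rewards, policy_logprobs, ref_logprobs, beta=0.0):
    """Shape per-token rewards with a simple KL penalty estimate.

    Hypothetical helper for illustration only.
    beta = 0.0 corresponds to dropping the KL penalty entirely.
    """
    if not (len(task_rewards) == len(policy_logprobs) == len(ref_logprobs)):
        raise ValueError("all inputs must have one entry per token")

    shaped = []
    for reward, logp_policy, logp_ref in zip(task_rewards, policy_logprobs, ref_logprobs):
        # Log-ratio of policy to reference on the sampled token:
        # a common single-sample estimate of the KL divergence.
        kl_estimate = logp_policy - logp_ref
        shaped.append(reward - beta * kl_estimate)
    return shaped


# Toy numbers (made up): two tokens, task reward only on the last one.
rewards = [0.0, 1.0]
policy_lp = [-1.0, -0.5]
ref_lp = [-1.5, -0.5]

no_penalty = kl_penalized_rewards(rewards, policy_lp, ref_lp, beta=0.0)
with_penalty = kl_penalized_rewards(rewards, policy_lp, ref_lp, beta=0.1)
```

With `beta=0.0` the rewards pass through unchanged, so nothing pulls the policy back toward the reference model. With `beta > 0`, tokens the policy favors more than the reference does are penalized.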
|
As for response length, I think the model internalizes the logic and stops working through its answers by generating intermediate reasoning in context. I don't think this is necessarily good for general reasoning, but for a specific task it would cut down inference costs. It just depends on what you're optimizing for. To encourage more general reasoning, I think broader training and validation sets would be helpful.
[1] https://arxiv.org/html/2501.03262v1