Please define an acronym the first time you use it in the body text. I had to scroll about 20% of the way through your article just to understand the title.
“We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.”
In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?
No meaningful changes to the hyperparameters; we just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.
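For anyone who wants to picture that setup, here is a minimal sketch of "same first 16 tasks every iteration". The function name `tasks_for_iteration` and the placeholder puzzle list are hypothetical and are not taken from the actual training code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def tasks_for_iteration(
    train_tasks: Sequence[T],
    iteration: int,
    tasks_per_iteration: int = 16,
) -> List[T]:
    """Return the leading slice of the training set for any iteration.

    Hypothetical helper: `iteration` is deliberately ignored, so every
    iteration trains on the identical first `tasks_per_iteration` tasks.
    """
    return list(train_tasks[:tasks_per_iteration])


# Placeholder task identifiers standing in for real puzzles.
puzzles = [f"puzzle-{i}" for i in range(100)]
batch_0 = tasks_for_iteration(puzzles, iteration=0)
batch_7 = tasks_for_iteration(puzzles, iteration=7)
```

The point of the sketch is only that the subset is fixed rather than resampled, so the model sees no new puzzles as training progresses.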
We only tested this with the 14B model. You can see the run here:
Hi, other author here. I think the models converged on shallow/greedy strategies that improved performance up to a point, but are ultimately shortsighted, especially for harder puzzles.
Something interesting I noticed in the responses was that for shorter puzzles it would make deductions, building up a set of additional "clues" for itself, before answering the question. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to directly answer the questions.
Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
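A rough sketch of what such a curriculum could look like, assuming the number of clues is a usable proxy for difficulty (that is an assumption, and `Puzzle`, `curriculum_pool`, and `min_fraction` are hypothetical names, not part of the original experiment):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Puzzle:
    name: str
    num_clues: int  # assumed difficulty proxy: more clues = harder


def curriculum_pool(
    puzzles: List[Puzzle],
    iteration: int,
    total_iterations: int,
    min_fraction: float = 0.25,
) -> List[Puzzle]:
    """Return the puzzles eligible for sampling at this iteration.

    Starts with the easiest `min_fraction` of puzzles and grows the pool
    linearly until every puzzle is eligible at the final iteration.
    """
    ordered = sorted(puzzles, key=lambda p: p.num_clues)
    progress = min(1.0, max(0.0, iteration / max(1, total_iterations - 1)))
    fraction = min_fraction + (1.0 - min_fraction) * progress
    cutoff = max(1, round(fraction * len(ordered)))
    return ordered[:cutoff]


puzzles = [Puzzle(f"p{i}", num_clues=i + 3) for i in range(8)]
early = curriculum_pool(puzzles, iteration=0, total_iterations=5)
late = curriculum_pool(puzzles, iteration=4, total_iterations=5)
```

A linear schedule is just the simplest choice here; gating the pool on the model's recent accuracy would be another option.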
Other ideas to explore include:
- Distilling responses from stronger models
- Encouraging exploration with entropy regularization or reward shaping
- Training from base models instead of instruct models, like DeepSeek-R1-Zero
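To make the entropy regularization idea concrete, here is a toy sketch in plain Python. It assumes the usual formulation where an entropy bonus is subtracted from the policy loss; the names `regularized_loss` and `entropy_coef` and the value 0.01 are illustrative assumptions, not settings from the actual runs.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(
    policy_loss: float,
    logits: Sequence[float],
    entropy_coef: float = 0.01,  # illustrative value, not a tuned setting
) -> float:
    """Subtract an entropy bonus so more exploratory policies score lower loss."""
    return policy_loss - entropy_coef * entropy(softmax(logits))


peaked = [10.0, 0.0, 0.0, 0.0]   # nearly deterministic token distribution
uniform = [0.0, 0.0, 0.0, 0.0]   # maximally uncertain token distribution
```

With the same base loss, the uniform distribution ends up with the lower regularized loss, which is the pressure against collapsing onto a greedy strategy too early.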
As for why they dropped suddenly, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops.
It's as if its fate had already been sealed many iterations earlier.
Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.
To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of, say, 30 logic puzzles and cross-trained against all of them simultaneously, it might, though.
I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).
Well, in this case there is a much more straightforward method with the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.