by kcorbitt 473 days ago
One of the authors here. Happy to answer any questions about our methods/results!
5 comments

Please define an acronym the first time you use it in the body text. I had to scroll about 20% of the way through your article just to understand the title.
We updated the first paragraph to define the acronym. Thanks again for the feedback!
Great point! Thanks for the feedback.
Can you elaborate on this point:

“We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.”

In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?

No meaningful changes to the hyperparameters, just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.
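A minimal sketch of what "the same first 16 training tasks each iteration" could look like. The function name, the `tasks_per_iteration` parameter name, and the task ids are hypothetical and not taken from the actual training code:

```python
from itertools import islice


def tasks_for_iteration(train_tasks, tasks_per_iteration=16):
    """Return the same leading slice of the training set on every call.

    No shuffling or sampling: every iteration sees an identical batch.
    """
    return list(islice(train_tasks, tasks_per_iteration))


# Hypothetical task ids, standing in for the real puzzle set.
train_tasks = [f"puzzle-{i}" for i in range(100)]

for iteration in range(3):
    batch = tasks_for_iteration(train_tasks)
    # A real loop would generate rollouts and update the policy here.
    print(iteration, batch[0], batch[-1], len(batch))
```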

We only tested this with the 14B model. You can see the run here:

https://wandb.ai/bradhilton/rl-experiments/runs/062

Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but still a significant increase on very few samples.

Thanks.
Any hypotheses on why the performance dropped suddenly while training?
Hi, other author here. I think the models converged on shallow/greedy strategies that improved performance up to a point, but are ultimately shortsighted, especially for harder puzzles.

Something interesting I noticed in the responses was that for shorter puzzles it would make deductions, building up a set of additional "clues" for itself, before answering the question. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to directly answer the questions.

Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
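To make the curriculum idea concrete, here is a rough sketch that orders puzzles from easy to hard and splits them into stages. Using the number of clues as the difficulty proxy is an assumption, and the `num_clues` field and `curriculum_stages` helper are hypothetical names:

```python
def curriculum_stages(puzzles, num_stages=3):
    """Split puzzles into stages of increasing difficulty, easiest first.

    Difficulty is approximated by the clue count (an assumption).
    """
    if not puzzles:
        return []
    ordered = sorted(puzzles, key=lambda p: p["num_clues"])
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Hypothetical puzzles with made-up clue counts.
puzzles = [{"id": i, "num_clues": n} for i, n in enumerate([12, 4, 9, 6, 15, 5])]
stages = curriculum_stages(puzzles)

for stage_index, stage in enumerate(stages):
    # A real run would train on each stage before moving to the next.
    print(stage_index, [p["id"] for p in stage])
```

Training would then move through the stages in order, possibly mixing earlier puzzles back in so the easy cases are not forgotten.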

Other ideas to explore include:

- Distilling responses from stronger models
- Encouraging exploration with entropy regularization or reward shaping
- Training from base models instead of instruct models, like DeepSeek-R1-Zero
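For the entropy regularization idea, one common form is to add a small bonus proportional to the policy's average per-token entropy to the quantity being maximized. The sketch below is plain Python for illustration only. The coefficient `beta`, the function names, and the toy distributions are assumptions, not anything from the actual training setup:

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_objective(reward_term, token_distributions, beta=0.01):
    """Reward term plus beta times the mean per-token entropy.

    A larger beta rewards the policy for staying less certain,
    which is meant to encourage exploration.
    """
    mean_entropy = sum(entropy(d) for d in token_distributions) / len(token_distributions)
    return reward_term + beta * mean_entropy


# Toy example: one confident step and one uncertain step.
distributions = [[0.98, 0.01, 0.01], [0.4, 0.3, 0.3]]
print(regularized_objective(1.0, distributions, beta=0.01))
```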

Is my understanding here correct? Could this be the reason?

https://news.ycombinator.com/item?id=43287312

As for why they dropped suddenly, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops. It's as if its fate had already been sealed many iterations ago.
Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.
Do you have any other logic puzzles you could use to see if the performance generalises?
To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of like 30 logic puzzles and cross-trained against all of them simultaneously it might though.

I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).

The question to me is whether you can call that deduction in that case. Isn't it just a type of pattern matching that fits this particular task?
Once the problem gets narrow enough, do you risk training a model that reinvents a straightforward classic algorithm at far higher cost?
Well, in this case there is a much more straightforward method with the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.
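To illustrate how cheap the classic approach is on a small instance, here is a toy deduction puzzle solved by exhaustive search. This is not CP-SAT and not the puzzle generator from the article. It is a standard-library stand-in, and the people, rooms, and clues are made up for the example:

```python
from itertools import permutations

PEOPLE = ("Alice", "Bob", "Carol")
ROOMS = ("kitchen", "hall", "study")

# Each clue is a predicate over a person -> room assignment (all hypothetical).
CLUES = [
    lambda where: where["Alice"] != "kitchen",
    lambda where: where["Carol"] != "kitchen",
    lambda where: where["Carol"] != "study",
]


def solve():
    """Try every one-person-per-room assignment and keep those satisfying all clues."""
    solutions = []
    for rooms in permutations(ROOMS):
        where = dict(zip(PEOPLE, rooms))
        if all(clue(where) for clue in CLUES):
            solutions.append(where)
    return solutions


print(solve())
```

Brute force only works because the search space here is tiny. A constraint solver such as CP-SAT prunes the search and scales to much larger puzzles.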