


	by kcorbitt 473 days ago
	One of the authors here. Happy to answer any questions about our methods/results!

5 comments

malcolmgreaves 473 days ago

Please define an acronym the first time you use it in the body text. I had to scroll about 20% of the way through your article just to understand the title.


bradhilton 473 days ago

We updated the first paragraph to define the acronym. Thanks again for the feedback!


bradhilton 473 days ago

Great point! Thanks for the feedback.


pama 473 days ago

Can you elaborate on this point:

“We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.”

In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?


bradhilton 473 days ago

No meaningful changes to the hyperparameters, just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.

We only tested this with the 14B model. You can see the run here:

https://wandb.ai/bradhilton/rl-experiments/runs/062

Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but that is still a significant increase from very few samples.


pama 473 days ago

Thanks.


bydgjohc 473 days ago

Any hypotheses on why the performance dropped suddenly while training?


bradhilton 473 days ago

Hi, other author here. I think the models converged on shallow/greedy strategies that improved performance up to a point, but are ultimately shortsighted, especially for harder puzzles.

Something interesting I noticed in the responses was that for shorter puzzles it would make deductions, building up a set of additional "clues" for itself, before answering the question. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to directly answer the questions.

Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
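To make the curriculum idea concrete, here is a minimal sketch, not something we ran. It assumes the number of clues is a usable difficulty proxy; `Puzzle`, `num_clues`, and `curriculum_pools` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Puzzle:
    name: str
    num_clues: int  # hypothetical difficulty proxy: more clues = harder


def curriculum_pools(puzzles, num_stages):
    """Yield one training pool per stage, easiest puzzles first.

    Each later stage keeps the earlier puzzles and adds the next-hardest slice.
    """
    ordered = sorted(puzzles, key=lambda p: p.num_clues)
    for stage in range(1, num_stages + 1):
        cutoff = max(1, len(ordered) * stage // num_stages)
        yield ordered[:cutoff]


puzzles = [Puzzle("a", 12), Puzzle("b", 4), Puzzle("c", 8), Puzzle("d", 20)]
stages = [[p.name for p in pool] for pool in curriculum_pools(puzzles, 2)]
```

Keeping the easy puzzles in later pools is one design choice among several; dropping them as training progresses would be an equally plausible variant.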

Other ideas to explore include:

- Distilling responses from stronger models
- Encouraging exploration with entropy regularization or reward shaping
- Training from base models instead of instruct models, like DeepSeek-R1-Zero
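For the entropy regularization idea, a rough sketch of what the bonus looks like for a single categorical decision. This is a simplified toy, not our training loss: a real setup would apply it per token over the model's vocabulary, and `entropy_coef` and the function names are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_loss(logits, action, advantage, entropy_coef=0.01):
    """Policy-gradient loss for one sampled action, minus an entropy bonus.

    Subtracting the bonus lowers the loss for higher-entropy (more
    exploratory) policies.
    """
    probs = softmax(logits)
    pg_loss = -advantage * math.log(probs[action])
    return pg_loss - entropy_coef * entropy(probs)
```

With `entropy_coef=0` this reduces to the plain policy-gradient term, so the coefficient directly controls how strongly exploration is encouraged.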


kiratp 473 days ago

Is my understanding here correct? Could this be the reason?

https://news.ycombinator.com/item?id=43287312


bradhilton 473 days ago

As for why they dropped suddenly, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops. It's as if its fate has already been sealed many iterations ago.


mdp2021 473 days ago

Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.


snovv_crash 473 days ago

Do you have any other logic puzzles you could use to see if the performance generalises?


kcorbitt 473 days ago

To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of like 30 logic puzzles and cross-trained against all of them simultaneously it might though.

I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).


shinryuu 472 days ago

The question to me is whether you can call that deduction in that case. Isn't it just a type of pattern matching that fits this particular task?


ekidd 473 days ago

Once the problem gets narrow enough, do you risk training a model that reinvents a straightforward classic algorithm at far higher cost?


bradhilton 473 days ago

Well, in this case there is a much more straightforward method with the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.
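To illustrate how cheap the classic approach is, here is a toy sketch. It is not the CP-SAT setup: it swaps in plain brute-force enumeration from the standard library, and the suspects, rooms, and clues are invented for the example.

```python
from itertools import permutations

# Hypothetical toy puzzle: each suspect was in a different room.
SUSPECTS = ["Ann", "Bob", "Cy"]
ROOMS = ["hall", "study", "kitchen"]

# Each clue is a predicate over an assignment {suspect: room}.
CLUES = [
    lambda a: a["Ann"] != "hall",
    lambda a: a["Bob"] == "kitchen" or a["Cy"] == "kitchen",
    lambda a: a["Cy"] != "kitchen",
]


def solve(suspects, rooms, clues):
    """Return every one-to-one suspect/room assignment that satisfies all clues."""
    solutions = []
    for perm in permutations(rooms):
        assignment = dict(zip(suspects, perm))
        if all(clue(assignment) for clue in clues):
            solutions.append(assignment)
    return solutions


solutions = solve(SUSPECTS, ROOMS, CLUES)
```

Enumeration only works at toy sizes; a constraint solver does the same job with pruning and propagation, which is why it scales to puzzles this loop could not handle.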
