To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of, say, 30 logic puzzles and cross-trained against all of them simultaneously, it might.
I think there's a lot of benefit to discovering a training regime that lets small specialized models do extremely well at one narrow task. If we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks but more expensive to run for each of them.
Well, in this case there is a much more straightforward method: the same CP-SAT solver used to create the puzzles. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.