by gwern
1062 days ago
> I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.

Yes, and maybe the occasional generation of the complete boardstate to date, because you don't want to leave the boardstate implicit and require it to be reconstructed within each forward pass - that's 'using up serial computations' that a Transformer can't afford. But if you periodically serialize the best-answer-to-date, you are more likely to be able to bite off a chewable chunk.

> My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku

A Transformer is not much different from an unrolled RNN without weight-sharing, so for any specific sudoku size, there should be some depth which does allow the worst-case amount of backtracking or other solution to the problem. (One way to show this would be to use the RASP programming language to program such a solver.) It's just it'd probably be bigger/deeper than you have available now.
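As a rough illustration of what such a trace might look like (a minimal sketch, not anything taken from an actual model or training setup): the hypothetical `solve_with_trace` below runs an ordinary backtracking solver, logs every fill and undo, and every `dump_every` steps writes out the full board so the state never has to be rebuilt from the step log alone. The function names, the log format, and the 4x4 puzzle are all assumptions made up for this example.

```python
from typing import List, Tuple

Board = List[List[int]]  # 0 marks an empty cell


def serialize(board: Board) -> str:
    """Flatten the board to one line, rows separated by '/' (assumes values <= 9)."""
    return "/".join("".join(str(v) if v else "." for v in row) for row in board)


def candidates(board: Board, r: int, c: int, box: int) -> List[int]:
    """Values not already used in the row, column, or box of cell (r, c)."""
    n = len(board)
    used = set(board[r]) | {board[i][c] for i in range(n)}
    br, bc = r - r % box, c - c % box
    used |= {board[i][j] for i in range(br, br + box) for j in range(bc, bc + box)}
    return [v for v in range(1, n + 1) if v not in used]


def solve_with_trace(board: Board, box: int, dump_every: int = 4) -> Tuple[bool, List[str]]:
    """Backtracking solve (in place) that returns a CoT-style trace.

    Hypothetical format: FILL/UNDO lines for each step, plus a BOARD line
    with the complete state after every `dump_every` steps.
    """
    n = len(board)
    trace: List[str] = []
    steps = 0

    def log(event: str) -> None:
        nonlocal steps
        trace.append(event)
        steps += 1
        if steps % dump_every == 0:
            # Periodically serialize the best-answer-to-date.
            trace.append("BOARD " + serialize(board))

    def search() -> bool:
        empty = next(
            ((r, c) for r in range(n) for c in range(n) if board[r][c] == 0), None
        )
        if empty is None:
            return True  # no empty cells left
        r, c = empty
        for v in candidates(board, r, c, box):
            board[r][c] = v
            log(f"FILL r{r}c{c}={v}")
            if search():
                return True
            board[r][c] = 0  # dead end: undo and try the next value
            log(f"UNDO r{r}c{c}")
        return False

    solved = search()
    trace.append(("SOLVED " if solved else "UNSAT ") + serialize(board))
    return solved, trace


if __name__ == "__main__":
    puzzle = [
        [1, 0, 0, 4],
        [0, 4, 0, 0],
        [0, 0, 4, 0],
        [4, 0, 0, 1],
    ]
    ok, trace = solve_with_trace(puzzle, box=2)
    print("\n".join(trace))
```

The `dump_every` knob is the trade-off being discussed: a smaller value spends more tokens on board dumps but leaves less state to be tracked implicitly between them.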
I was assuming that, given a trained Transformer, you wouldn't know how many effective "steps of computation" it contained, and so you would probably have to resort to CoT.