| HN Mirror


by shawntan 1063 days ago

I recommend reading the theoretical work on the computational capabilities of Transformers: https://twitter.com/lambdaviking/status/1630581475425828864 References to other work can probably be found in that article.

Shameless plug for my own blog post about this: https://blog.wtf.sg/posts/2023-02-03-the-new-xor-problem/

TL;DR: The theoretical class of problems that Transformers can solve (without Chain-of-Thought style responses) is fairly limited. Generally, universal approximation proofs rely on infinite precision assumptions, which are not practical in reality. Empirical results also show very limited capabilities when tested on certain formal languages.

In the Sudoku case, the problem-length is limited, so one could conceptually make a large enough model that could memorise all solutions to all possible combinations of permissible sudoku boards, which could then just access and read out the solutions.
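As a rough illustration of the "memorise and read out" idea, here is a minimal sketch on a 4x4 mini-Sudoku, which is small enough to enumerate completely. The 4x4 size, the names `all_grids`, `TABLE` and `read_out`, and the 0-for-blank convention are assumptions made for this example; a 9x9 table would be far too large to build this way.

```python
N = 4    # assumption: 4x4 mini-Sudoku so that full enumeration is feasible
BOX = 2  # boxes are 2x2

def ok(grid, r, c, d):
    """True if digit d can go at (r, c) without repeating in its row, column or box."""
    for i in range(N):
        if grid[r][i] == d or grid[i][c] == d:
            return False
    br, bc = r - r % BOX, c - c % BOX
    for i in range(BOX):
        for j in range(BOX):
            if grid[br + i][bc + j] == d:
                return False
    return True

def all_grids():
    """Enumerate every completed grid by exhaustive search."""
    grid = [[0] * N for _ in range(N)]
    out = []

    def fill(pos):
        if pos == N * N:
            out.append(tuple(tuple(row) for row in grid))
            return
        r, c = divmod(pos, N)
        for d in range(1, N + 1):
            if ok(grid, r, c, d):
                grid[r][c] = d
                fill(pos + 1)
                grid[r][c] = 0

    fill(0)
    return out

TABLE = all_grids()  # stands in for the "memorised" solutions

def read_out(puzzle):
    """Return every memorised grid that agrees with the clues (0 = blank)."""
    return [g for g in TABLE
            if all(puzzle[r][c] in (0, g[r][c])
                   for r in range(N) for c in range(N))]
```

Here "solving" involves no search at query time: `read_out` only filters a precomputed table, which is the sense in which a large enough model could in principle store the answers.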

1 comments

gwern 1063 days ago

> The theoretical class of problems that Transformers can solve (without Chain-of-Thought style responses) is fairly limited.

Which is irrelevant because how would a Transformer emit a complete Sudoku solution in a single forward-pass/token in the first place?


shawntan 1063 days ago

I suppose you mean in order to give the answer to a Sudoku puzzle, you'd need a string of tokens anyway: [(x,y) grid coordinates], [digit].
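To make that output format concrete, here is a small sketch that flattens a solved board into alternating coordinate and digit tokens. The function name `solution_to_tokens`, the (row, col) ordering and the exact token spelling are assumptions for illustration, not a format anyone in this thread has specified.

```python
def solution_to_tokens(puzzle, solution):
    """Emit a coordinate token followed by a digit token for each blank cell.

    puzzle and solution are lists of rows; 0 marks a blank in puzzle.
    """
    tokens = []
    for r, row in enumerate(puzzle):
        for c, clue in enumerate(row):
            if clue == 0:
                tokens.append(f"({r},{c})")          # hypothetical coordinate token
                tokens.append(str(solution[r][c]))   # digit token
    return tokens
```

Even in this direct format the answer is spread over many tokens, so it is never produced in a single forward pass.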

I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.

My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku; you'd need the intermediate CoT "thinking out loud".
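A minimal sketch of what such a trace could look like, assuming a plain depth-first backtracking solver that logs every fill and every undo. The names `solve_with_trace`, `candidates`, `parse` and `PUZZLE`, and the ("fill"/"undo", row, col, digit) record shape, are made up for this example.

```python
def candidates(grid, r, c):
    """Digits not yet used in the row, column or 3x3 box of cell (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[br + i][bc + j] for i in range(3) for j in range(3)}
    return [d for d in range(1, 10) if d not in used]

def solve_with_trace(grid, trace):
    """Depth-first search that appends each fill and each undo to trace."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    trace.append(("fill", r, c, d))
                    if solve_with_trace(grid, trace):
                        return True
                    grid[r][c] = 0
                    trace.append(("undo", r, c, d))
                return False  # no digit fits here: the caller must backtrack
    return True  # no blanks left

def parse(text):
    """Read a grid from 9 whitespace-separated rows; '.' marks a blank."""
    return [[int(ch) if ch.isdigit() else 0 for ch in line]
            for line in text.split()]

# An example puzzle, included only so the sketch can be run.
PUZZLE = """
53..7.... 6..195... .98....6.
8...6...3 4..8.3..1 7...2...6
.6....28. ...419..5 ....8..79
"""
```

Calling `solve_with_trace(parse(PUZZLE), trace)` with an empty list `trace` fills the grid in place and leaves the full sequence of steps, dead ends included, in `trace`. That sequence is the kind of "thinking out loud" a CoT would have to spell out.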


gwern 1062 days ago

> I think if we're getting specific to this particular Sudoku example, the CoT would probably involve a trace of the entire filling-in and backtracking steps that a solver would do.

Yes, and maybe the occasional generation of the complete boardstate to date, because you don't want to leave the boardstate implicit and require it to be reconstructed within each forward pass - that's 'using up serial computations' that a Transformer can't afford. But if you periodically serialize the best-answer-to-date, you are more likely to be able to bite off a chewable chunk.
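One way to picture the periodic serialization is the sketch below, which replays a list of solver steps and re-states the whole board every few steps. The function name `render_trace`, the step format ("fill"/"undo", row, col, digit), the `every` parameter and the text layout are all assumptions for illustration.

```python
def render_trace(puzzle, trace, every=4):
    """Turn solver steps into text lines, re-stating the full board every `every` steps."""
    grid = [row[:] for row in puzzle]  # work on a copy of the starting board
    lines = []
    for i, (op, r, c, d) in enumerate(trace, start=1):
        grid[r][c] = d if op == "fill" else 0
        lines.append(f"{op} ({r},{c}) {d}")
        if i % every == 0:
            # Explicit snapshot, so later steps need not re-derive the state.
            board = "/".join("".join(str(v) for v in row) for row in grid)
            lines.append(f"board {board}")
    return lines
```

A smaller `every` spends more tokens on snapshots; a larger one leaves more of the state to be reconstructed from the preceding steps.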

> My guess is that the straightforward output of the exact solution, even though it requires several tokens, wouldn't be enough to do the constraint resolution in Sudoku

A Transformer is not much different from an unrolled RNN without weight-sharing, so for any specific sudoku size, there should be some depth which does allow the worst-case amount of backtracking or other solution to the problem. (One way to show this would be to use the RASP programming language to program such a solver.) It's just that it would probably be bigger/deeper than what you have available now.
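As a loose analogy only (this is neither a Transformer nor a RASP program), the sketch below runs a backtracking solver for a fixed number of steps, with each loop iteration standing in for one layer of an unrolled network. The 4x4 board size and the names `allowed` and `solve_within` are assumptions for this example.

```python
N, BOX = 4, 2  # assumption: 4x4 mini-Sudoku with 2x2 boxes

def allowed(grid, r, c, d):
    """True if d does not already appear in the row, column or box of (r, c)."""
    if any(grid[r][i] == d or grid[i][c] == d for i in range(N)):
        return False
    br, bc = r - r % BOX, c - c % BOX
    return all(grid[br + i][bc + j] != d
               for i in range(BOX) for j in range(BOX))

def solve_within(puzzle, depth):
    """Run at most `depth` fill-or-backtrack steps; None if the budget runs out."""
    grid = [row[:] for row in puzzle]
    blanks = [(r, c) for r in range(N) for c in range(N) if grid[r][c] == 0]
    k = 0  # index of the blank currently being decided
    for _ in range(depth):  # one iteration ~ one "layer" of serial computation
        if k == len(blanks):
            return grid
        if k < 0:
            return None  # search space exhausted: no solution exists
        r, c = blanks[k]
        start = grid[r][c] + 1  # resume after the digit tried last time
        grid[r][c] = 0
        nxt = next((d for d in range(start, N + 1) if allowed(grid, r, c, d)), None)
        if nxt is None:
            k -= 1  # backtrack to the previous blank
        else:
            grid[r][c] = nxt
            k += 1  # move on to the next blank
    return grid if k == len(blanks) else None
```

With too small a `depth` the function returns `None` even for a solvable puzzle; since the board size is fixed, the number of steps is bounded, so some finite `depth` covers the worst case.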


shawntan 1062 days ago

Right, I see your point. Since Sudoku is fixed-size, you can always construct a Transformer with the worst-case depth. That makes sense.

I was assuming that, given a trained Transformer, you wouldn't know how many effective "steps of computation" it contained, and so would probably have to resort to CoT.
