| a^nb^n can definitely be expressed and recognized with a transformer. A transformer (with relative invariant positional embedding) has full context so can see the whole sequence. It just has to count and compare. To convince yourself, construct the weights manually. First layer : zeros the character which are equal to the previous character. Second layer : Build a feature to detect and extract the position embedding of the first a.
a second feature to detect and extract the position embedding of the last a,
a third feature to detect and extract the position embedding of the first b,
a fourth feature to detect and extract the position embedding of the last b, Third layer : on top that check whether (second feature - first feature) == (fourth feature - third feature). The paper doesn't distinguish between what is the expressive capability of the model, and the finding the optimum of the model, aka the training procedure. If you train by only showing example with varying n, there probably isn't inductive bias to make it converge naturally towards the optimal solution you can construct by hand. But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data. You can't deduce much from negative results in research beside it requiring more work. |
They do. That's the whole point of the paper: you can set a bunch of weights manually like you suggest, but can you learn them instead; and how? See the Introduction. They make it very clear that they are investigating whether certain concepts can be learned by gradient descent, specifically. They point out that earlier work doesn't do that and that gradient descent is an obvious bit of bias that should affect the ability of different architectures to learn different concepts. Like I say, good work.
>> But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.
You could always try it out yourself, you know. Like I say that's the beauty of grammars: you can generate tons of synthetic data and go to town.
>> You can't deduce much from negative results in research beside it requiring more work.
I disagree. I'm a falsificationist. The only time we learn anything useful is when stuff fails.