

by GistNoesis 586 days ago

a^nb^n can definitely be expressed and recognized with a transformer.

A transformer (with relative invariant positional embeddings) has full context, so it can see the whole sequence. It just has to count and compare.

To convince yourself, construct the weights manually.

First layer:

Zero out the characters that are equal to the previous character.

Second layer:

Build a feature to detect and extract the position embedding of the first a, a second feature for the position embedding of the last a, a third feature for the position embedding of the first b, and a fourth feature for the position embedding of the last b.

Third layer:

On top of that, check whether (second feature - first feature) == (fourth feature - third feature).
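To make the three steps concrete, here is a plain Python sketch of the same computation. It is only a procedural analogue, not actual transformer weights. The function names are made up for illustration, and one step is my own assumption: the run-start markers from the first layer are used to require exactly one run of a's followed by one run of b's (the description above doesn't say how that output is consumed, and without it a string like "abab" would pass the position check). It also assumes n >= 1.

```python
def run_starts(s):
    """Layer 1 analogue: keep a character only where it differs from the previous one."""
    return [c for i, c in enumerate(s) if i == 0 or s[i - 1] != c]


def boundary_positions(s):
    """Layer 2 analogue: positions of the first a, last a, first b, last b."""
    return s.find("a"), s.rfind("a"), s.find("b"), s.rfind("b")


def is_anbn(s):
    """Layer 3 analogue: compare the two span lengths (assumes n >= 1)."""
    if set(s) - {"a", "b"}:
        return False
    # Assumption (not in the comment): exactly one run of a's, then one run of b's.
    if run_starts(s) != ["a", "b"]:
        return False
    first_a, last_a, first_b, last_b = boundary_positions(s)
    return (last_a - first_a) == (last_b - first_b)


if __name__ == "__main__":
    for s in ["ab", "aabb", "aab", "abab", "ba", ""]:
        print(repr(s), is_anbn(s))
```

Whether gradient descent would ever find weights that implement something like this is a separate question from whether such weights exist.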

The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, aka the training procedure.

If you train only by showing examples with varying n, there probably isn't an inductive bias that makes it converge naturally towards the optimal solution you can construct by hand. But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.

You can't deduce much from negative results in research, besides it requiring more work.


YeGoblynQueenne 586 days ago

>> The paper doesn't distinguish between the expressive capability of the model and finding the optimum of the model, aka the training procedure.

They do. That's the whole point of the paper: you can set a bunch of weights manually like you suggest, but can you learn them instead; and how? See the Introduction. They make it very clear that they are investigating whether certain concepts can be learned by gradient descent, specifically. They point out that earlier work doesn't do that and that gradient descent is an obvious bit of bias that should affect the ability of different architectures to learn different concepts. Like I say, good work.

>> But you can probably train multiple formal languages simultaneously, to make the counting feature emerge from the data.

You could always try it out yourself, you know. Like I say, that's the beauty of grammars: you can generate tons of synthetic data and go to town.
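For anyone who wants to try, a generator for that kind of data could look roughly like the sketch below. Everything in it is an assumption for illustration rather than anything from the paper: the two languages (a^n b^n and a^n b^n c^n), the `LANGUAGES` registry, the 50/50 label split, and the way negatives are made (dropping one character, which always breaks the counts).

```python
import random


def anbn(n):
    return "a" * n + "b" * n


def anbncn(n):
    return "a" * n + "b" * n + "c" * n


# Hypothetical task registry: one generator per formal language.
LANGUAGES = {"anbn": anbn, "anbncn": anbncn}


def corrupt(s, rng):
    """Make a negative example by deleting one character, so the counts no longer match."""
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]


def make_dataset(size, max_n=20, seed=0):
    """Return (task, string, label) rows mixing several languages."""
    rng = random.Random(seed)
    rows = []
    for _ in range(size):
        task = rng.choice(sorted(LANGUAGES))
        s = LANGUAGES[task](rng.randint(1, max_n))
        label = 1
        if rng.random() < 0.5:
            s, label = corrupt(s, rng), 0
        rows.append((task, s, label))
    return rows


if __name__ == "__main__":
    for row in make_dataset(5, max_n=4):
        print(row)
```

Single deletions only give "near miss" negatives; a serious experiment would probably want other kinds of negatives too, and lengths at test time beyond `max_n` to check generalisation.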

>> You can't deduce much from negative results in research, besides it requiring more work.

I disagree. I'm a falsificationist. The only time we learn anything useful is when stuff fails.

GistNoesis 586 days ago

Gradient descent usually gets stuck in local minima. It depends on the shape of the energy landscape; that's expected behavior.

The current wisdom is that optimizing for multiple tasks simultaneously makes the energy landscape smoother. One task allows the model to discover features which can be used to solve other tasks.

Useful features that are used by many tasks can more easily emerge from the sea of useless features. If you don't have sufficiently many distinct tasks, the signal doesn't get above the noise and is much harder to observe.

That's the whole point of "Generalist" intelligence in the scaling hypothesis.

For problems where you can write a solution manually, you can also help the training procedure by regularising your problem: add the auxiliary task of predicting some custom feature. Alternatively you can "Generatively Pretrain" to obtain useful features, replacing a custom loss function with custom data.
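As a rough sketch of what the auxiliary-task idea could look like for a^nb^n: the main loss is the usual membership classification loss, and the auxiliary loss asks the model to also predict hand-written features, here the number of a's and the number of b's. The choice of features, the squared-error form, and the `aux_weight` value are all assumptions for illustration; the model's outputs are just passed in as plain numbers.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Main task loss: is the string in the language (y = 1) or not (y = 0)?"""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def aux_targets(s):
    """Hand-written custom features: counts of a's and b's."""
    return s.count("a"), s.count("b")


def combined_loss(p_member, pred_counts, s, label, aux_weight=0.1):
    """Main loss plus a weighted squared error on the predicted counts."""
    main = binary_cross_entropy(p_member, label)
    aux = sum((p - t) ** 2 for p, t in zip(pred_counts, aux_targets(s))) / 2
    return main + aux_weight * aux


if __name__ == "__main__":
    # Same membership prediction, but the second call miscounts the a's.
    print(combined_loss(0.9, (2, 2), "aabb", 1))
    print(combined_loss(0.9, (3, 2), "aabb", 1))
```

The point of the extra term is only to push the network towards representing the counts internally; the weight would need tuning so it doesn't swamp the main task.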

The paper is a useful characterisation of the energy landscape of various formal tasks in isolation, but it doesn't investigate the more general, simpler problems that occur in practice.