| HN Mirror


by HarHarVeryFunny 656 days ago

You are confusing number of sequential steps with total amount of compute spent.

The input sequence is processed in parallel regardless of its length, so the number of tokens has no impact on the number of sequential compute steps, which is always N, the number of layers.
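A rough sketch of that distinction, as a toy cost model rather than a description of any real implementation. The function name `forward_pass_cost` and the quadratic attention term are illustrative assumptions; the only point is that total work grows with sequence length while sequential depth does not.

```python
def forward_pass_cost(num_tokens: int, num_layers: int) -> dict:
    """Toy cost model for one transformer forward pass (illustrative only)."""
    # Each layer has to wait for the previous layer's output, so the number
    # of sequential steps equals the layer count, whatever the input length.
    sequential_steps = num_layers
    # Inside a layer all positions are handled in parallel. Self-attention
    # relates every token to every other token, hence the assumed n^2 term.
    total_work = num_layers * num_tokens ** 2
    return {"sequential_steps": sequential_steps, "total_work": total_work}


short_prompt = forward_pass_cost(num_tokens=10, num_layers=32)
long_prompt = forward_pass_cost(num_tokens=1000, num_layers=32)
```

Under these assumptions both prompts report 32 sequential steps, while the longer prompt costs 10,000 times more total work.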

> Do you know what a "language model" is capable of in the limit ?

Well, yeah, if the language model is an N-layer transformer ...


famouswaffles 656 days ago

Fair enough.

Then increase N (N is almost always increased when a model is scaled up) and train, or write things down and continue.

A limitless iteration machine (without external aid) is currently a thing of fiction. Brains can't do it, so I'm not particularly worried if machines can't either.
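One way to read "write things down and continue" is as autoregressive generation with a scratchpad. A minimal sketch under that reading: the helper `sequential_steps` and the numbers 32 and 200 are hypothetical, and the model ignores everything except the dependency between one written token and the next.

```python
def sequential_steps(num_layers: int, tokens_written: int) -> int:
    """Sequential depth of autoregressive generation in a toy model."""
    # Each written token needs its own forward pass (num_layers deep), and
    # that pass cannot start until the previous token has been produced.
    return num_layers * tokens_written


one_shot = sequential_steps(num_layers=32, tokens_written=1)
scratchpad = sequential_steps(num_layers=32, tokens_written=200)
```

In this toy model, answering in one token gives 32 sequential steps, while writing 200 intermediate tokens gives 6400, with N itself unchanged.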


HarHarVeryFunny 656 days ago

Increasing the number of layers isn't a smart way to solve it. In order to be able to reason effectively and efficiently, the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".

This lack of "variable compute" is a widely recognized shortcoming of transformer-based LLMs, and there are plenty of others. The point apropos this thread is that you can't just train an LLM to be something that it is not. If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process... the best it can do is to model the outcome of the generative process, not the process itself. The difference is that without having learnt the generative process, the model will fail when presented with a novel input that it didn't see during training, and therefore didn't memorize the "cheat sheet" answer for.
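The fixed-versus-variable point can be illustrated with an analogy outside of LLMs. The sketch below uses Euclid's algorithm, chosen only because its step count depends on the input; `gcd_variable` and `gcd_fixed_depth` are hypothetical names, and nothing here models a transformer directly.

```python
from typing import Optional, Tuple


def gcd_variable(a: int, b: int) -> Tuple[int, int]:
    """Euclid's algorithm: loops for as many steps as the input requires."""
    steps = 0
    while b != 0:
        a, b = b, a % b
        steps += 1
    return a, steps


def gcd_fixed_depth(a: int, b: int, depth: int) -> Optional[int]:
    """Same update rule, but capped at `depth` sequential steps."""
    for _ in range(depth):
        if b != 0:
            a, b = b, a % b
    # If the budget ran out before b reached 0, no answer is available.
    return a if b == 0 else None


easy = gcd_variable(4, 2)             # needs 1 step
hard = gcd_variable(89, 55)           # consecutive Fibonacci numbers, 9 steps
capped = gcd_fixed_depth(89, 55, 4)   # budget too small for this input
```

With a budget of 4 steps the capped version handles (4, 2) but returns `None` for (89, 55); raising the budget to 9 fixes that input while leaving any input that needs 10 or more steps out of reach.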


famouswaffles 656 days ago

>Increasing number of layers isn't a smart way to solve it.

The "smart way" is a luxury. Solving the problem is what matters. Think of a smart way later if you can. That's how a lot of technological advancement has worked.

>In order to be able to reason effectively and efficiently, the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...".

Same thing. Efficiency is nice but a secondary concern.

>If the generating process required variable compute (maybe 1000's of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process.

Every inference problem itself needs a fixed number of compute steps (yes, even your chess move). Variability is a nice thing to have between inferences (maybe move 1 required 500 steps but move 2 only 240, etc.). A nice thing, but never a necessary thing.

GPT-3.5-turbo-instruct plays chess consistently at 1800 Elo, so clearly the N of the current SOTA is already enough to play non-trivial chess at a level beyond most humans.

There is an N large enough for every GI problem humans care about. Not to sound like a broken record, but once again, limited != trivial.
