| >Increasing the number of layers isn't a smart way to solve it. The "smart way" is a luxury. Solving the problem is what matters; think of a smart way later if you can. That's how a lot of technological advancement has worked. >In order to be able to reason effectively and efficiently the model needs to use as much, or as little, compute as needed for a given task. Completing "1+1=" should take fewer compute steps than "A winning sequence for white here is ...". Same thing. Efficiency is nice but a secondary concern. >If the generating process required variable compute (maybe 1000s of steps) - e.g. to come up with a chess move - then no amount of training can make the LLM converge to model this generative process. Every individual inference problem needs a fixed number of compute steps (yes, even your chess move). Variability only shows up between inferences (maybe move 1 requires 500 steps but move 2 only 240, etc.). It's a nice thing to have but never a necessary one. 3.5-turbo-instruct plays chess consistently at 1800 Elo, so clearly the N of the current SOTA is already enough to play non-trivial chess at a level beyond most humans. There is an N large enough for every GI problem humans care about. Not to sound like a broken record, but once again, limited ≠ trivial. |