Hacker News new | ask | show | jobs
by pu_pe 96 days ago
You didn't actually give an example of what the issue with next token prediction is. You just mentioned current constraints (ie generalization and learning are difficult, needs mountains of data to train, can't play chess very well) that are not fundamental problems. You can trivially train a transformer to play chess above the level any human can play at, and they would still be doing "next token prediction". I wouldn't be surprised if every single thing you list as a challenge is solved in a few years, either through improvement at a basic level (ie better architectures) or harnessing.

We don't know how human brains produce intelligence. At a fundamental level, they might also be doing next token prediction or something similarly "dumb". Just because we know the basic mechanism of how LLMs work doesn't mean we can explain how they work and what they do, in a similar way that we might know everything we need to know about neurons and we still cannot fully grasp sentience.

2 comments

I use the chess example because it’s especially instructive. It would NOT be trivial to train an LLM to play chess, next token prediction breaks down when you have so many positions to remember and you can’t adequately assign value to intermediate positions. Chess bots work by being trained on how to assign value to a position, something fundamentally different than what an LLM is doing.

A simpler example — without tool use, the standard BPE tokenization method made it impossible for state of the art LLMs to tell you how many ‘r’s are in strawberry. This is because they are thinking in tokens, not letters and not words. Can you think of anything in our intelligence where the way we encode experience makes it impossible for us to reason about it? The closest thing I can come to is how some cultures/languages have different ways of describing color and as a result cannot distinguish between colors that we think are quite distinct. And yet I can explain that, think about it, etc. We can reason abstractly and we don’t have to resort to a literal deus ex machina to do so.

Not being able to explain our brain to you doesn’t mean I can’t notice things that LLMs can’t do, and that we can, and draw some conclusions.

There are chess engines based on transformers, even DeepMind released one [1]. It achieved ~2900 Elo. It does have peculiarities for example in the endgame that are likely derived from its architecture, though I think it definitely qualifies as an example of the fact that simply because something is a next token predictor doesn't mean it cannot perform tasks that require intelligence and planning.

The r in strawberry is more of a fundamental limitation of our tokenization procedures, not the transformer architecture. We could easily train a LLM with byte-size tokens that would nail those problems. It can also be easily fixed with harnessing (ie for this class of problems, write a script rather than solve it yourself). I mean, we do this all the time ourselves, even mathematicians and physicists will run to a calculator for all kinds of problems they could in principle solve in their heads.

[1] https://arxiv.org/abs/2402.04494

But chess models aren't trained the same way LLMs are trained. If I am not mistaken, they are trained directly from chess moves using pure reinforcement learning, and it's definitely not trivial as for instance AlphaZero took 64 TPUs to train.
You can train them in a very similar way.

Modern LLMs often start at "imitation learning" pre-training on web-scale data and continue with RLVR for specific verifiable tasks like coding. You can pre-train a chess engine transformer on human or engine chess parties, "imitation learning" mode, and then add RL against other engines or as self-play - to anneal the deficiencies and improve performance.

This was used for a few different game engines in practice. Probably not worth it for chess unless you explicitly want humanlike moves, but games with wider state and things like incomplete information benefit from the early "imitation learning" regime getting them into the envelope fast.

I meant trivial in the sense it's a solved problem, I'm sure it still costs a non-negligible amount of money to train it. See for example the chess transformer built by DeepMind a couple of years ago which I referred to in a sibling comment [1].

[1] https://arxiv.org/abs/2402.04494

Thank you for the link.

I admit, my knowledge of reinforcement learning is a bit outdated so it seemed to me that it was unattainable for a non-specialized model to train efficiently on something like chess, which has a huge state space.