
>> Could you elaborate a bit more on why you think training a transformer only on chess moves (in algebraic notation, yes. Algebraic notation is the one that says <piece><square>, roughly speaking) wouldn't work? I'm not sure I understand.

Oh no, I think it would work. I just think it would be impossible for one person to train a Transformer to play good chess purely by predicting the next move. Now that I think about it, ChatGPT's model is trained not only on algebraic notation (thanks!) but also on analyses of games, so the natural language in its initial prompt also directs it to play a certain ... kind? style? of game. I'm guessing anyway.

>> I've just been working on my own crazy chess AI ideas for a long while now and I was taken aback by the fact that GPT seems able to occasionally "find" long tactical sequences even in positions that have not occurred before in known games.

Well, what GPT is doing is, fundamentally, compression. Normally we think of compression as what happens when we zip a file, right? You zip a file, then you unzip it, and you get the same file back. Forgetting about lossless and lossy compression for a second, it is also possible to compress information so that you can uncompress it into variations of the original.

Here's a very simple example: Suppose I decided to store a parse of the sentence "the cat eats a bat" as a Context-Free grammar.

  sentence --> noun_phrase, verb_phrase.
  noun_phrase --> det, noun.
  verb_phrase --> verb, noun_phrase.
  det --> [the].
  det --> [a].
  noun --> [cat].
  noun --> [bat].
  verb --> [eats].

Now that is a grammar that accepts, and generates, not only the initial sentence, "the cat eats a bat", but also the sentences: "the cat eats a cat", "the cat eats the cat", "a cat eats the cat", "the bat eats a cat", "the bat eats a bat", "a cat eats a cat", "a bat eats a bat" and so on.
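If you want to check the "and so on" part, here is a minimal Python sketch that enumerates everything the grammar generates. It is not the DCG itself: the dict encoding and the names GRAMMAR and expand are my own assumptions for illustration, and it only works because this grammar has no recursion.

```python
from itertools import product

# The grammar from above as a dict (an illustrative encoding, not Prolog):
# each nonterminal maps to its alternative right-hand sides; any symbol
# that is not a key is treated as a word.
GRAMMAR = {
    "sentence": [["noun_phrase", "verb_phrase"]],
    "noun_phrase": [["det", "noun"]],
    "verb_phrase": [["verb", "noun_phrase"]],
    "det": [["the"], ["a"]],
    "noun": [["cat"], ["bat"]],
    "verb": [["eats"]],
}

def expand(symbol):
    """Yield every word sequence (as a tuple) derivable from symbol."""
    if symbol not in GRAMMAR:
        yield (symbol,)
        return
    for rhs in GRAMMAR[symbol]:
        # Take every combination of expansions of the right-hand-side symbols.
        for parts in product(*(list(expand(s)) for s in rhs)):
            yield tuple(word for part in parts for word in part)

sentences = [" ".join(words) for words in expand("sentence")]
print(len(sentences))  # 16: 2 determiners x 2 nouns, twice over
for s in sentences:
    print(s)
```

Only one of those 16 sentences is the one we started with.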

So we started with a grammar meant to represent one string, and we ended up with a grammar that can spit out a whole bunch of strings that are not the original string. That's what I mean by "compress[ing] information so that you can uncompress it into variations of the original". And that's why language models can generate never-before-seen sequences, like you say. Because they generate them from bits and pieces of sequences they've already seen.

Obviously language models are very different models of language than grammars, and they also have weights that can be used to select certain generations in preference to others, but that's more work for you.
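To give a rough idea of what weights buy you, here is a small Python sketch of the same grammar with a weight on each rule, so that some sentences come out more often than others. The weights are made up for illustration, the names WEIGHTED_GRAMMAR and sample are hypothetical, and none of this is how a Transformer actually works.

```python
import random

# Each nonterminal maps to (right-hand side, weight) pairs.
# The weights are invented for this example, not learned from anything.
WEIGHTED_GRAMMAR = {
    "sentence": [(["noun_phrase", "verb_phrase"], 1.0)],
    "noun_phrase": [(["det", "noun"], 1.0)],
    "verb_phrase": [(["verb", "noun_phrase"], 1.0)],
    "det": [(["the"], 0.7), (["a"], 0.3)],
    "noun": [(["cat"], 0.6), (["bat"], 0.4)],
    "verb": [(["eats"], 1.0)],
}

def sample(symbol, rng):
    """Expand symbol into a list of words, picking rules by weight."""
    if symbol not in WEIGHTED_GRAMMAR:
        return [symbol]  # not a nonterminal, so it is a word
    options = WEIGHTED_GRAMMAR[symbol]
    rhs = rng.choices(
        [body for body, _ in options],
        weights=[weight for _, weight in options],
        k=1,
    )[0]
    words = []
    for s in rhs:
        words.extend(sample(s, rng))
    return words

rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(5):
    print(" ".join(sample("sentence", rng)))
```

With these weights, "the cat eats the cat" is the single most likely sentence, but every one of the other sentences can still come out.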

The example above is copied from here:

https://en.wikipedia.org/wiki/Definite_clause_grammar#Exampl...

There's a fuller example that shows how to build an actual parse tree but I left it out to avoid hurting your eyes:

https://en.wikipedia.org/wiki/Definite_clause_grammar#Parsin...

Again, all that's nothing to do with Transformers. It's just a way to understand how you can start with some encoding of one sentence, and generate many more. Fundamentally, language modelling works the same regardless of the specific model.

Edit: note also that the grammar above isn't compressing the original sentence "the cat eats a bat" at a very high rate, but if you take into account all the other sentences it can generate, that's a good rate of compression.
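For a back-of-the-envelope check of that, here is a Python sketch that compares character counts. Character count is a crude stand-in for size that I'm assuming for illustration, and the variable names are mine.

```python
from itertools import product

# The grammar text as written above, without the indentation.
GRAMMAR_TEXT = """\
sentence --> noun_phrase, verb_phrase.
noun_phrase --> det, noun.
verb_phrase --> verb, noun_phrase.
det --> [the].
det --> [a].
noun --> [cat].
noun --> [bat].
verb --> [eats]."""

original = "the cat eats a bat"
dets = ["the", "a"]
nouns = ["cat", "bat"]

# Every sentence of the form: det noun "eats" det noun.
sentences = [
    f"{d1} {n1} eats {d2} {n2}"
    for d1, n1, d2, n2 in product(dets, nouns, dets, nouns)
]

grammar_size = len(GRAMMAR_TEXT)
original_size = len(original)
total_size = sum(len(s) for s in sentences)

print("grammar:", grammar_size)
print("original sentence:", original_size)
print("all generated sentences:", total_size)
print("grammar / original:", grammar_size / original_size)
print("grammar / all sentences:", grammar_size / total_size)
```

The first ratio comes out above 1 (the grammar is bigger than the one sentence) and the second below 1 (it is smaller than everything it generates put together). The gap should only get wider as you add words: one more noun rule adds a line to the grammar but multiplies the number of sentences.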