| It's not often I see something that's fractally wrong but here we are. There is a dictionary, it's called the tokenizer. There are grammar rules, they are just very weak because the structure of human language is generally quite weak. When presented with languages which have strong consistent grammars the weights are very easily interpretable as a grammar: https://arxiv.org/abs/2201.02177 The point of the original short story is that the computational substrate doesn't matter when you have Turing completeness. This one seems to think that you don't need structure and interpretability just because you change substrates. |
At best, it's a wordlist. It gives the LLM some idea of what humans consider to be common words, but it doesn't tell the LLM anything at all about those words. It's not even comprehensive: many words map to multiple tokens. Nor is it exclusively words: some of those tokens are punctuation, or modifiers, or control tokens. In multimodal LLMs, some of the tokens actually represent image and audio data.
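To make that concrete, here's a rough sketch of a tokenizer as a bare wordlist. Everything in it is made up for illustration: `TOY_VOCAB`, `encode` and `decode` are hypothetical names, and the greedy longest-match loop is a simple stand-in for what real tokenizers do (typically something like byte-pair encoding over a vocabulary of tens of thousands of entries).

```python
# Hypothetical toy vocabulary: a control token, punctuation, whole words
# and word fragments all sit in the same flat list.
TOY_VOCAB = ["<|endoftext|>", " ", ".", ",", "the", "cat", "un", "believ", "able", "s"]
TOKEN_TO_ID = {token: index for index, token in enumerate(TOY_VOCAB)}


def encode(text):
    """Greedy longest-match encoding (an assumption, not real BPE)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        candidates = [t for t in TOY_VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers {text[pos]!r}")
        match = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """Map IDs back to text by concatenating the vocabulary entries."""
    return "".join(TOY_VOCAB[i] for i in ids)


print(encode("the cat."))      # one ID per word or punctuation mark
print(encode("unbelievable"))  # a single word split into three tokens
```

The output is just lists of integers. Nothing in the list says that `cat` is a noun, that `un` negates, or that `<|endoftext|>` is special.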
The LLM doesn't get informed about any of this up front; it has to learn what every single token means from context.
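As a minimal sketch of what "not informed up front" looks like: from the model's side, a token ID is only a row index into an embedding table that starts out as random numbers. The sizes and names here (`VOCAB_SIZE`, `EMBED_DIM`, `embedding_table`, `embed`) are assumptions picked for illustration, not taken from any particular model.

```python
import random

VOCAB_SIZE = 10  # hypothetical; real vocabularies are far larger
EMBED_DIM = 4    # hypothetical; real embeddings have thousands of dimensions

rng = random.Random(0)

# Before training, every token's vector is small random noise.
# Any "meaning" has to be learned into these rows (and the layers above them).
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up one vector per token ID; the ID itself carries no information."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([4, 1, 5])
print(len(vectors), len(vectors[0]))
```

Swapping which ID is assigned to which string before training would change nothing, which is why the wordlist by itself isn't much of a dictionary.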
You are technically right that it's something in an LLM that's not weights, but it's not that structured. And really, it's only there so the LLM can interact with the outside world.
> There are grammar rules
There is no dedicated "grammar rule" structure in the LLM or the tokeniser. It has to learn them all from context, and they end up encoded in the weights, spread across all the layers (80 or so in a large model).