by Lockal
931 days ago
No, GPT just stores a dictionary of the most common character sequences (tokens), which are not always whole words and are sometimes parts of words. GPT-2 had about 50,000 tokens: https://huggingface.co/roberta-base/raw/main/vocab.json . GPT-4 uses a vocabulary of about 100,000 tokens (according to some sources, which I can't verify). This may seem unusual for English, but for some other languages like Japanese, splitting unspaced text into tokens has been standard practice for many years. Without it, processing text is basically impossible there, because Japanese text has no spaces between words.
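To make the splitting concrete, here is a minimal sketch of dictionary-based tokenization. It is not how GPT does it: real GPT tokenizers use byte-pair encoding with learned merge rules, and I have swapped in a simpler greedy longest-match lookup to show the idea. The `VOCAB` contents and the `tokenize` helper are made up for illustration.

```python
# Hypothetical toy vocabulary, chosen only for illustration.
# A real GPT vocabulary has tens of thousands of learned entries.
VOCAB = {"token", "ization", "un", "believ", "able", "私", "は", "学生", "です"}


def tokenize(text, vocab):
    """Split text by greedy longest match against vocab.

    A character that starts no vocabulary entry becomes a
    single-character token, so the function always terminates.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(tokenize("tokenization", VOCAB))   # ['token', 'ization']
print(tokenize("unbelievable", VOCAB))   # ['un', 'believ', 'able']
print(tokenize("私は学生です", VOCAB))     # ['私', 'は', '学生', 'です']
```

The Japanese sentence has no spaces at all, yet it still splits into usable pieces, because the split is driven by the dictionary rather than by whitespace.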
The training process teaches LLMs how to compose these tokens into replies to our queries. The training data does not contain obscured words or sentences with strange spacing, yet the LLM is still able to compose the tokens correctly from varied input that never appeared in the training data.
It is intelligence.