by corethree
931 days ago
The token vocabulary used by large language models like GPT-4 is designed to be comprehensive enough to represent virtually any text, including any word that could exist in a language. It is chosen deliberately and is separate from the training of the neural net. The training process teaches the LLM how to compose these tokens into replies to our queries. The training data does not contain obscured words or sentences with strange spacing, yet the LLM is still able to compose the tokens correctly from varied input that never existed in the training data. It is intelligence.
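To illustrate the "can represent virtually any text" point, here is a minimal toy sketch. It is not GPT-4's actual tokenizer: the multi-byte pieces in `PIECES` are made up for this example, and greedy longest-match stands in for real byte-pair merging. The only idea it shares with byte-level vocabularies is the fallback: because every single byte is in the vocabulary, no input is ever unrepresentable.

```python
# Toy byte-level vocabulary (hypothetical): a few multi-byte pieces
# plus all 256 single bytes as a guaranteed fallback.
PIECES = [b"police", b"caught", b"the", b"rap", b"ist"]
VOCAB = PIECES + [bytes([b]) for b in range(256)]
TOKEN_ID = {piece: i for i, piece in enumerate(VOCAB)}
MAX_LEN = max(len(piece) for piece in VOCAB)


def encode(text):
    """Greedy longest-match encoding over UTF-8 bytes (a stand-in for BPE)."""
    data = text.encode("utf-8")
    ids = []
    i = 0
    while i < len(data):
        # Try the longest candidate first; a single byte always matches,
        # so the loop below always finds something and advances i.
        for length in range(min(MAX_LEN, len(data) - i), 0, -1):
            piece = data[i:i + length]
            if piece in TOKEN_ID:
                ids.append(TOKEN_ID[piece])
                i += length
                break
    return ids


def decode(ids):
    """Concatenate the token bytes and decode them back to a string."""
    return b"".join(VOCAB[i] for i in ids).decode("utf-8")


ids = encode("policecaughttherapist")
print([VOCAB[i] for i in ids])  # [b'police', b'caught', b'the', b'rap', b'ist']
print(decode(encode("zq\u2192\u00e9")) == "zq\u2192\u00e9")  # True
```

Note that the token boundaries come from the vocabulary alone, with no notion of which reading of the string is meant.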
And even then ChatGPT fails to segment "policecaughttherapist" (https://chat.openai.com/share/21c7596a-6474-4639-8a92-5cea54...), even though:
1) If I were talking about a therapist, the sentence would look like "police caught _the_ therapist"
2) How often do the police even catch therapists? Come on, it looks like the training set was just heavily censored. No intelligence, just a broken ngram database (where n = length of articles in training set, see https://news.ycombinator.com/item?id=38458683).
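For reference, the string really does have two dictionary-valid segmentations. A minimal sketch that enumerates them, assuming a tiny hand-picked word list (`WORDS` is hypothetical; a real segmenter would use a full dictionary and some way of ranking the candidates):

```python
# Hypothetical tiny dictionary, just enough for this one string.
WORDS = {"police", "caught", "the", "therapist", "rapist"}


def segmentations(text, words=WORDS):
    """Return every way to split text into a sequence of dictionary words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in words:
            # Keep this prefix and recurse on whatever is left.
            for rest in segmentations(text[end:], words):
                results.append([head] + rest)
    return results


for words in segmentations("policecaughttherapist"):
    print(" ".join(words))
# police caught the rapist
# police caught therapist
```

Both splits come out as valid; choosing between them takes the kind of grammatical and real-world reasoning listed in 1) and 2).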