| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Lockal 931 days ago
	No, GPT just stores a dictionary of most common letter sequences (tokens) - not always words, sometimes part of words. In GPT2 there was about 50 000 tokens - https://huggingface.co/roberta-base/raw/main/vocab.json . GPT4 uses vocabulary of 100 000 tokens (according to some sources, which I can't verify). While you may find it unusual for English, for some other languages like Japanese splitting text without spaces into tokens started many years ago. Otherwise processing of text is basically impossible there (there are no spaces in Japanese texts).

1 comments

corethree 931 days ago

The token system used by large language models like GPT-4 is designed to be comprehensive enough to represent virtually any text, including every possible word that could exist in a language. This is separate from the training the neural net and is chosen deliberately.

The training process teaches LLMs how to compose these tokens to form replies to our queries. The training data used in the training process does not have obscured words or sentences with strange spacing. The LLM is still able compose the tokens correctly from varied input that never existed in the training data.

It is intelligence.

link

Lockal 931 days ago

Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table with probabilities to see tokens in specific context.

And even then ChatGPT fails to segment "policecaughttherapist" (https://chat.openai.com/share/21c7596a-6474-4639-8a92-5cea54...), even though:

1) If I talked about a therapist, sentence would look like "police caught _the_ therapist"

2) How often do the police even catch therapists? Come on, it looks like the training set was just heavily censored. No intelligence, just a broken ngram database (where n = length of articles in training set, see https://news.ycombinator.com/item?id=38458683).

link

corethree 931 days ago

>Yes and no. I know that there is no word "understand" in the dictionary, only "under" and "stand", but other than that it is just a large table with probabilities to see tokens in specific context.

The training data is more important then the dictionary because the dictionary is designed to be able to form every possible combination of words and sentences that can be formed. It is not limited to specific words it is building words and sentences from building blocks.

1. That parsing is valid. Though unlikely. The choice it made is not incorrect. Thus not a sign of lack of intelligence.

2. Not often. But if you ask chatGPT to reinterpret the word in another way that is grammatically correct it will find the rapist. It shows definitively there is no censorship of the word.

3. I actually didn't see the alternative myself for some reason. Therapist jumped out at me and I didn't see what you were talking about for a good couple of minutes. I mean, unless you want to think of me (a human) as not "intelligent" then clearly it's not a factor here.

link