by jamesk_au
1108 days ago
Can anyone comment on how the limits on GPT-X’s token space translate to limits on its vocabulary (with corresponding limits on understanding input and generating output)? For example, is GPT-4’s list of ~100k tokens sufficient to understand and generate every non-obsolete word in the English language (per, say, a standard dictionary)? Or even every word in the training data? If not, do we have examples of ordinary words that GPT-4 can never understand or generate? What happens when it encounters such a word and cannot tokenize it: is the word simply ignored (e.g. omitted from the input vector, set to 0, or mapped to some sort of null token)?