by wolfgang42
1108 days ago
IIRC from poking around in the LLaMA internals (I assume ChatGPT works the same way, since it's the obvious way to handle this): the token list includes a complete set of tokens of length 1. This means that in the degenerate case where the tokenizer can't compose the text out of any longer tokens, the text can still be encoded, just as a sequence of single-character tokens that the language model presumably has vaguer associations for. (I imagine this doesn't actually affect things significantly: even if you added more tokens for less-frequently-seen strings, the model still wouldn't have much of an idea what to do with them.)
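To make the fallback idea concrete, here's a toy sketch. It is not how LLaMA or ChatGPT actually tokenize (as far as I know they use learned BPE merges); it's a plain greedy longest-match tokenizer, and `MULTI_BYTE_TOKENS`, `VOCAB`, and `tokenize` are names I made up for illustration. I'm also assuming the length-1 tokens are single bytes. The only point is that once every length-1 token is in the vocabulary, any input can be encoded:

```python
from typing import List

# Made-up multi-byte tokens for illustration, not taken from any real model.
MULTI_BYTE_TOKENS = [b"the", b" token", b"izer", b" list"]

# Assumption: the vocabulary also holds every possible single byte,
# so there is always a length-1 token to fall back on.
VOCAB = set(MULTI_BYTE_TOKENS) | {bytes([b]) for b in range(256)}
MAX_LEN = max(len(t) for t in VOCAB)


def tokenize(text: str) -> List[bytes]:
    """Greedy longest-match tokenization over the UTF-8 bytes of `text`."""
    data = text.encode("utf-8")
    tokens: List[bytes] = []
    i = 0
    while i < len(data):
        # Try the longest candidate first; a length-1 slice always matches,
        # so this loop always appends a token and advances i.
        for length in range(min(MAX_LEN, len(data) - i), 0, -1):
            piece = data[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
    return tokens


print(tokenize("the tokenizer"))  # uses the multi-byte tokens
print(tokenize("xyz"))            # degenerate case: one token per byte
```

Text that matches nothing longer still comes out as a valid token sequence, just a longer one made of pieces the model has presumably seen in far less specific contexts.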