|
|
|
|
|
by selfhoster11
426 days ago
|
|
All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for 0x00 to 0xFF, and you can encode any novel UTF-8 words or structures with it. Including emoji and characters that weren't a part of the model's initial training, if you show it some examples. |
|
Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
“If you use word tokens: …. willnever be able to predict words outside of the training vocabulary.”
"If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."