
by derefr 700 days ago

Because of the way tokenization usually works with LLMs (BPE, Byte Pair Encoding), there is usually already a 256-element subset of the token space that represents "raw bytes." You could say that this 256-element set is "pre-seeded" into any BPE encoding, and that it will remain part of the encoding as long as at least one document in the dataset used to determine the tokenization uses each byte at least once in a way that isn't predictable as a high-frequency suffix.
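To make the "pre-seeded" idea concrete, here is a minimal sketch of a toy byte-level BPE trainer. The function name `train_bpe` and the tiny corpus are made up for illustration, and the sketch seeds all 256 byte tokens unconditionally, which is an assumption about how such trainers are typically set up rather than a description of any particular library.

```python
from collections import Counter


def train_bpe(corpus: bytes, num_merges: int) -> dict[bytes, int]:
    """Toy byte-level BPE trainer (hypothetical helper, not a library API).

    Returns a vocabulary mapping byte strings to token IDs.
    """
    # Seed the vocabulary with every possible single byte (IDs 0..255).
    vocab = {bytes([b]): b for b in range(256)}
    # Start with the corpus split into single-byte tokens.
    seq = [bytes([b]) for b in corpus]

    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so no merge is worth learning
        merged = left + right
        vocab[merged] = len(vocab)  # learned tokens get IDs 256, 257, ...

        # Rewrite the sequence, replacing each occurrence of the pair.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return vocab


vocab = train_bpe(b"low lower lowest low low", num_merges=3)
print(len(vocab))       # 256 byte tokens plus the learned merges
print(vocab[b"low"])    # ID of a learned multi-byte token
```

In this toy version the byte tokens always occupy IDs 0 to 255 and merges are appended after them; real tokenizers may order their IDs differently.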

These tokens are also already very much in use by the tokenizer: they get emitted in sequences to encode single Unicode codepoints that weren't common enough in the dataset to get their own tokens, and so require multiple tokens each. I believe most tokenizers (e.g. tiktoken) just take the UTF-8 byte sequences underlying these codepoints and encode them literally as sequences of tokens from the above 256-element set.
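A rough sketch of that byte fallback, using a hypothetical vocabulary (256 byte tokens plus two made-up learned tokens) and greedy longest-match instead of real BPE merge rules. In this toy setup a byte token's ID equals its byte value, which is an assumption and not how tiktoken numbers them.

```python
def encode(text: str, vocab: dict[bytes, int]) -> list[int]:
    """Greedy longest-match over UTF-8 bytes (a simplification of BPE)."""
    data = text.encode("utf-8")
    max_len = max(len(tok) for tok in vocab)
    ids, i = [], 0
    while i < len(data):
        # Try the longest candidate first; a single byte always matches
        # because all 256 byte tokens are in the vocabulary.
        for size in range(min(max_len, len(data) - i), 0, -1):
            piece = data[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
    return ids


# Hypothetical vocabulary: byte tokens 0..255 plus two learned tokens.
vocab = {bytes([b]): b for b in range(256)}
vocab[b"hello"] = 256
vocab["\u00e9".encode("utf-8")] = 257  # "é" was common enough to get a token

common = encode("hello", vocab)          # a single learned token
accented = encode("\u00e9", vocab)       # also a single learned token
rare = encode("\U00013000", vocab)       # U+13000 has no token of its own

print(common, accented, rare)
```

The rare codepoint comes out as four IDs, one per byte of its UTF-8 encoding (F0 93 80 80), while the two common strings each collapse to one token.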

If you're curious, here's the definition of an encoding used by several recent OpenAI models, in newline-delimited "[base64 of raw input byte sequence] [tokenID to encode as]" format: https://openaipublic.blob.core.windows.net/encodings/cl100k_... . If you decode it, you can observe that the rest of the 256-element single-byte token set gets mapped to tokenIDs immediately following those of the ASCII printables.
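Decoding a file in that format only needs the standard library. Below is a small sketch; `parse_ranks` is a hypothetical helper name, and the `sample` string is hand-written in the same "[base64] [tokenID]" layout with illustrative IDs, not an excerpt from the real file.

```python
import base64


def parse_ranks(text: str) -> dict[bytes, int]:
    """Parse newline-delimited '<base64 bytes> <token id>' lines."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        b64, token_id = line.split()
        ranks[base64.b64decode(b64)] = int(token_id)
    return ranks


# Hand-written sample lines in the same format; the IDs are illustrative only.
sample = "IQ== 0\nIg== 1\nIw== 2\naGVsbG8= 3\n"

ranks = parse_ranks(sample)
# Single-byte entries are the "raw byte" tokens discussed above.
single_byte = {tok: tid for tok, tid in ranks.items() if len(tok) == 1}

for tok, tid in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(tid, tok)
```

Run against the real file's contents, filtering on `len(tok) == 1` and sorting by ID is one way to check which tokenIDs the 256 single-byte tokens landed on.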