Due to the way tokenization usually works with LLMs (using BPE, Byte Pair Encoding), there's usually already a 256-element embedding within the token-space that represents "raw bytes." You could say that this 256-element set is "pre-seeded" into any BPE encoding -- and will remain as part of the encoding as long as at least one document in the dataset used to determine the tokenization uses each byte at least once in a non-high-frequency-suffix-predictable way.
These tokens are also already very much in use by the tokenizer -- they get emitted in sequences, to encode single Unicode codepoints that weren't common enough in the dataset to get their own tokens, and so instead require multiple tokens to represent them. I believe most tokenizers (e.g. tiktoken) just take the UTF-8 byte-sequences underlying these codepoints and encode them literally as sequences of the above 256-element set.
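To make the byte-fallback idea concrete, here's a minimal sketch. It is not real BPE: it uses a made-up vocabulary (the 256 single-byte tokens plus three hypothetical multi-byte tokens) and greedy longest-match instead of learned merge ranks, and the byte token IDs are simply the byte values, which real tokenizers don't guarantee. It only shows that a codepoint with no token of its own comes out as one token per UTF-8 byte.

```python
# Toy vocabulary: 256 single-byte tokens (ID == byte value here, purely for
# simplicity) plus a few hypothetical multi-byte tokens. A real BPE vocabulary
# is learned from data and applied via merge ranks, not greedy matching.
BYTE_TOKENS = {bytes([b]): b for b in range(256)}
MERGED_TOKENS = {b"the": 256, b" cat": 257, b"he": 258}
VOCAB = {**BYTE_TOKENS, **MERGED_TOKENS}
INVERSE = {token_id: seq for seq, token_id in VOCAB.items()}
MAX_LEN = max(len(seq) for seq in VOCAB)


def encode(text: str) -> list[int]:
    """Greedy longest-match over the UTF-8 bytes, falling back to byte tokens."""
    data = text.encode("utf-8")
    ids = []
    i = 0
    while i < len(data):
        # Try the longest candidate first; a length-1 slice always matches,
        # because every possible byte has its own token.
        for n in range(min(MAX_LEN, len(data) - i), 0, -1):
            piece = data[i:i + n]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += n
                break
    return ids


def decode(ids: list[int]) -> str:
    """Concatenate each token's bytes, then decode the result as UTF-8."""
    return b"".join(INVERSE[i] for i in ids).decode("utf-8")


common = encode("the cat")       # covered by multi-byte tokens
rare = encode("\U00013000")      # an Egyptian hieroglyph, no token of its own
print(common, rare, decode(rare) == "\U00013000")
```

The rare codepoint needs four tokens because its UTF-8 encoding is four bytes long, while "the cat" fits in two.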
If you're curious, here's the definition of the encoding used by OpenAI's GPT-3.5 and GPT-4 models, in newline-delimited "[base64 of raw input byte sequence] [tokenID to encode as]" format: https://openaipublic.blob.core.windows.net/encodings/cl100k_... . If you decode it, you can observe that the rest of the 256-element single-byte embedding space gets mapped to tokenIDs immediately following those of the ASCII printables.
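If you want to poke at a file in that format yourself, something like the following should do. It's a sketch: `parse_ranks` and `single_byte_ids` are names I made up, and the sample lines are synthetic ones generated in the same "[base64] [tokenID]" shape, not copied from the real file, so the IDs here mean nothing.

```python
import base64


def parse_ranks(text: str) -> dict[bytes, int]:
    """Parse newline-delimited '<base64 bytes> <token id>' lines into a dict."""
    ranks = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        b64, token_id = line.split()
        ranks[base64.b64decode(b64)] = int(token_id)
    return ranks


def single_byte_ids(ranks: dict[bytes, int]) -> dict[int, int]:
    """Map byte value -> token ID for every entry that is exactly one byte long."""
    return {seq[0]: token_id for seq, token_id in ranks.items() if len(seq) == 1}


# Synthetic sample in the same format (IDs are illustrative only).
sample_entries = [(b"!", 0), (b'"', 1), (b"\xa1", 2), (b"th", 3)]
sample_text = "\n".join(
    f"{base64.b64encode(seq).decode('ascii')} {token_id}"
    for seq, token_id in sample_entries
)

ranks = parse_ranks(sample_text)
byte_ids = single_byte_ids(ranks)
for byte_value, token_id in sorted(byte_ids.items(), key=lambda kv: kv[1]):
    print(f"byte 0x{byte_value:02x} -> token {token_id}")
```

Run against the real file's contents, sorting `single_byte_ids` by token ID is how you'd check where the non-printable bytes land.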
Somewhat inefficient for text, very inefficient for images, especially if you work in pixel space. The max context a model has been trained on today is 1M tokens, which takes up a lot of memory. Even if context were not an issue, generating a 1000x1000 image (about a million tokens at one token per pixel) would take ~3 hours at 100 tokens/s inference.
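For reference, the back-of-the-envelope arithmetic behind that figure, as a sketch. The one-token-per-pixel case is what reproduces the ~3 hour number; the three-tokens-per-pixel case (one byte each for R, G, B) is my own assumption for comparison, and `generation_hours` is just an illustrative helper.

```python
def generation_hours(width: int, height: int, tokens_per_pixel: int,
                     tokens_per_second: float) -> float:
    """Hours needed to emit every token of an image strictly one after another."""
    total_tokens = width * height * tokens_per_pixel
    return total_tokens / tokens_per_second / 3600


one_token_per_pixel = generation_hours(1000, 1000, 1, 100)
rgb_bytes_per_pixel = generation_hours(1000, 1000, 3, 100)

print(f"1 token/pixel:  {one_token_per_pixel:.2f} h")   # about 2.78 h
print(f"3 tokens/pixel: {rgb_bytes_per_pixel:.2f} h")   # about 8.33 h
```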
Google has trained an encoder/decoder LLM on bytes, called ByT5[1].
I think the work on multi-token prediction[0] within a single turn could be a significant development that makes byte-level tokenization models more practical. This approach allows the model to predict multiple tokens in parallel, potentially addressing the efficiency concerns raised about byte-level models.
By predicting multiple tokens simultaneously, it could significantly speed up inference time, especially for tasks that require generating large amounts of data (like images). This could help mitigate the performance bottleneck mentioned in the parent comment about generating a 1000x1000 image.
Forget bytes, go for bits. Vocab of size 2. At a theoretical level all of AI comes down to a classifier that is able to predict the next bit given a string of bits. Check out Tsetlin Machines. At some point we will be doing it in hardware.
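For what a vocab of size 2 means for sequence length, here's a small sketch (my own helper names, most-significant bit first by assumption): every byte becomes eight tokens, so sequences get 8x longer than with byte tokens.

```python
def to_bits(data: bytes) -> list[int]:
    """Expand each byte into 8 tokens from the vocabulary {0, 1}, MSB first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def from_bits(bits: list[int]) -> bytes:
    """Inverse of to_bits; the bit count must be a multiple of 8."""
    if len(bits) % 8 != 0:
        raise ValueError("bit sequence length must be a multiple of 8")
    out = bytearray()
    for i in range(0, len(bits), 8):
        value = 0
        for bit in bits[i:i + 8]:
            value = (value << 1) | bit
        out.append(value)
    return bytes(out)


text = "next bit"
bits = to_bits(text.encode("utf-8"))
print(len(text.encode("utf-8")), "bytes ->", len(bits), "bit tokens")
print(from_bits(bits).decode("utf-8"))
```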
- Serves as a form of compression. The main benefit of that is supporting longer sequences for any given context window. As a side benefit, it squeezes about the same amount of "information" into each token -- meaning you don't have to add any terms to your model to account for such an imbalance (or even test whether that hyperparameter matters).
- Allows you to insert stuff other than the raw data into your stream of "tokens" to the LLM. For something like a chatbot, that could be as simple as a prefix to whoever's talking next (e.g., system, user, model). You similarly probably want control characters to denote the end of a sequence. If you have multi-modal content (e.g., text + images), you need some way to delimit the transition between those. All of those problems could mostly be solved with an appropriate encoding scheme, but that's basically tokenization by a different name (in that it's a transformation from one set of tokens to another that you have to apply to every input).
You can solve that second problem trivially with just a vocabulary of 256 "byte" tokens plus O(1) control tokens, so that's not a huge deal in practice, just a point worth mentioning if we're talking about actually naively encoding bytes.
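A minimal sketch of what "256 byte tokens plus O(1) control tokens" could look like for a chat format. The control token names and IDs are invented for illustration; the only real point is that they live outside the 0-255 range, so no input text can ever collide with them.

```python
# Hypothetical control tokens; IDs 0-255 are reserved for raw bytes.
CONTROL = {"<system>": 256, "<user>": 257, "<model>": 258, "<eos>": 259}
CONTROL_NAMES = {token_id: name for name, token_id in CONTROL.items()}
VOCAB_SIZE = 256 + len(CONTROL)


def encode_turn(role: str, text: str) -> list[int]:
    """Role token, then the UTF-8 bytes of the text, then an end-of-sequence token."""
    return [CONTROL[f"<{role}>"], *text.encode("utf-8"), CONTROL["<eos>"]]


def encode_chat(turns: list[tuple[str, str]]) -> list[int]:
    ids = []
    for role, text in turns:
        ids.extend(encode_turn(role, text))
    return ids


def decode_chat(ids: list[int]) -> list[tuple[str, str]]:
    """Split the stream back into (role, text) pairs."""
    turns, role, buffer = [], None, bytearray()
    for token_id in ids:
        if token_id < 256:
            buffer.append(token_id)          # ordinary data byte
        elif CONTROL_NAMES[token_id] == "<eos>":
            turns.append((role, buffer.decode("utf-8")))
            role, buffer = None, bytearray()
        else:
            role = CONTROL_NAMES[token_id][1:-1]  # strip the angle brackets
    return turns


chat = [("system", "Be brief."), ("user", "Type <eos> for me"), ("model", "ok")]
ids = encode_chat(chat)
print(VOCAB_SIZE, len(ids), decode_chat(ids) == chat)
```

Note that the user typing the literal string "<eos>" just produces five ordinary byte tokens, not the control token.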
The first problem is more interesting. One observation is that if for your particular problem tokenization doesn't offer much compression, the difference won't matter much, or will favor raw bytes over tokenization if the tokenization isn't tailored to your particular data. IIRC there was something about Hebrew text floating around as an example of raw byte models performing better than tokenized models.
Another observation is that if your particular model has any form of compression for redundant state space (not true of any sort of vanilla transformer, mostly not true for any major competitor, technically possible regardless), especially if the cost of processing a token isn't substantially greater than the cost per byte of tokenizing an input, you also don't buy anything from tokenization. You're absolutely able to feed that raw data in and let the model handle the details.
On the flip side, suppose you're handling vanilla English text with a vanilla transformer. You can support something like 4-5x longer sequences (common BPE vocabularies average roughly four bytes of English per token) basically for free by adding tokenization. You'd be silly not to.
Image transformers are slightly different in some sense, at least in typical implementations. The tokenization is lossy (not injective), and the de-tokenization must therefore have the opposite property (not a function -- or, since it is a function, it either doesn't reproduce every possible input image patch or has randomness to at least match the right distribution hopefully). They're often called the same thing, but I view that as something different from tokenization. Certain categories of problems (much like the English text example above) are made drastically cheaper by the process. Others (unlike the English text example above) are rendered impossible by the loss of information. A byte vocabulary makes those theoretically possible again, but you suddenly need a way to handle the "entropy per byte" problem which you didn't have to care about before.
Maybe one last idea: fuzzy detokenization (like in image transformers) has a notable advantage in spec adherence. Outputting an image and then letting some other hand-written code convert that to a PNG is much more likely to produce something usable than outputting a PNG directly, byte by byte. The whole thing is probabilistic, and the flurry of strategies you've seen along the lines of "decode while greedily adhering to a schema" (JSON being the canonical example everyone wants to use for some reason, if you want to search for it) produce the wrong output distribution, often drastically so, because they bias the sampling of something that is only correct by virtue of its conditional probabilities. I'm not sure exactly how big a model you need (or how tailored a loss function) to make a model reliably output correct, large PNG files, but the current SOTA isn't there yet for general-purpose problems.
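To illustrate the "wrong output distribution" point with numbers, here's a toy two-token model I made up (the probabilities and the "schema" are arbitrary). It compares the model's true distribution conditioned on the output being valid against what you get from masking invalid tokens and renormalizing at each step, which is my reading of the greedy schema-adherence strategies.

```python
from fractions import Fraction as F

# Hypothetical model: first token A or B, then second token a or b.
FIRST = {"A": F(1, 2), "B": F(1, 2)}
SECOND = {
    "A": {"a": F(9, 10), "b": F(1, 10)},
    "B": {"a": F(1, 10), "b": F(9, 10)},
}
# Hypothetical schema: only sequences ending in "a" are valid.
VALID = {("A", "a"), ("B", "a")}


def true_conditional() -> dict:
    """P(sequence | sequence is valid) under the model's own joint distribution."""
    joint = {(x, y): FIRST[x] * SECOND[x][y] for x in FIRST for y in SECOND[x]}
    valid_mass = sum(p for seq, p in joint.items() if seq in VALID)
    return {seq: joint[seq] / valid_mass for seq in VALID}


def masked_stepwise() -> dict:
    """Distribution from masking invalid tokens and renormalizing at each step."""
    first_ok = {x: p for x, p in FIRST.items()
                if any((x, y) in VALID for y in SECOND[x])}
    z1 = sum(first_ok.values())
    result = {}
    for x, px in first_ok.items():
        second_ok = {y: p for y, p in SECOND[x].items() if (x, y) in VALID}
        z2 = sum(second_ok.values())
        for y, py in second_ok.items():
            result[(x, y)] = (px / z1) * (py / z2)
    return result


target = true_conditional()
actual = masked_stepwise()
for seq in sorted(VALID):
    print(seq, "true:", target[seq], "masked:", actual[seq])
```

Both first tokens can still lead to a valid output, so the mask never touches step one, and the masked sampler picks "B" half the time even though the model itself considers "B" followed by "a" a rare outcome: 1/2 versus the correct 1/10.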
In practice, people have made some byte-token models. They vary from "meh" to SOTA depending on the problem. On most problems, they're much more expensive than tokenized solutions. Interestingly, when they're SOTA they tend to be among the cheaper solutions.
I've been chipping away at some new model architectures, and something kind of like a byte-token solution is pretty suitable for those, largely because the model itself offers that compression you would otherwise obtain from tokenization. I'll finish and release them one of these years. For transformers though, the byte-token solution is usually only interesting insofar as proving people's suspicions. Results are fine, not amazing, except in special cases.