| Text tokens are quantized and represent subword units, vision tokens only exist in the embedding space. The way text tokenization works in LLMs is that you have a "lookup table" of (small) token ids to (large) vector embeddings. To pass text to the LLM, you split it at token boundaries, convert strings to token ids, and then construct the "context", a matrix where each row is a vector taken from that lookup table. Transmitting text token sequences can be relatively efficient, you just transmit the token IDs themselves[1]. They're small integers (~100k possible token ids is typical for large models). Transmitting the actual embeddings matrix would be far less efficient, as embeddings often consist of thousands of floating point numbers. Images are encoded differently. After some basic preprocessing, image data is passed straight to a neural- network-based image encoder. That encoder encodes the image into vectors, which are then appended to the context. There are no token ids, there's no lookup table, we go straight from image data to token embeddings. This means transmitting image tokens cannot be done as efficiently, as you'd have to transmit the embeddings themselves. Even though an image is encoded in fewer tokens, the most efficient representation of those tokens takes more bytes. You can think of a text token as an integer between 0 and n, which we know how to map to a vector. This means you have `n` possible choices of tokens. In contrast, an image token is an array of m floating point numbers (the vector itself), each of which can take on many possible values. This means the "token space" of vision tokens is actually much larger. There's also the issue of patterns. Text tokens correspond directly to a contiguous span of UTF-8 bytes, and most tokenizers won't create tokens that span word boundaries. This means they can't encode global patterns efficiently. You can't have a "Hamlet's monologue" or "the text that follows is in Spanish" token. |