|
|
|
|
|
by psb217
237 days ago
|
|
The trick is that the vision tokens are continuous valued vectors, while the text tokens are elements from a small discrete set (which are converted into continuous valued vectors by a lookup table). So, vision tokens can convey significantly more bits per token than text tokens. This allows them to pack the content of multiple text tokens into a single vision token. |
|