|
|
|
|
|
by joelburget
704 days ago
|
|
Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3x 8 bit channels), this implies that they've compressed 256 * 24 = 6144 bits into 13 = (log2(8192)). [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks. |
|
It doesn’t seem to have the logic though to answer these questions.
The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...