Hacker News
by joelburget 704 days ago
Vision Transformers do a shocking amount of compression in the tokenizer. In the [Chameleon paper](https://arxiv.org/pdf/2405.09818) they say the tokenizer "encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192". That's 256 pixels per token (512 * 512 / 1024). If we assume that a pixel is 24 bits (3 x 8-bit channels), this implies that they've compressed 256 * 24 = 6144 bits of pixel data into a single token that carries at most 13 bits (log2(8192) = 13). [An Image is Worth 32 Tokens for Reconstruction and Generation](https://yucornetto.github.io/projects/titok.html) pushes this even further. If these models work similarly, it's no wonder they struggle with some vision tasks.
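For anyone who wants to check the arithmetic, here is a rough Python sketch of the same back-of-the-envelope calculation. The function name `tokenizer_compression` is made up for illustration, and it assumes uncompressed 24-bit RGB input and that each token carries at most log2(codebook size) bits, so the result is an upper bound on what the tokens can encode, not a measurement of the actual tokenizer.

```python
import math

def tokenizer_compression(width, height, num_tokens, codebook_size, bits_per_pixel=24):
    """Return (raw_bits, token_bits, ratio) for a discrete image tokenizer.

    Hypothetical helper: assumes raw pixels are uncompressed and that each
    token carries at most log2(codebook_size) bits of information.
    """
    raw_bits = width * height * bits_per_pixel
    bits_per_token = math.log2(codebook_size)
    token_bits = num_tokens * bits_per_token
    return raw_bits, token_bits, raw_bits / token_bits

# Numbers quoted from the Chameleon paper: 512 x 512 image, 1024 tokens, codebook of 8192.
raw_bits, token_bits, ratio = tokenizer_compression(512, 512, 1024, 8192)
print(raw_bits, token_bits, round(ratio, 1))  # 6291456 13312.0 472.6
```

So under these assumptions the tokenizer is squeezing the image by a factor of roughly 470, which matches the per-token figure of 6144 bits into 13.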
3 comments

It’s not as simple as that. If you ask GPT-4o to create a copy of these images, it generally creates one faithfully (e.g. an image with 5 squares will be produced), so it’s “seeing” things reasonably enough.

It doesn’t seem to have the logic to answer these questions, though.

The complete data set is here to play around with it yourself: https://huggingface.co/datasets/XAI/vlmsareblind/viewer/defa...

GPT-4o is very good at some visual tasks like optical character recognition. So the selective blindness might just be what you say here -- all of its capacity is dedicated to minimizing loss on a few narrow tasks that had the most training data (like OCR). So it's not necessarily an inherent failure of the architecture to generalize; it could just be a capacity issue that will naturally be resolved with more scale.
Is that not just traditional OCR applied on top of an LLM?
It's possible they have a software layer that does that. But I was assuming they don't, because the open source multimodal models don't.
No, it’s not; it’s a multimodal transformer model.
For some reason this made me think of trying to describe the taste of a fruit to someone who has never tried it. That seems like a similar problem in a non-visual sensory modality in humans.