by looobay
249 days ago
LLMs are compute-heavy: self-attention cost scales quadratically with the number of tokens. They are trying to compress text tokens into visual tokens with their VLM. Maybe they render the text to an image before tokenizing, so the model sees fewer tokens and the compute cost drops.

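To put a rough number on the quadratic part, here is a back-of-the-envelope sketch. It only counts query-key pairs in one attention layer and ignores everything else (MLP cost, the vision encoder itself). The function names and the 10x compression ratio are my own illustrative assumptions, not figures from the paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by one self-attention head in one layer."""
    return num_tokens * num_tokens


def relative_attention_cost(text_tokens: int, compression_ratio: int) -> float:
    """Attention cost after compression, as a fraction of the original cost.

    compression_ratio is a hypothetical "text tokens per visual token" value.
    """
    visual_tokens = text_tokens // compression_ratio
    return attention_pairs(visual_tokens) / attention_pairs(text_tokens)


if __name__ == "__main__":
    # Assumed example: 10,000 text tokens squeezed into 1,000 visual tokens.
    print(relative_attention_cost(10_000, 10))  # 0.01
```

So if a 10x token reduction were achievable, the attention term alone would shrink by about 100x. That still says nothing about why the visual tokens can carry the same information.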
So I guess my question is: where is the juice being squeezed from? Why does the vision token representation end up being more efficient than text tokens?