by krackers
247 days ago
But naively, wouldn't you expect the representation of a piece of text as vision tokens to take roughly the same number of bits as (or more than) its representation as text tokens? You're changing the representation, sure, but that by itself doesn't give you any compute advantage unless there is some sparsity or compressibility you can exploit in the domain you transform to, right? So I guess my question is: where is the juice being squeezed from? Why does the vision-token representation end up being more efficient than text tokens?