Hacker News
by runeblaze 240 days ago
Each text token is often a subword unit, but in VLMs the visual tokens live in a semantic space. A semantic space obviously compresses much more than subword slices do.
Disclaimer: not an expert, off the top of my head.
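
To put rough numbers on the token-count side of this, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 4-characters-per-token rule of thumb, the 1024x1024 image, the 16-pixel patches, and the 4x4 patch merging are hypothetical values, not taken from any particular model. It only counts tokens; it says nothing about how much of the text a model can actually recover from them.

```python
def rough_subword_count(text: str, chars_per_token: int = 4) -> int:
    """Crude subword estimate: ceil(len(text) / chars_per_token).

    The 4 chars/token default is a rule-of-thumb assumption, not a real tokenizer.
    """
    return -(-len(text) // chars_per_token)


def visual_token_count(height: int, width: int, patch_size: int, merge: int = 1) -> int:
    """Count visual tokens for an image cut into square patches.

    `merge` is the side length of the patch groups pooled into a single token
    (merge=4 means each 4x4 group of patches becomes one token). Hypothetical
    setup, for illustration only.
    """
    patches = (height // patch_size) * (width // patch_size)
    return patches // (merge * merge)


# A hypothetical page of 3000 characters, rendered as a 1024x1024 image.
page_text = "a" * 3000
text_tokens = rough_subword_count(page_text)
image_tokens = visual_token_count(1024, 1024, patch_size=16, merge=4)

print(f"subword tokens (estimate): {text_tokens}")
print(f"visual tokens: {image_tokens}")
print(f"ratio: {text_tokens / image_tokens:.2f}")
```

Under these made-up numbers the same page costs fewer visual tokens than subword tokens, which is the sense in which each visual token would be carrying more content than a subword slice.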