Hacker News
by mountainriver 756 days ago
Vision language models use various encoders to project the image into tokens. This is just a way of getting a single, unified encoder across modalities.
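For anyone who hasn't seen what "project the image into tokens" looks like, here is a rough sketch of one common approach (ViT-style patch embedding): cut the image into patches, flatten each one, and push it through a linear layer so it has the same width as a text embedding. Everything here is made up for illustration: the sizes `PATCH` and `DIM`, the random weights, and the zero-filled text embeddings. Real models use learned weights and usually a much deeper encoder.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            patches.append([image[top + i][left + j]
                            for i in range(patch) for j in range(patch)])
    return patches

def project(patches, weights):
    """Linearly project each flattened patch to the embedding width."""
    return [[sum(p * w for p, w in zip(vec, col)) for col in weights]
            for vec in patches]

rng = random.Random(0)
PATCH, DIM = 4, 8  # hypothetical patch size and embedding width

# Toy 8x8 "image" and a random projection matrix (DIM columns of PATCH*PATCH weights).
image = [[rng.random() for _ in range(8)] for _ in range(8)]
weights = [[rng.gauss(0, 0.02) for _ in range(PATCH * PATCH)] for _ in range(DIM)]

image_tokens = project(patchify(image, PATCH), weights)
text_tokens = [[0.0] * DIM for _ in range(3)]  # placeholder text embeddings

# Image and text tokens share a width, so they can sit in one sequence.
sequence = image_tokens + text_tokens
print(len(image_tokens), len(sequence), len(sequence[0]))  # 4 7 8
```

The point is only that once the image is a list of `DIM`-wide vectors, the downstream model doesn't need to care which modality a token came from.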