Hacker News new | ask | show | jobs
by mountainriver 756 days ago
Vision language models use various encoders to project the image into tokens. This is just a means of a unified encoder across modalities