Y
Hacker News
new
|
ask
|
show
|
jobs
by
Legend2440
336 days ago
It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.