Hacker News new | ask | show | jobs
by Legend2440 336 days ago
It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.