by tempusalaria 587 days ago
This is very similar to how LLMs are taught to understand images in LLaVA-style models: the image embeddings are projected into the language model's token embedding space and inserted into the existing token stream.
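
For anyone who wants to see the shape of the idea, here is a rough pure-Python sketch of it, not LLaVA's actual code. Everything named here is an assumption for illustration: the dimensions, the random weights, the single linear projection, the placeholder id `IMAGE_TOKEN_ID`, and the helpers `linear_project` and `splice_image_embeddings`. In a real model the patch features would come from a vision encoder and the projection would be learned.

```python
import random

def linear_project(features, weight, bias):
    """Map each vision feature (length d_in) to the text embedding size (d_out)."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]

def splice_image_embeddings(token_ids, embedding_table, image_embeds, image_token_id):
    """Build the input sequence, swapping the placeholder token for image embeddings."""
    sequence = []
    for tok in token_ids:
        if tok == image_token_id:
            sequence.extend(image_embeds)        # one vector per image patch
        else:
            sequence.append(embedding_table[tok])  # ordinary text token lookup
    return sequence

rng = random.Random(0)
D_VISION, D_TEXT, VOCAB, N_PATCHES = 6, 4, 10, 3  # toy sizes (assumed)
IMAGE_TOKEN_ID = 9                                # hypothetical placeholder id

def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

embedding_table = [rand_vec(D_TEXT) for _ in range(VOCAB)]  # stand-in for the LLM's embeddings
weight = [rand_vec(D_VISION) for _ in range(D_TEXT)]        # stand-in for a learned projection
bias = [0.0] * D_TEXT
patch_features = [rand_vec(D_VISION) for _ in range(N_PATCHES)]  # stand-in for encoder output

image_embeds = linear_project(patch_features, weight, bias)
token_ids = [1, 2, IMAGE_TOKEN_ID, 3, 4]
sequence = splice_image_embeddings(token_ids, embedding_table, image_embeds, IMAGE_TOKEN_ID)

print(len(sequence), len(sequence[0]))  # 4 text tokens + 3 patches, each of width D_TEXT
```

The point is that after the projection the language model sees one sequence of same-sized vectors, and it does not need a separate input path for the image.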