by bad_username
9 days ago
> we don't send images to the model at query time. We describe each image once, at indexing time, with a cheap vision model, store the descriptions as text, and retrieve them alongside ordinary text chunks

This is what I've been doing in my Obsidian infodump for a while. If I know that an image is important, I generate a text description (Mermaid if possible, English if not) and paste it in a block after the image. This lets agents "see" the image even though they never actually look at it. My process is manual, but the improvement in outcomes for agents that rely on text search/retrieval is very real and worth the effort.
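For anyone who wants to script the paste-a-description-after-the-image step, here is a rough sketch. It assumes you already have the descriptions in a dict keyed by image file name (however you produced them, by hand or with a vision model). `annotate_images` and the `> Image description` line format are made up for illustration, not anything built into Obsidian.

```python
import re

# Matches Obsidian embeds such as ![[diagram.png]] or ![[diagram.png|300]],
# and plain Markdown images such as ![alt](diagram.png).
IMAGE_RE = re.compile(
    r"!\[\[([^\]|]+)(?:\|[^\]]*)?\]\]"       # Obsidian embed
    r"|!\[[^\]]*\]\(([^)\s]+)[^)]*\)"        # Markdown image
)

def annotate_images(markdown: str, descriptions: dict[str, str]) -> str:
    """Insert a text description line after every line that embeds a known image.

    `descriptions` maps an image file name to its description (hypothetical
    input; produce it however you like). Images without an entry are left alone.
    Note: this is not idempotent, running it twice adds the line twice.
    """
    out = []
    for line in markdown.splitlines():
        out.append(line)
        for match in IMAGE_RE.finditer(line):
            name = match.group(1) or match.group(2)
            desc = descriptions.get(name)
            if desc:
                out.append(f"> Image description ({name}): {desc}")
    return "\n".join(out)

note = "Intro\n![[flow.png]]\nOutro"
annotated = annotate_images(note, {"flow.png": "Login flow: user -> auth -> dashboard"})
print(annotated)
```

Because the description ends up as ordinary text in the note, any plain text search or retrieval step picks it up without knowing an image was involved.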
|
Retrieving based on text and then giving the generation model the image itself instead is much smarter than retrieving based on the image. Image-based retrieval is slow and expensive.
The same goes for giving the model an image vs. a structured representation of it.
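
A toy sketch of the "retrieve on text, hand over the image" idea, in case it helps. Everything here is hypothetical: `Chunk`, `retrieve`, and `build_context` are illustrative names, and plain keyword overlap stands in for whatever text search or embedding index you actually use.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    text: str                         # ordinary text, or the stored image description
    image_path: Optional[str] = None  # set only for chunks that describe an image

def tokens(s: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    """Rank chunks by keyword overlap with the query (a stand-in for real text search)."""
    q = tokens(query)
    scored = [(len(q & tokens(c.text)), i, c) for i, c in enumerate(chunks)]
    scored = [s for s in scored if s[0] > 0]     # drop chunks with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))     # best score first, stable on ties
    return [c for _, _, c in scored[:k]]

def build_context(hits: list[Chunk]) -> list[dict]:
    """Image chunks are passed on as the image itself; text chunks as text."""
    return [
        {"type": "image", "path": c.image_path} if c.image_path
        else {"type": "text", "text": c.text}
        for c in hits
    ]

chunks = [
    Chunk("Notes about quarterly budget planning"),
    Chunk("Diagram of the login flow: user, auth service, dashboard",
          image_path="img/login.png"),
]
hits = retrieve("login flow diagram", chunks)
context = build_context(hits)
print(context)
```

The search only ever touches text, so it stays cheap; the image is loaded once, for the few hits that actually make it into the prompt.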