by bad_username 9 days ago
> we don't send images to the model at query time. We describe each image once, at indexing time, with a cheap vision model, store the descriptions as text, and retrieve them alongside ordinary text chunks

This is what I've been doing in my Obsidian infodump for a while. If I know that an image is important, I generate a text description (Mermaid if possible, English if not) and paste it after the image in a block. This lets agents "see" the image even when they can't actually see it. Though my process is manual, the improvements in outcomes for agents that rely on text search/retrieval are very real and worth the effort.
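A minimal sketch of how the manual paste step could be automated, assuming Obsidian-style `![[file.png]]` embeds. The function name `annotate_images` and the quoted block format are hypothetical, not something described in the comment above.

```python
import re

# Matches Obsidian-style image embeds such as ![[diagram.png]] or ![[diagram.png|400]].
EMBED = re.compile(
    r"!\[\[([^\]|]+\.(?:png|jpg|jpeg|gif|svg))(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def annotate_images(note: str, descriptions: dict) -> str:
    """Append a text description block after each embedded image that has one."""
    out = []
    for line in note.splitlines():
        out.append(line)
        for match in EMBED.finditer(line):
            name = match.group(1)
            desc = descriptions.get(name)
            if desc:
                # Hypothetical block format: a quoted block right after the image.
                out.append(f"> Image description ({name}):")
                out.extend("> " + d for d in desc.splitlines())
    return "\n".join(out)


note = "Deploy flow:\n![[deploy.png]]\nMore notes."
result = annotate_images(
    note, {"deploy.png": "flowchart LR\n  Build --> Test --> Deploy"}
)
print(result)
```

The sketch is not idempotent: running it twice on the same note would add the description twice, so a real version would need to detect an existing block first.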


For a RAG project for a client with a lot of PDFs and Powerpoints with images, I used ColPali a year ago. I see the provider ColiVara is still online but it seems to have fizzled out.

Retrieving based on text and then giving the generation model the image instead is much smarter than retrieving based on image. Image-based retrieval is slow and expensive.
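A rough sketch of that split in Python: retrieval runs over text only (including stored image descriptions), and the image itself is attached only when its description is a hit. The `Chunk` type, the token-overlap scoring, and the `("image", path)` part format are illustrative assumptions, not any particular library's API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str                         # what retrieval searches over
    image_path: Optional[str] = None  # set only for image-description chunks


def tokens(s: str) -> set:
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Rank chunks by naive token overlap with the query (stand-in for real search)."""
    q = tokens(query)
    return sorted(chunks, key=lambda c: len(q & tokens(c.text)), reverse=True)[:k]


def build_prompt_parts(query: str, hits: list) -> list:
    """Send the image for description hits, and the text for ordinary chunks."""
    parts = []
    for c in hits:
        if c.image_path:
            parts.append(("image", c.image_path))
        else:
            parts.append(("text", c.text))
    parts.append(("text", query))
    return parts


chunks = [
    Chunk("Quarterly revenue grew in the north region."),
    Chunk("Wiring diagram of the pump controller with relay K1 and fuse F2.",
          image_path="img/pump_wiring.png"),
    Chunk("The pump controller must be serviced every six months."),
]
query = "which fuse protects the pump controller"
parts = build_prompt_parts(query, retrieve(query, chunks))
print(parts)
```

In a real system the overlap score would be replaced by BM25 or embedding search, but the shape stays the same: cheap text retrieval first, with the image loaded only for the few chunks that make it into the prompt.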

Same with giving the model an image vs a structured representation of it.

> For a RAG project for a client with a lot of PDFs and Powerpoints with images, I used ColPali a year ago

How was the accuracy compared to pre-parsing the image and doing search in the text?

Leaps and bounds better! I don't think I benchmarked it.

But my experience was that it could find small details in PDFs and technical diagrams, which OCR did not capture well at all.

In general, I think OCR output should be used as an aid for retrieval rather than given to the generation model itself. That is similar to retrieving on a text description and then giving the generation model the image.

What does Mermaid text description of an image mean?

Descriptions of images that are charts or diagrams to start with?

Most diagrams I come across are basically boxes and arrows, which can be represented as Mermaid flowcharts without losing information. The Mermaid layout will usually look different from the original, but layout is typically not what matters. ChatGPT is quite good at creating Mermaid flowcharts from arbitrary box-and-arrow diagram images.
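To make "boxes and arrows as Mermaid" concrete, here is a small Python sketch that turns a list of labelled edges into Mermaid flowchart text. The helper `to_mermaid` and the sample edges are made up for illustration, and it assumes labels contain no double quotes.

```python
def to_mermaid(edges, direction="TD"):
    """Render (source_label, target_label) pairs as a Mermaid flowchart."""
    lines = [f"flowchart {direction}"]
    ids = {}

    def node_id(label):
        # Give each distinct label a stable short id (n0, n1, ...).
        if label not in ids:
            ids[label] = f"n{len(ids)}"
        return ids[label]

    for src, dst in edges:
        a, b = node_id(src), node_id(dst)
        lines.append(f'    {a}["{src}"] --> {b}["{dst}"]')
    return "\n".join(lines)


diagram = to_mermaid([("Client", "API"), ("API", "Database")])
print(diagram)
# flowchart TD
#     n0["Client"] --> n1["API"]
#     n1["API"] --> n2["Database"]
```

The text keeps which box connects to which, which is the part a text-only agent can search and reason over. The positions on the page are dropped.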

Which cheap vision model would you recommend for ingesting category diagrams and producing mermaid facsimiles?

I haven't yet tried to solve this at any scale. So my models are ChatGPT (plus) in the browser, or Sonnet/Opus 4.x in Zoo Code.