Hacker News new | ask | show | jobs
by ianbicking 335 days ago
Using modern tools I would naturally be inclined to:

1. Have the LLM see the image and produce an text version using a kind of semantic markup (even hallucinated markup)

2. Use that text for most of the RAG

3. If the focus (of analysis or conversation) converges one image, include that image in the context in addition to the text

If I use a simple prompt with GPT 4o on the Palantir slide from the article I get this: https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... – seems pretty good!