Hacker News new | ask | show | jobs
by CGamesPlay 335 days ago
This makes sense, but is something to shaking up the RAG pipeline? Perhaps you could take each RAG result and then do a model processing step to ask it to extract relevant information from the image directly pertaining to the user query, once per result, and then aggregate those (text) results as the input to your final generation. That would sidestep the token limit for multiple images, and allow parallelizing the image understanding step.