by emanuer
338 days ago
Could someone please help me understand how a multi-modal RAG does not already solve this issue?[1] What am I missing? Flash 2.5, Sonnet 3.7, etc. always provided me with very satisfactory image analysis. And, I might be making this up, but to me it feels like some models provide better responses when I give them the text as an image, instead of feeding "just" the text.

[1] https://www.youtube.com/watch?v=p7yRLIj9IyQ

You need to apply things like quantization, single-vector conversions (using fixed-dimensional encodings), and better indexing to make multimodal RAG work at scale.
That is exactly what we're doing at Morphik :)
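
To make two of the ideas above concrete, here is a rough sketch of collapsing per-patch embeddings into a single vector and then quantizing it. It makes no claim about how Morphik actually does this. Mean pooling is used only as the simplest stand-in for a single-vector conversion (a real fixed-dimensional encoding would likely be more involved), and the page data, dimensions, and function names are all made up for illustration.

```python
import math
import random


def mean_pool(patch_vectors):
    """Collapse many per-patch vectors into one fixed-size vector.

    Assumption: plain averaging, used here only as the simplest stand-in
    for a single-vector conversion.
    """
    dim = len(patch_vectors[0])
    count = len(patch_vectors)
    return [sum(v[i] for v in patch_vectors) / count for i in range(dim)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def quantize_int8(vec):
    """Symmetric scalar quantization: floats -> ints in [-127, 127] plus a scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vec], scale


def dequantize(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical page: 32 patch embeddings of dimension 64 (random placeholder data).
random.seed(0)
page = [[random.gauss(0, 1) for _ in range(64)] for _ in range(32)]

single = l2_normalize(mean_pool(page))   # one vector per page instead of 32
codes, scale = quantize_int8(single)     # roughly 1 byte per dimension when stored as int8
approx = dequantize(codes, scale)
error = max(abs(a - b) for a, b in zip(single, approx))
```

The trade-off is the usual one: one small vector per page is far cheaper to store and index than dozens of full-precision patch vectors, at the cost of some retrieval accuracy, which is why the indexing side matters as well.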