| HN Mirror

by emanuer 338 days ago

Could someone please help me understand how a multi-modal RAG does not already solve this issue?[1]

What am I missing?

Flash 2.5, Sonnet 3.7, etc. always provided me with very satisfactory image analysis. And, I might be making this up, but to me it feels like some models provide better responses when I give them the text as an image, instead of feeding "just" the text.

[1] https://www.youtube.com/watch?v=p7yRLIj9IyQ

1 comments

ArnavAgrawal03 338 days ago

Multimodal RAG is exactly what we argue for. In their original state, though, multivectors (which form the basis for multimodal RAG) are very unwieldy: computing the similarity scores is very expensive, so scaling them up in this state is hard.

You need to apply things like quantization, single-vector conversions (using fixed dimensional encodings), and better indexing to ensure that multimodal RAG works at scale.

That is exactly what we're doing at Morphik :)
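
To make the cost argument above concrete, here is a minimal sketch of late-interaction (MaxSim) scoring over multivectors, plus a 1-bit quantized variant that compares sign bits by Hamming distance. It is only an illustration under stated assumptions, not Morphik's implementation: the function names (`maxsim`, `binarize`, `hamming_maxsim`) are hypothetical, and sign-based binarization is just one simple quantization scheme among many. The fixed dimensional encodings and indexing mentioned in the comment are not shown.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score (hypothetical helper).

    For each query vector, take the best dot product against any document
    vector, then sum those maxima. Every query vector is compared with every
    document vector, so the cost grows as len(query) * len(doc) * dim per
    document, which is what makes raw multivectors expensive at scale.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def binarize(vec):
    """1-bit quantization (assumed scheme): keep only the sign of each
    dimension and pack the bits into a Python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits


def hamming_maxsim(query_codes, doc_codes, dim):
    """Same max-then-sum shape as maxsim, but on binary codes.

    Similarity is the number of matching bits: dim minus Hamming distance.
    XOR plus a bit count replaces the floating-point dot product.
    """
    return sum(
        max(dim - bin(q ^ d).count("1") for d in doc_codes)
        for q in query_codes
    )


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[1.0, 0.0], [0.5, 0.5]]
    print(maxsim(query, doc))

    q_codes = [binarize(v) for v in [[0.3, -1.0, 2.0, -0.1]]]
    d_codes = [binarize(v) for v in [[0.9, -0.2, 0.4, -0.7], [-1.0, 1.0, -1.0, 1.0]]]
    print(hamming_maxsim(q_codes, d_codes, dim=4))
```

The quantized score is only an approximation of the full-precision one, so in practice it would typically serve as a cheap first pass, with exact scoring reserved for a short list of candidates.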

barrenko 337 days ago

And the Gemini(s) aren't already doing this at GoogleCorp?
