| HN Mirror

	by 383toast 12 days ago
	why not a multimodal embedding model?

2 comments

efavdb 12 days ago

The article says this misses important details, e.g., data that might be in the image.

breadislove 12 days ago

Very bad take. With most modern multimodal models you get way better performance than going to text first.

emil_sorensen 12 days ago

it's a cost/latency trade-off in production + very use-case dependent

sateesh 12 days ago

The article does mention why they don't use multimodal retrieval. Also, I think this approach is cheaper (compute-wise) than multimodal retrieval. From the article:

  Multimodal retrieval does not suit this domain. CLIP-style embeddings wash out exactly the fine detail that matters in charts, tables, and annotated screenshots, and short technical queries ("how do I configure X") give too little signal to match against image vectors
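
To make the text-first idea concrete, here is a rough sketch, not the article's actual pipeline. It assumes each image has already been turned into a text description at index time; the descriptions, file names, and the bag-of-words `embed` function are all hypothetical stand-ins for a real vision model and a real text embedding model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    # Toy stand-in for a text embedding model: raw term counts.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical descriptions, as if generated once per image at index time.
IMAGE_DESCRIPTIONS = {
    "settings.png": "Screenshot of the settings page showing how to configure the API key field",
    "latency_chart.png": "Line chart of p95 latency in milliseconds per region over one week",
}

# Only text vectors are indexed; no image embeddings are involved.
INDEX = {name: embed(desc) for name, desc in IMAGE_DESCRIPTIONS.items()}


def search(query):
    # Return the image whose description is most similar to the query.
    q = embed(query)
    return max(INDEX, key=lambda name: cosine(q, INDEX[name]))
```

Under these assumptions, a short query like `search("how do I configure the API key")` is matched against text that spells out what the screenshot shows, and the expensive image-to-text step is paid once at index time, not at query time.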
