by 383toast 12 days ago
why not a multimodal embedding model?
2 comments

Article says this misses important details, e.g. data that might be in the image.
very bad take. with most modern multimodal models you get way better performance than going to text first
it's a cost/latency trade-off in production + very use-case dependent
The article does mention why they don't use multimodal retrieval. Also, I think this approach is cheaper (compute-wise) than multimodal retrieval. From the article:

  Multimodal retrieval does not suit this domain. CLIP-style embeddings wash out exactly the fine detail that matters in charts, tables, and annotated screenshots, and short technical queries ("how do I configure X") give too little signal to match against image vectors