|
|
|
|
|
by serjester
428 days ago
|
|
I think anyone that cares enough about embedding performance to use niche models is probably parsing their PDF's into some sort of textual format. Otherwise you need orient your all your pipelines to handle images which adds significant complexity (hybrid search, reranking, LLM calls, etc - all way harder with images). Not to mention an image is optimistically 50 KB vs the same page represented as markdown is maybe 2–5 KB. When you're talking about pulling in potentially hundreds of pages, that's a 10–20x increase in storage, memory usage, and network overhead. I do wish they had a more head-to-head comparison with voyage. I think they're the de facto king of proprietary embeddings and with Mongo having bought them, I'd love to migrate away once someone can match their performance. |
|