by jonathan-adly
515 days ago
The “equivalent” here would be Jina-CLIP, in terms of architecture if not necessarily performance. The ColPali paper(1) does a good job of explaining why you don’t really want to use vision embeddings directly, and why you are much better off optimizing for RAG with a ColPali-like setup. Basically, plain vision embeddings are not optimized for textual understanding: they work if you are searching for the word “bird” and want images of birds, but they don’t work well for pulling up a document that is a paper about birds.

1. https://arxiv.org/abs/2407.01449
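
To make the "ColPali-like setup" a bit more concrete: the paper scores a query against a page with ColBERT-style late interaction, meaning each query token keeps its own vector and is matched against the page's patch vectors, instead of comparing one pooled vector per side. Here is a rough sketch of that scoring step only. The 2-d vectors, the page names, and the `maxsim_score` / `rank` helpers are made up for illustration; this is not the ColPali codebase or its API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def maxsim_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score.

    For each query token vector, keep its best similarity over all
    page patch vectors, then sum those maxima over the query tokens.
    """
    return sum(max(cosine(q, p) for p in page_vecs) for q in query_vecs)


def rank(query_vecs, pages):
    """Return page names sorted from best to worst MaxSim score."""
    return sorted(
        pages,
        key=lambda name: maxsim_score(query_vecs, pages[name]),
        reverse=True,
    )


# Toy data (hypothetical): two query token vectors and two pages.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    # page_a has a patch that matches each query token
    "page_a": [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
    # page_b only has patches close to the first query token
    "page_b": [[1.0, 0.0], [1.0, 0.1]],
}

ranking = rank(query, pages)
```

The point of the design is that a page only scores well if it has something matching every part of the query, which is closer to what you want for document retrieval than a single image-level vector.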