Hacker News new | ask | show | jobs
by the_duke 522 days ago
Side question: is there any good model that allows for image similarity detection across a large image set, that can be incrementally augmented with new images?

You'd somehow have to generate an embedding for each image, I presume.

2 comments

Yes, you could implement image similarity search using embeddings: Create embeddings for the entire image set, save the embeddings in a database, and add embeddings incrementally as new images come in. To search for a similar image, create the embedding for the image that you are looking for and compute the cosine similarity between that embedding and the embeddings in your database. The closer the cosine similarity is to 1.0 the more similar the images.

For choosing a model, the article mentions the AWS Titan multimodal model, but you’d have to pay for API access to create the embeddings. Alternatively, self-hosting the CLIP model [0] to create embeddings would avoid API costs.

Follow-up question: Would the embeddings from the llama3.2-vision models be of higher quality (contain more information) than the original CLIP model?

The llama vision models use CLIP under the hood, but they add a projection head to align with the text model and the CLIP weights are mutated during alignment training, so I assume the llama vision embeddings would be of higher quality, but I don’t know for sure. Does anybody know?

(I would love to test this quality myself but Ollama does not yet support creating image embeddings from the llama vision models - a feature request with several upvotes has been opened [1].)

[0] https://github.com/openai/CLIP

[1] https://github.com/ollama/ollama/issues/5304

So, there is a whole world with vision based RAG/search.

We have a good open-source repo here with a ColPali implementation: https://github.com/tjmlabs/ColiVara

Thanks for the link to the ColPali implementation - interesting! I am specifically interested in evaluation benchmarks for different image embedding models.

I see the ColiVara-Eval repo in your link. If I understand correctly, ColQwen2 is the current leader followed closely by ColPali when applying those models for RAG with documents.

But how do those models compare to each other and to the llama3.2-vision embeddings when applied to, for example, sentiment analysis for photos? Do benchmarks like that exist?

The “equivalent” here would be Jina-Clip (architecture-wise), not necessarily performance.

The ColPali paper(1) does a good job explaining why you don’t really want to directly use vision embeddings; and how you are much better off optimizing for RAG with a ColPali like setup. Basically, it is not optimized for textual understanding, it works if you are searching for the word bird; and images of birds. But doesn’t work well to pull a document where it’s a paper about birds.

1. https://arxiv.org/abs/2407.01449

Makes sense. My main takeaway from the ColPali paper (and your comments) is that ColPali works best for document RAG, whereas vision model embeddings are best used for image similarity search or sentiment analysis. So to answer my own question: The best model to use depends on the application.
Beyond using off-the-shelf embeddings, if you want to teach a model what "similar" means exactly, that's metric learning: https://paperswithcode.com/task/metric-learning