|
|
|
|
|
by ResearchAtPlay
526 days ago
|
|
Yes, you could implement image similarity search using embeddings: Create embeddings for the entire image set, save the embeddings in a database, and add embeddings incrementally as new images come in. To search for a similar image, create the embedding for the image that you are looking for and compute the cosine similarity between that embedding and the embeddings in your database. The closer the cosine similarity is to 1.0 the more similar the images. For choosing a model, the article mentions the AWS Titan multimodal model, but you’d have to pay for API access to create the embeddings. Alternatively, self-hosting the CLIP model [0] to create embeddings would avoid API costs. Follow-up question: Would the embeddings from the llama3.2-vision models be of higher quality (contain more information) than the original CLIP model? The llama vision models use CLIP under the hood, but they add a projection head to align with the text model and the CLIP weights are mutated during alignment training, so I assume the llama vision embeddings would be of higher quality, but I don’t know for sure. Does anybody know? (I would love to test this quality myself but Ollama does not yet support creating image embeddings from the llama vision models - a feature request with several upvotes has been opened [1].) [0] https://github.com/openai/CLIP [1] https://github.com/ollama/ollama/issues/5304 |
|
We have a good open-source repo here with a ColPali implementation: https://github.com/tjmlabs/ColiVara