|
|
|
|
|
by the_duke
522 days ago
|
|
Side question: is there any good model that allows for image similarity detection across a large image set, that can be incrementally augmented with new images? You'd somehow have to generate an embedding for each image, I presume. |
|
For choosing a model, the article mentions the AWS Titan multimodal model, but you’d have to pay for API access to create the embeddings. Alternatively, self-hosting the CLIP model [0] to create embeddings would avoid API costs.
Follow-up question: Would the embeddings from the llama3.2-vision models be of higher quality (contain more information) than the original CLIP model?
The llama vision models use CLIP under the hood, but they add a projection head to align with the text model and the CLIP weights are mutated during alignment training, so I assume the llama vision embeddings would be of higher quality, but I don’t know for sure. Does anybody know?
(I would love to test this quality myself but Ollama does not yet support creating image embeddings from the llama vision models - a feature request with several upvotes has been opened [1].)
[0] https://github.com/openai/CLIP
[1] https://github.com/ollama/ollama/issues/5304