|
|
|
|
|
by sashank_1509
1202 days ago
|
|
You seem to be grossly mistaken: https://openai.com/research/clip
Image retrieval is mentioned zero times, original CLIP was built for robust classification. You provide a thousand words and contrastively it shows you the best match to your image. You can extend this by splitting the image into patches and classifying each path for detection, fine tuning a network on top of it for semantic segmentation . |
|
> Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.