by sashank_1509 1202 days ago
You seem to be grossly mistaken (see https://openai.com/research/clip). Image retrieval is mentioned zero times; the original CLIP was built for robust classification. You provide a thousand candidate words (labels), and it contrastively picks the one that best matches your image. You can extend this by splitting the image into patches and classifying each patch for detection, or by fine-tuning a network on top of it for semantic segmentation.
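
To make the "provide words, get the best match" step concrete, here is a rough sketch of the scoring logic only. It does not call CLIP: the three-dimensional vectors are made-up stand-ins for what the image and text encoders would output, and `zero_shot_classify`, the prompts, and the temperature of 0.01 are illustrative assumptions rather than anything taken from the linked page.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, label_embeddings, temperature=0.01):
    """Score one image embedding against every label embedding.

    Returns a dict mapping each label to a softmax probability.
    The temperature is an assumed value chosen for illustration.
    """
    labels = list(label_embeddings)
    logits = [
        cosine(image_embedding, label_embeddings[label]) / temperature
        for label in labels
    ]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


# Toy embeddings (hypothetical): a real setup would get these from the
# text encoder (one per prompt) and the image encoder (one per image).
label_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, label_embeddings)
best = max(probs, key=probs.get)
print(best)
```

The same scoring could in principle be run per patch for the detection idea, with one embedding per patch instead of one per image.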
1 comments

Apologies, I appreciate the correction; I was indeed mistaken. There is a mention of retrieval at the end of the paper, but the focus is on classification tasks. Here’s the relevant portion in any case.

> Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.