by sashank_1509 1207 days ago
They seem to be testing only the image retrieval task, but I don’t think CLIP is actually used much for image retrieval. In most cases, I see CLIP being used for semantic segmentation, detection, etc. Do these guys have similar results on those tasks?

Hi! I am one of the contributors! We were focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today. We are also building a semantic multimodal search engine as a DBMS component. That is why image retrieval, along with inference performance, is so crucial for us. Also, for semantic segmentation and detection, you probably use only the image encoder part of CLIP.
I think it’s fine that you’re focused on retrieval, but you should add that as a caveat to your results: 100 times better at retrieval. As an ML researcher in grad school, here’s what more than 80% of the CLIP use cases I’ve seen look like: take a random image and a random set of texts (these can just be categories separated by commas), and CLIP will find the text that best matches your image. CLIP is also incredibly robust at this; you can literally take a picture with your phone and it will give you reasonable results. If you sped such a model up by 100x in inference or training, that would be a huge deal to the entire ML research community, and you could expect some best paper awards (maybe even VC funding, looking at Stable Diffusion) to come your way.
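To make the zero-shot use case above concrete, here is a minimal sketch of the matching step. It assumes you already have one image embedding and one text embedding per candidate label from a CLIP-style model; the vectors below are made-up toy values, and `zero_shot_classify` and the `scale` value of 100 are illustrative choices, not an official API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, scale=100.0):
    """Pick the label whose text embedding best matches the image embedding.

    `scale` is an assumed temperature-like multiplier on the similarities.
    """
    sims = [scale * cosine(image_emb, t) for t in text_embs]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Toy stand-ins for real encoder outputs (hypothetical values).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

best_label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(best_label)
```

With a real model, the only change would be replacing the toy vectors with the outputs of the image and text encoders; the comparison itself stays this simple, which is why the approach works with an arbitrary comma-separated list of categories.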
Hi! You are right that we should have clarified that it is "100 times better at retrieval". By the way, we plan to tune the models, evaluate them, and publish results on other tasks (zero-shot ImageNet classification, etc.).
In practice, CLIP can be used for many things. Originally, however, the primary focus was, and is, indeed retrieval. This is evident from the contrastive loss used, which minimizes errors with respect to a single hard positive from a batch of thousands of known hard negatives.

This is also informed by existing computer science objectives surrounding indexing, clustering of data and efficient search over data features.
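For readers unfamiliar with the contrastive loss mentioned above, here is a rough sketch of a symmetric contrastive objective over a batch of image-text pairs, in the spirit of the pseudocode in the CLIP paper. It assumes `logits[i][j]` already holds the scaled similarity between image `i` and text `j`, with matching pairs on the diagonal and every other pair in the batch acting as a negative. The function names are hypothetical, and this is plain Python for illustration, not a training-ready implementation.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(logits):
    """Symmetric cross-entropy over an n x n similarity matrix.

    Row i treats text i as the correct class for image i;
    column j treats image j as the correct class for text j.
    """
    n = len(logits)
    # Image -> text direction: classify each image over all texts in the batch.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image direction: same thing on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

# Toy batches: one where matching pairs score highest, one where they do not.
aligned = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
shuffled = [[0.0, 10.0, 0.0], [0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]
print(contrastive_loss(aligned) < contrastive_loss(shuffled))
```

Note that each direction is just a classification problem over the batch, which is part of why the same trained model can be read either as a retriever or as a zero-shot classifier.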

You seem to be grossly mistaken: https://openai.com/research/clip Image retrieval is mentioned zero times; the original CLIP was built for robust classification. You provide a thousand words, and contrastively it shows you the best match to your image. You can extend this by splitting the image into patches and classifying each patch for detection, or by fine-tuning a network on top of it for semantic segmentation.
Apologies, I appreciate the correction, and indeed I was mistaken. There is a mention of retrieval at the end of the paper, but the focus is indeed on classification tasks. Here’s the relevant portion in any case.

> Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.