Hacker News
by ShamelessC 1203 days ago
In practice, CLIP can be used for many things. Originally, however, the primary focus was (and is) indeed retrieval. This is evident from the contrastive loss, which minimizes error with respect to a single hard positive against a batch of thousands of known hard negatives.

This is also informed by existing computer science objectives surrounding indexing, clustering of data and efficient search over data features.
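
For anyone who wants to see the "one positive against the rest of the batch" idea concretely, here is a minimal sketch of a symmetric contrastive loss in plain Python. It is an illustration under assumptions, not CLIP's actual training code: the function names are made up for this example, the embeddings are plain lists of floats, and the temperature is fixed at 0.07 rather than learned.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target index, computed in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Pair i (image i, text i) is the single positive; every other
    combination in the batch acts as a negative. The temperature
    value is an assumption for this sketch.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should put its mass on column i.
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should put its mass on row j.
    loss_t2i = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_i2t + loss_t2i) / 2
```

With correctly paired embeddings the loss is close to zero; swapping the texts so that each image is paired with the wrong caption makes it much larger.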

1 comments

You seem to be grossly mistaken: https://openai.com/research/clip Image retrieval is mentioned zero times; the original CLIP was built for robust classification. You provide a thousand words as candidate labels and, contrastively, it shows you which one best matches your image. You can extend this by splitting the image into patches and classifying each patch for detection, or by fine-tuning a network on top of it for semantic segmentation.
Apologies, I appreciate the correction, and indeed I was mistaken. There is a mention of retrieval at the end of the paper, but the focus is indeed on classification tasks. Here's the relevant portion in any case.

> Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.
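
As a rough illustration of why the two uses discussed in this thread sit so close together, here is a minimal sketch: zero-shot classification ranks text prompts for one image, and retrieval ranks images for one text query, both using the same cosine similarity. The function names and the idea of passing precomputed embeddings as plain lists are assumptions made for this example; a real setup would obtain the embeddings from the model's image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs):
    """Return the index of the label prompt most similar to one image."""
    scores = [cosine(image_emb, t) for t in label_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def retrieve(text_emb, image_embs, k=1):
    """Return the indices of the k images most similar to one text query."""
    scores = [cosine(text_emb, v) for v in image_embs]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return ranked[:k]
```

The only difference between the two functions is which side is held fixed and which side is ranked.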