Hacker News
by vov_or 1203 days ago
Hi! I am one of the contributors! We focused on image retrieval only. Almost all semantic search engines for images are based on CLIP today. We are also building a semantic multimodal search engine as a DBMS component. That is why image retrieval is so crucial for us, as is inference performance. Also, for semantic segmentation and detection, you probably use only the image encoder part of CLIP.

I think it's fine that you're focused on retrieval, but you should add that as a caveat to your results: 100 times better at retrieval. As an ML researcher in grad school, here's what more than 80% of the CLIP use cases I've seen look like: take a random image and a random set of texts (these can just be categories separated by commas), and CLIP will find the text that best matches your image. CLIP is also incredibly robust at this; you can literally take a picture with your phone and it will give you reasonable results. If you speed such a model up by 100x in inference or training, that would be a huge deal for the entire ML research community, and you can expect some best paper awards (maybe even VC money, looking at Stable Diffusion) to come your way.
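For readers who haven't seen this use case, here is a rough sketch of the matching step described above: compare one image embedding against several text embeddings and pick the closest text. The vectors are made-up toy values standing in for encoder outputs, and the function name and the `logit_scale` of 100 are assumptions for illustration, not anything taken from the project discussed in this thread.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Return the best-matching label and a softmax distribution over all labels.

    image_emb and text_embs are assumed to come from an image encoder and a
    text encoder that share one embedding space. logit_scale is a hypothetical
    temperature; real models learn or fix their own value.
    """
    img = normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = normalize(text_emb)
        cosine = sum(a * b for a, b in zip(img, txt))
        logits.append(logit_scale * cosine)

    # Numerically stable softmax over the scaled similarities.
    max_logit = max(logits)
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy embeddings (invented values, not real encoder outputs).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_emb = [0.8, 0.3, 0.1]

best_label, probs = zero_shot_classify(image_emb, text_embs, labels)
print(best_label)
```

With a real model, only the two embedding inputs would change; the comparison itself stays this simple, which is part of why this use case is so popular.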
Hi! You are right that we should have clarified that it is "100 times better at retrieval". Btw, we plan to tune the models, evaluate them, and publish results on other tasks (zero-shot ImageNet classification, etc.).