by vov_or 1203 days ago
The difference is not only in the data source but in the pre-training tasks as well. But you are right: models fine-tuned on human-annotated data are much better at image retrieval than zero-shot (pre-trained only) models. That holds for CLIP, ALBEF, VICHA, and UFORM.
1 comment

Any plans to document how to fine-tune your models, then?
It will take some time, but yes, we have this in our plans.