by vov_or
1203 days ago
The difference is not only in the data source but in the pre-training tasks as well.
But you are right, models fine-tuned on human-annotated data are way better than zero-shot (just pre-trained) ones at image retrieval.
And that holds for CLIP, ALBEF, VICHA, and UFORM.