by kolja005
972 days ago
I probably could have guessed that contrastive pre-training works better for downstream vision-language tasks than image classification pre-training, but it's nice to see this hypothesis thoroughly tested here. Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.