| HN Mirror



	by kolja005 972 days ago
	I probably could have guessed that contrastive pre-training works better than image classification pre-training for downstream vision-language tasks, but it's nice to see the hypothesis thoroughly tested here. Also, hacks to get LLMs to generate structured output seem to be mostly getting the job done. I'm less optimistic about this approach for traditional vision tasks where language is the interface, however. Are we going to get models to output a pixel-wise segmentation mask as text? I want to doubt it, but seeing how LLMs are able to output long sequences of structured text leaves my mind open.
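
To make the "mask as text" question concrete: a text encoding need not spend one token per pixel. Below is a minimal sketch, assuming a simple run-length format ("H W start:length ..."). The format and the function names `mask_to_text` and `text_to_mask` are illustrative assumptions, not something taken from the linked work or from any particular model.

```python
import numpy as np

def mask_to_text(mask):
    """Encode a binary mask as 'H W start:length ...' (row-major runs of 1s)."""
    mask = np.asarray(mask, dtype=np.uint8)
    h, w = mask.shape
    flat = mask.ravel()
    # Pad with zeros so every run of 1s has a detectable start and end.
    padded = np.concatenate([[0], flat, [0]])
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    runs = " ".join(f"{s}:{e - s}" for s, e in zip(starts, ends))
    return f"{h} {w} {runs}".strip()

def text_to_mask(text):
    """Decode the format produced by mask_to_text back into a 2D array."""
    parts = text.split()
    h, w = int(parts[0]), int(parts[1])
    flat = np.zeros(h * w, dtype=np.uint8)
    for run in parts[2:]:
        start, length = map(int, run.split(":"))
        flat[start:start + length] = 1
    return flat.reshape(h, w)

example = np.array([[0, 1, 1, 0],
                    [0, 0, 1, 1]], dtype=np.uint8)
encoded = mask_to_text(example)
decoded = text_to_mask(encoded)
```

The text grows with the number of runs rather than the number of pixels, so blobby masks stay short while noisy ones get long. Whether a model can emit such strings reliably is a separate question.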

1 comment

mattnewton 972 days ago

I don’t see why not. “Segment Anything” from Meta seems to handle labeled pixel-wise segmentation masks fairly well. You can also get rough masks today by looking at where the text part of the model attends in the image part.
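
A rough sketch of the attention trick, assuming you have already pulled out a 2D map of attention weights from one text token to the image patches. How to extract it (which layer, which head, how to average) depends on the model and is not shown. The function name `attention_to_mask`, the nearest-neighbor upsampling, and the 0.5 threshold are all illustrative assumptions.

```python
import numpy as np

def attention_to_mask(attn, image_hw, threshold=0.5):
    """Turn a (grid_h, grid_w) attention map into a boolean (H, W) mask."""
    attn = np.asarray(attn, dtype=np.float64)
    lo, hi = attn.min(), attn.max()
    # Min-max normalize to [0, 1]; a flat map yields an all-zero result.
    norm = (attn - lo) / (hi - lo) if hi > lo else np.zeros_like(attn)
    H, W = image_hw
    gh, gw = norm.shape
    # Nearest-neighbor upsampling: map each pixel to the patch that covers it.
    rows = np.arange(H) * gh // H
    cols = np.arange(W) * gw // W
    upsampled = norm[np.ix_(rows, cols)]
    return upsampled >= threshold

# Toy 2x2 patch grid where the top-right patch gets most of the attention.
attn = [[0.1, 0.9],
        [0.2, 0.3]]
mask = attention_to_mask(attn, image_hw=(4, 4))
```

Masks produced this way are only as fine as the patch grid, which is why they come out rough. Smoother interpolation or a refinement step would be needed for clean boundaries.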
