Hacker News new | ask | show | jobs
by cubacaban 1282 days ago
Mapping corresponding text and image into the same vector space is exactly how CLIP and other contrastive learning setups work: Have text and computer vision networks embed input data, and teach the networks to embed related inputs closely and unrelated ones far away. You can train the models on data scraped from the internet (e.g. images and corresponding captions) in a self-supervised way. Like you say it has huge applications for search and is also used by the generative models (to guide a generated image towards your textual input).

BTW it seems that the data2vec is at its core actually just one model, only the input and output parts are different depending on the modality (text, image sound etc.) I wouldn't expect the learned representations to be similar for similar content across modalities. The point of data2vec is to use just one model with a very general self-supervised learning setup for tasks, with the different tasks hopefully benefitting from each other during training.