by marcinzm 766 days ago
That seems odd, since I also don't see how this differs from other approaches being published, except that what everyone else calls an image encoder (i.e., some type of pre-trained VAE architecture) they call a tokenizer. The Apple MM1 paper, for example, used ViT-L for its image encoder and then C-Abstractor for its image tokenizer.

The biggest difference is that existing multimodal models (e.g., GPT-4V and MM1) trained the text model first and then added the image component after text training was done ('late fusion'). MM1 learns a projection into the text space rather than discrete tokens, and thus cannot generate images.
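
To make the projection-versus-discrete-tokens distinction concrete, here is a toy sketch. It is not MM1's or any other published model's code: the codebook, the projection matrix, the vocabulary size, and the function names are all made up for illustration, and a nearest-neighbour codebook lookup is just one assumed way to get discrete image tokens.

```python
# Toy contrast between a continuous projection and discrete image tokens.
# All sizes, weights, and names here are hypothetical, chosen for illustration.

TEXT_VOCAB = 10  # assumed number of ordinary text tokens (ids 0..9)

# Assumed 5-entry codebook of 3-d image features (stand-in for a learned codebook).
CODEBOOK = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Assumed 4x3 projection from image-feature space to a 4-d text embedding space.
PROJECTION = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]


def project(feature):
    """Continuous route: map an image feature to a vector in the text embedding space."""
    return [sum(w * x for w, x in zip(row, feature)) for row in PROJECTION]


def tokenize(feature):
    """Discrete route: return a token id for the nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    code = min(range(len(CODEBOOK)), key=lambda i: sq_dist(feature, CODEBOOK[i]))
    return TEXT_VOCAB + code  # image ids sit after the text ids in one shared vocabulary


def detokenize(token_id):
    """Inverse of the discrete route: map an image token id back to its codebook feature."""
    return CODEBOOK[token_id - TEXT_VOCAB]


patch = [0.9, 0.1, 0.0]          # made-up image patch feature
soft_input = project(patch)      # input-only vector; there is no token id to predict
image_token = tokenize(patch)    # an id the model could also emit as output
reconstructed = detokenize(image_token)
```

In the discrete route the image ids live in the same vocabulary the model predicts over, so in principle it can emit them and something downstream can decode them back toward an image. The projected vector has no corresponding id, so it can only be consumed as input. A real system would still need a learned decoder from codes to pixels, which this sketch leaves out.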

Other work allows the model to learn the 'tokenization' more explicitly during training. That is more similar to Adept's Fuyu architecture, which I am personally a fan of, but it also does not enable image generation.

You can generate images using late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.