| HN Mirror


by aconz2 755 days ago

> Recent multimodal foundation models are very widely adopted but still model different modalities separately, often using modality specific encoders or decoders.

Is this accurate? I thought, for example, that Gemini Pro used image tokens, and GPT-4o something similar.

> without the need for separate image/text encoders

But then they say they pre-trained two different tokenizers, so maybe they just mean that the tokens go into the same attention layer? But I thought that is how all the multi-modal stuff was happening already?

Two typos: "stabilitize" and "multiplicate".

2 comments

marcinzm 755 days ago

That seems odd, since I also don't see how this differs from other approaches being published, except that what everyone else calls an image encoder (i.e., some type of pre-trained VAE architecture) they call a tokenizer. The Apple MM1 paper, for example, used ViT-L for its image encoder and then C-Abstractor for its image tokenizer.


huac 755 days ago

The biggest difference is that existing multimodal models (e.g. GPT-4V and MM1) trained the text model first, and then added in the image component after text training was done ("late fusion"). MM1 learns a projection into the text space, not discrete tokens, and thus cannot generate images.

Other work allows the model to learn the "tokenization" more explicitly during training. That's more similar to Adept's Fuyu architecture, which I am personally a fan of, but which also does not enable generating images.

You can generate images using late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.
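To make the projection-versus-discrete-tokens distinction above concrete, here is a toy sketch in plain Python. It is not taken from the paper, MM1, or Fuyu: every size, name, and random weight below is a made-up assumption, and the "quantizer" is just a nearest-neighbour lookup standing in for a real learned image tokenizer.

```python
import random

rng = random.Random(0)  # fixed seed so the toy weights are reproducible

TEXT_VOCAB = 8    # hypothetical number of text tokens
IMAGE_CODES = 4   # hypothetical size of the image codebook
FEAT_DIM = 2      # hypothetical width of a raw image-patch feature
MODEL_DIM = 3     # hypothetical width of the transformer input


def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Projection style: image features are mapped straight into the text
# embedding space as continuous vectors. They never get a token id.
text_embed = rand_matrix(TEXT_VOCAB, MODEL_DIM)
projector = rand_matrix(MODEL_DIM, FEAT_DIM)


def projected_inputs(patch_features, text_ids):
    image_part = [matvec(projector, f) for f in patch_features]
    text_part = [text_embed[t] for t in text_ids]
    return image_part + text_part


# Discrete-token style: each patch is snapped to a codebook entry, and the
# resulting ids share one vocabulary (and one embedding table) with text.
codebook = rand_matrix(IMAGE_CODES, FEAT_DIM)
shared_embed = rand_matrix(TEXT_VOCAB + IMAGE_CODES, MODEL_DIM)


def quantize(feature):
    """Index of the nearest codebook entry (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(code, feature))
    return min(range(IMAGE_CODES), key=lambda i: sq_dist(codebook[i]))


def tokenized_inputs(patch_features, text_ids):
    # Image ids are offset so they sit after the text ids in the vocabulary.
    image_ids = [TEXT_VOCAB + quantize(f) for f in patch_features]
    ids = image_ids + list(text_ids)
    return ids, [shared_embed[i] for i in ids]


patches = [[0.2, -0.5], [0.9, 0.1]]   # two made-up patch features
prompt = [1, 5, 2]                    # three made-up text token ids

vectors = projected_inputs(patches, prompt)
ids, embedded = tokenized_inputs(patches, prompt)
print(len(vectors), "input vectors; the image part has no token ids")
print("shared-vocabulary ids:", ids)
```

In the second style the image patches become ordinary ids in the same vocabulary as text, so a next-token head over that vocabulary could in principle emit image tokens too. In the first style the image only ever exists as continuous input vectors, so a text-only output head has nothing image-shaped to predict.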


mountainriver 755 days ago

Vision language models use various encoders to project the image into tokens. This is just a means of getting a unified encoder across modalities.
