by aconz2
755 days ago
> Recent multimodal foundation models are very widely adopted but still model different modalities separately, often using modality specific encoders or decoders

Is this accurate? I thought, for example, Gemini Pro used image tokens, and GPT-4o something similar.

> without the need for separate image/text encoders

But then they say they pre-trained two different tokenizers, so maybe they just mean that the tokens go into the same attention layer? But then I thought that is how all the multi-modal stuff was happening already?

Two typos: "stabilitize" and "multiplicate".