by marcinzm
766 days ago
That seems odd, since I also don't see how this differs from other approaches being published, except that what everyone else calls an image encoder (i.e., some type of pre-trained VAE architecture) they call a tokenizer. The Apple MM1 paper, for example, used ViT-L for its image encoder and then C-Abstractor for its image tokenizer.

Other work allows the model to learn the 'tokenization' more explicitly during training. That's more similar to Adept's Fuyu architecture, which I am personally a fan of, but it also does not enable generating images.
You can generate images using late fusion as well, though I am not aware of other public work that discloses both early fusion and image generation.