by spott
18 days ago
This is basically just early fusion. FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818 I've been waiting for something like this to be released since then. The annoying thing is that Chameleon was built on the same principles and had multimodal output too, but this model is multimodal on the input side only... (I'm curious how they did pre-training without multimodal outputs as well. I wonder if they just chopped those outputs off rather than support image output.)