Hacker News
by whimsicalism 1066 days ago
> Wouldn't making the model multimodal require scaling the models significantly?

Mostly just width, if that makes sense. Basically, you add another encoder for the new modality next to the existing model, so the model gets wider, but you are not actually increasing the depth much.
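To make the width-versus-depth point concrete, here is a rough sketch. Everything in it is an assumption for illustration: the `Stack` class, the layer counts, the 768 hidden size, and the idea that both encoders feed a shared trunk are hypothetical, not a description of any particular model. The per-layer estimate of 12 * d_model^2 is the usual back-of-the-envelope count for a transformer layer (attention plus a 4x-wide MLP, ignoring biases and norms).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """A stack of identical transformer-style layers (hypothetical sizes)."""
    layers: int
    d_model: int

    def params(self) -> int:
        # Rough per-layer count: 4*d^2 for attention + 8*d^2 for a 4x-wide MLP.
        return self.layers * 12 * self.d_model ** 2


def total_params(encoders, trunk):
    """Parameters of all encoders plus the shared trunk."""
    return sum(e.params() for e in encoders) + trunk.params()


def sequential_depth(encoders, trunk):
    """Layers on the longest input-to-output path.

    Encoders run side by side, so only the deepest one counts.
    """
    return max(e.layers for e in encoders) + trunk.layers


# Assumed toy configuration, not taken from any real model.
text_encoder = Stack(layers=12, d_model=768)
image_encoder = Stack(layers=12, d_model=768)
trunk = Stack(layers=12, d_model=768)

text_only = [text_encoder]
multimodal = [text_encoder, image_encoder]

print("params:", total_params(text_only, trunk), "->", total_params(multimodal, trunk))
print("depth: ", sequential_depth(text_only, trunk), "->", sequential_depth(multimodal, trunk))
```

In this toy setup the second encoder adds parameters, but the longest path through the network stays the same length, because the two encoders sit in parallel rather than being stacked.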