by CSMastermind 1066 days ago
Wouldn't making the model multimodal require scaling the models significantly?

Or is the idea to keep the network the same size and trade off some of its nodes for image, video, etc. data?

If so, has anyone shown that doing so results in better overall performance?

My lay observation is that GPT-4 seems to be on the border of usability for most applications, so if simply changing the input data type gains nothing compared with expanding the model, it feels like it won't be of much use yet.

Also, apologies if I'm not making sense; I'm almost certainly not using the correct technical terms to articulate what I'm thinking.

1 comment

> Wouldn't making the model multimodal require scaling the models significantly?

Just width, if that makes sense. Basically, you add another encoder model alongside the existing one, but you are not actually increasing the depth that much.
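
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the usual approximation of about 12 * depth * width^2 weights for a transformer stack, and every size in it is a made-up illustrative value, not the configuration of GPT-4 or any real model. The helper names are hypothetical too.

```python
# Rough parameter-count sketch; all sizes below are illustrative assumptions.

def transformer_params(depth: int, width: int) -> int:
    """Approximate weights in a transformer stack.

    Per layer: ~4*width^2 for attention plus ~8*width^2 for an MLP with
    4x expansion. Biases, norms, and embeddings are ignored.
    """
    return 12 * depth * width * width


def projection_params(in_width: int, out_width: int) -> int:
    """Linear map from encoder outputs into the language model's embedding space."""
    return in_width * out_width


# Hypothetical language model backbone.
LM_DEPTH, LM_WIDTH = 48, 4096
# Hypothetical image encoder added alongside it.
ENC_DEPTH, ENC_WIDTH = 24, 1024

lm = transformer_params(LM_DEPTH, LM_WIDTH)
encoder = transformer_params(ENC_DEPTH, ENC_WIDTH)
proj = projection_params(ENC_WIDTH, LM_WIDTH)
added = encoder + proj

print(f"backbone params:  {lm:,}")
print(f"added for images: {added:,} ({added / lm:.1%} of backbone)")
print(f"backbone depth before/after: {LM_DEPTH}/{LM_DEPTH}")
```

Under these assumed sizes the encoder and its projection are small next to the backbone, and the backbone's layer count does not change. Whether that carries over to a real system depends on how large an encoder it actually uses.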