| HN Mirror



	by heyitsguay 1064 days ago
	I know it's not the main point of this, but... so many multimodal models now that take frozen vision encoders and language decoders and weld them together with a projection layer! I wanna grab the EVA02-CLIP-E image encoder and the Llama-2 33B model and do the same, I bet that'd be fun :D
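For readers who haven't seen the "weld them together" recipe, here is a minimal sketch of the idea in plain Python, under loud assumptions: the sizes are toy values, the "encoder output" and "text embeddings" are random stand-ins, and `make_projection` / `build_llm_inputs` are hypothetical helper names, not any real model's API. The only part that would be trained is the projection; both big models stay frozen.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a single vector (stand-in for a linear layer)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def make_projection(d_in, d_out, seed=0):
    """Randomly initialise the only trainable piece: a d_in -> d_out projection."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias

def build_llm_inputs(image_features, text_embeddings, weight, bias):
    """Project each image feature into the LLM embedding space, then
    prepend the resulting visual tokens to the text token embeddings."""
    visual_tokens = [linear(f, weight, bias) for f in image_features]
    return visual_tokens + text_embeddings

# Toy sizes, assumed for illustration only (real models are far larger).
D_VISION, D_LLM = 8, 16
NUM_PATCHES, NUM_TEXT = 4, 3

rng = random.Random(1)
# Stand-ins for the frozen vision encoder output and the frozen LLM's
# token embeddings; in practice these come from the pretrained models.
image_features = [[rng.random() for _ in range(D_VISION)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.random() for _ in range(D_LLM)] for _ in range(NUM_TEXT)]

weight, bias = make_projection(D_VISION, D_LLM)
inputs = build_llm_inputs(image_features, text_embeddings, weight, bias)
print(len(inputs), len(inputs[0]))  # prints: 7 16
```

The sequence handed to the language model is just the projected image tokens followed by the text tokens, all at the LLM's embedding width.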

2 comments

famouswaffles 1064 days ago

Just to be clear, the Q-Former isn't necessary. LLaVA is just a projection layer.


GaggiX 1064 days ago

Not just a projection layer but also a Q-Former. In this case it was already trained for that specific vision encoder, but if you change the encoder you would need to train a Q-Former from scratch.


famouswaffles 1064 days ago

Not for MiniGPT-4, but it's just a projection layer for many others (like LLaVA). The Q-Former isn't a necessary part of the equation.
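To make the Q-Former side of this disagreement concrete, here is a heavily simplified sketch of its core idea: a small, fixed set of learned query vectors cross-attends to the image features, so the number of visual tokens stays constant no matter how many patches the encoder emits. This is an assumption-laden toy, not the real module: it uses single-head attention with no learned key/value projections, random "learned" queries, and hypothetical names (`cross_attend`, `NUM_QUERIES`). Because the queries are trained against one encoder's feature space, swapping the encoder is what forces retraining.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, features):
    """Each query attends over all image features and returns their
    attention-weighted average (single head, no learned projections)."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / math.sqrt(dim) for f in features]
        weights = softmax(scores)
        outputs.append([sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)])
    return outputs

DIM, NUM_QUERIES = 8, 4  # toy values, assumed for illustration
rng = random.Random(0)
# Stand-in for the learned query vectors (random here, trained in practice).
queries = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Two images with different patch counts from a hypothetical encoder.
few_patches = [[rng.random() for _ in range(DIM)] for _ in range(10)]
many_patches = [[rng.random() for _ in range(DIM)] for _ in range(50)]

tokens_few = cross_attend(queries, few_patches)
tokens_many = cross_attend(queries, many_patches)
print(len(tokens_few), len(tokens_many))  # prints: 4 4
```

A plain projection layer, by contrast, passes through one token per patch, so it has nothing encoder-specific to learn beyond the change of dimension.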
