Hacker News
by philomath868 298 days ago
I hear you loud and clear... Thanks!

What about deleting vision layers (e.g. the "multi_modal_projector" and the "vision_tower.vision_model" layers, assuming I go with Gemma 3), since I need just language generation? Would that also be considered a "kick in the balls", or a useful trimming?

1 comment

Should be safe to do, as long as none of that is load-bearing. If it's the usual naive "massage the image into a hundred tokens and throw that into the context" vision implementation, nothing bad should happen from removing those layers or just freezing them.

I've seen "cut off unused vision inputs" done for older multimodal models, just not for the newer Gemma 3.
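
For what it's worth, the trimming itself is mostly a matter of filtering parameter names. Here's a rough sketch in plain Python, assuming the checkpoint is exposed as a name-to-tensor mapping and that the vision parts really do live under the two names mentioned above. `strip_vision` and the toy keys are made up for illustration; check the real key names in your checkpoint before trusting the match.

```python
# Markers taken from the question above; real checkpoints may nest them
# under an extra prefix, so we match by substring rather than by prefix.
VISION_MARKERS = ("multi_modal_projector", "vision_tower.vision_model")


def strip_vision(state_dict, markers=VISION_MARKERS):
    """Hypothetical helper: split a name->tensor mapping into the entries
    to keep and the names of the vision entries that were dropped."""
    kept = {}
    dropped = []
    for name, tensor in state_dict.items():
        if any(marker in name for marker in markers):
            dropped.append(name)
        else:
            kept[name] = tensor
    return kept, dropped


# Toy stand-in for a real state dict. The keys are illustrative only and
# the values are plain lists instead of tensors.
toy = {
    "language_model.layers.0.self_attn.q_proj.weight": [0.1],
    "multi_modal_projector.proj.weight": [0.2],
    "vision_tower.vision_model.encoder.layers.0.fc1.weight": [0.3],
    "language_model.lm_head.weight": [0.4],
}

kept, dropped = strip_vision(toy)
print(sorted(kept))
print(dropped)
```

This only drops the weights. Whether the remaining language model loads cleanly on its own depends on the model class and config you load it with, which is the "load-bearing" part to verify.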