Of course models purely made for image stuff will completely wipe it out. The vision language models are useful for their generalist capabilities