
I think the answer here is "it depends." The Llama-3.2 vision models are the Llama-3.1 models extended with multimodal (image) training, but Meta kept the language model weights frozen and only updated the new image weights. So in the end, the 3.2 vision models benchmark identically to 3.1 on text-only tasks; the image training had no way to improve the language model.

Letting the language model weights update during multimodal training could potentially improve performance on both text and image tasks, though, if Nvidia's result replicates. I could believe that it might: after all, more diverse data is more diverse data, and the model will be forced during training to generalize more.
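To make the frozen-vs-unfrozen distinction concrete, here's a toy sketch in plain Python. It's an illustration under loud assumptions: one scalar "language" weight and one scalar "image adapter" weight, with made-up names (`forward`, `train_step`, `freeze_lm`). It is not Llama's or Nvidia's actual architecture or training code, and it says nothing about whether unfreezing helps or hurts, only that frozen weights can't change text-only behavior.

```python
# Toy model: output = lm * text (+ adapter * image when an image is present).
# Everything here is hypothetical and only illustrates weight freezing.

def forward(params, text, image=None):
    """Text-only path uses just the language weight; images add an adapter term."""
    out = params["lm"] * text
    if image is not None:
        out += params["adapter"] * image
    return out

def train_step(params, text, image, target, lr=0.1, freeze_lm=True):
    """One gradient-descent step on 0.5 * (pred - target)**2."""
    pred = forward(params, text, image)
    err = pred - target  # d(loss)/d(pred)
    new = dict(params)
    # The adapter (image) weight is always trainable.
    new["adapter"] = params["adapter"] - lr * err * image
    # The language weight only moves when it is not frozen.
    if not freeze_lm:
        new["lm"] = params["lm"] - lr * err * text
    return new

base = {"lm": 2.0, "adapter": 0.0}
frozen, unfrozen = base, base
for _ in range(50):
    frozen = train_step(frozen, text=1.0, image=1.0, target=5.0, freeze_lm=True)
    unfrozen = train_step(unfrozen, text=1.0, image=1.0, target=5.0, freeze_lm=False)

# Compare text-only behavior before and after multimodal training.
text_only_before = forward(base, 3.0)
text_only_frozen = forward(frozen, 3.0)
text_only_unfrozen = forward(unfrozen, 3.0)
print(text_only_before, text_only_frozen, text_only_unfrozen)
```

With `freeze_lm=True` the text-only output is the same before and after training, which is the "benchmarks identically" situation. With `freeze_lm=False` the language weight moves, so text-only behavior changes too, for better or worse depending on the data.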