Hacker News
by brookst 823 days ago
The paper explores design choices for different parts of the model and draws conclusions about the relative importance of optimizing each one (the image encoder matters a lot, the vision-language connector less so).

The actual set of models produced (up to 30B parameters) seems secondary to the intent of the paper, and serves more as validation of the best design choices in each area.