| HN Mirror


	by alekandreev 466 days ago
	Picking model sizes is not an exact science. We look for sizes that will fit quantized on different categories of devices (e.g., low-end and high-end smartphones, laptops and 16GB GPUs, and bigger GPUs/TPUs). We also want the ratio of model width to depth (number of layers) to be consistently around 90, which we found works best. The models are trained with distillation from a bigger teacher. We train them independently, but for v3 we have unified the recipes for 4B-27B, to give you more predictability when scaling up and down to different model sizes.
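
To make the sizing arithmetic above concrete, here is a rough back-of-the-envelope sketch. It is not the team's actual tooling. It assumes weight-only memory (no KV cache, activations, or runtime overhead), treats "4B" and "27B" as nominal parameter counts, and uses a hypothetical 80% headroom factor. All function names are made up for illustration.

```python
# Rough sizing helpers. All names and the headroom factor are illustrative
# assumptions, not part of any published recipe.

def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GiB for a given quantization width."""
    return n_params * bits_per_weight / 8 / 2**30


def fits(n_params: float, bits_per_weight: float, device_gib: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in a fraction (headroom) of device memory."""
    return weight_footprint_gib(n_params, bits_per_weight) <= device_gib * headroom


def width_to_depth_ratio(width: int, n_layers: int) -> float:
    """Ratio of model width (hidden size) to depth (number of layers)."""
    return width / n_layers


if __name__ == "__main__":
    # Nominal parameter counts taken from the size names in the comment above.
    for name, n_params in [("4B", 4e9), ("27B", 27e9)]:
        for bits in (16, 8, 4):
            gib = weight_footprint_gib(n_params, bits)
            ok = fits(n_params, bits, device_gib=16)
            print(f"{name} @ {bits}-bit: {gib:.1f} GiB, fits 16 GiB device: {ok}")

    # Hypothetical width/depth pair chosen only to illustrate a ratio of 90.
    print(width_to_depth_ratio(2880, 32))
```

Under these assumptions, a nominal 27B model only fits a 16 GiB device at around 4 bits per weight, which is roughly the kind of constraint the comment describes.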

2 comments

magicalhippo 466 days ago

Thanks again, very interesting.

One unexpected (to me) use-case appeared not long ago when I found myself without internet but wanting to fix some non-standard Linux configuration issue. As a Windows guy I tend to web search such things, but local LLM to the rescue!

Even a smaller model like Gemma 2 9B has enough compressed knowledge that it managed to help me quickly solve my issue.

This got me thinking about how such smaller but very capable models might be a game-changer in communities where internet access is unavailable or too expensive for continuous use. It's almost like having a portion of the internet in a box, just add electricity.

alekandreev 466 days ago

Thank you for the feedback! This is why we are so excited to push more and more on small models for both low-end and high-end smartphones!

bguberfain 466 days ago

Can you provide more information about this “bigger teacher” model?