Where have I been? What is a “small” language model? Wikipedia just talks about LLMs. Is this a sort of spectrum? Are there medium language models? Or is it a more nuanced classifier?
I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). iirc this was also the inspiration for the Phi family of models. The essential point (of the TinyStories paper), "if we train a model on text meant for 3-4 year olds, since that's much simpler shouldn't we need fewer parameters?" Which is correct. In the original they have a model that's 32 Million parameters and they compare it GPT-2 (1.5 Billion parameters) and the 32M model does much better. Microsoft has been interesed in this because "lower models == less resource usage" which means they can run on consumer devices. You can easily run TinyStories from your phone, which is presumably what Microsoft wants to do too.
7B vs 70B parameters... I think. The small ones fit in the memory of consumer grade cards. That's what I more or less know (waiting for my new computer to arrive this week)
GPT-3 had 175B, and the original ChatGPT was probably just a GPT-3 finetune (although they called it gpt-3.5, so it could have been different). However, it was severely undertrained. Llama-3.1-8B is better in most ways than the original ChatGPT; a well-trained ~70B usually feels GPT-4-level. The latest Llama release, llama-3.3-70b, goes toe-to-toe even with much larger models (albeit is bad at coding, like all Llama models so far; it's not inherent to the size, since Qwen is good, so I'm hoping the Llama 4 series is trained on more coding tokens).