Hacker News
by travisgriggs 547 days ago
Where have I been? What is a “small” language model? Wikipedia just talks about LLMs. Is this a sort of spectrum? Are there medium language models? Or is it a more nuanced classifier?
4 comments

I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). iirc this was also the inspiration for the Phi family of models. The essential point (of the TinyStories paper) is, "if we train a model on text meant for 3-4 year olds, since that's much simpler, shouldn't we need fewer parameters?" Which is correct. In the original they have a model that's 32 million parameters and they compare it to GPT-2 (1.5 billion parameters), and the 32M model does much better at that kind of simple story text. Microsoft has been interested in this because "smaller models == less resource usage", which means they can run on consumer devices. You can easily run TinyStories from your phone, which is presumably what Microsoft wants to do too.
There are all sizes of models from a few GB to hundreds of GB. Small presumably means small enough to run on end-user hardware.
7B vs 70B parameters... I think. The small ones fit in the memory of consumer grade cards. That's what I more or less know (waiting for my new computer to arrive this week)
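To put rough numbers on "fits in the memory of a consumer card", here is a back-of-the-envelope sketch. It assumes memory is dominated by the weights alone (no KV cache, activations, or runtime overhead), and the bytes-per-parameter table and the helper name `weight_memory_gb` are my own illustrative choices, not anything from a particular library.

```python
# Rough weight-only memory estimate for a language model.
# Assumption: ignores KV cache, activations, and framework overhead,
# so real usage will be somewhat higher than these figures.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Approximate gigabytes (10^9 bytes) needed just to hold the weights."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
        for dtype in ("fp16", "int4"):
            gb = weight_memory_gb(n_params, dtype)
            print(f"{name} @ {dtype}: ~{gb:.1f} GB")
```

Under those assumptions a 7B model is about 14 GB at fp16 and about 3.5 GB at 4-bit, while a 70B model is about 140 GB at fp16 and about 35 GB at 4-bit, which is roughly why the 7B class is the one people run on a single consumer GPU.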
How many parameters did ChatGPT have in Dec 2022 when it first broke into mainstream news?
GPT-3 had 175B, and the original ChatGPT was probably just a GPT-3 finetune (although they called it gpt-3.5, so it could have been different). However, it was severely undertrained. Llama-3.1-8B is better in most ways than the original ChatGPT; a well-trained ~70B usually feels GPT-4-level. The latest Llama release, llama-3.3-70b, goes toe-to-toe even with much larger models (albeit it is bad at coding, like all Llama models so far; that's not inherent to the size, since Qwen is good, so I'm hoping the Llama 4 series is trained on more coding tokens).
> However, it was severely undertrained

by modern standards. at the time, it was trained according to neural scaling laws oai believed to hold.

Sure, at the time everyone misunderstood Chinchilla. Nonetheless it was severely undertrained, even if they didn't know it back then.
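For anyone wondering what "undertrained" means in numbers, here is a rough sketch comparing training tokens per parameter. The assumptions: the often-quoted Chinchilla rule of thumb of about 20 tokens per parameter, the roughly 300B training tokens reported for GPT-3 175B, and the roughly 15T tokens Meta reported for Llama 3. All of these are approximate, ChatGPT's own figures were never published, and the function names are just illustrative.

```python
# Tokens-per-parameter comparison, using approximate public figures.
# Assumption: ~20 tokens/param is only a rule-of-thumb reading of Chinchilla.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens the model saw per parameter."""
    return n_tokens / n_params


def chinchilla_token_budget(n_params: float) -> float:
    """Training tokens suggested by the ~20 tokens/param heuristic."""
    return n_params * CHINCHILLA_TOKENS_PER_PARAM


if __name__ == "__main__":
    # (name, parameters, approximate training tokens)
    models = [
        ("GPT-3 175B", 175e9, 300e9),
        ("Llama-3.1-8B", 8e9, 15e12),
    ]
    for name, n_params, n_tokens in models:
        ratio = tokens_per_param(n_params, n_tokens)
        budget = chinchilla_token_budget(n_params)
        print(f"{name}: ~{ratio:.1f} tokens/param, "
              f"heuristic budget ~{budget / 1e12:.2f}T tokens")
```

With these figures GPT-3 lands under 2 tokens per parameter against a heuristic budget of about 3.5T tokens, while the 8B Llama sits far above 20 tokens per parameter, which is the sense in which the older model was undertrained and the newer small ones are deliberately trained well past that point.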
I don't think that's ever been shared, but its predecessor, GPT-3 Davinci, was 175B.

One of the most exciting trends of the past year has been models getting dramatically smaller while maintaining similar levels of capability.

It's a marketing term for the idea that quality over quantity in training data will lead to smaller models that work as well as larger models.