| HN Mirror

	by travisgriggs 593 days ago
	Where have I been? What is a “small” language model? Wikipedia just talks about LLMs. Is this a sort of spectrum? Are there medium language models? Or is it a more nuanced classifier?

4 comments

hagen_dogs 593 days ago

I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). iirc this was also the inspiration for the Phi family of models. The essential point (of the TinyStories paper) was: "if we train a model on text meant for 3-4 year olds, since that's much simpler, shouldn't we need fewer parameters?" Which is correct. In the original they have a model that's 32 Million parameters, they compare it to GPT-2 (1.5 Billion parameters), and the 32M model does much better. Microsoft has been interested in this because "smaller models == less resource usage", which means they can run on consumer devices. You can easily run TinyStories from your phone, which is presumably what Microsoft wants to do too.

dboreham 593 days ago

There are all sizes of models from a few GB to hundreds of GB. Small presumably means small enough to run on end-user hardware.

narag 593 days ago

7B vs 70B parameters... I think. The small ones fit in the memory of consumer grade cards. That's what I more or less know (waiting for my new computer to arrive this week)
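
A rough back-of-the-envelope sketch of why 7B fits and 70B doesn't: memory for the weights alone is approximately parameter count times bytes per parameter. The 24 GB card size, the helper names, and the precision table are assumptions for illustration. Real usage is higher once activations and the KV cache are included.

```python
# Approximate bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights only, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(n_params: float, dtype: str, vram_gb: float) -> bool:
    """True if the weights alone fit in the given amount of VRAM."""
    return weight_memory_gb(n_params, dtype) <= vram_gb


if __name__ == "__main__":
    VRAM_GB = 24  # assumed consumer-grade card, for illustration only
    for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
        for dtype in ("fp16", "int4"):
            gb = weight_memory_gb(n_params, dtype)
            verdict = "fits" if fits(n_params, dtype, VRAM_GB) else "does not fit"
            print(f"{name} @ {dtype}: ~{gb:.1f} GB, {verdict} in {VRAM_GB} GB")
```

By this estimate a 7B model needs about 14 GB at fp16, while a 70B model needs about 140 GB at fp16 and still about 35 GB when quantized to 4 bits.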

agnishom 593 days ago

How many parameters did ChatGPT have in Dec 2022 when it first broke into mainstream news?

reissbaker 593 days ago

GPT-3 had 175B, and the original ChatGPT was probably just a GPT-3 finetune (although they called it gpt-3.5, so it could have been different). However, it was severely undertrained. Llama-3.1-8B is better in most ways than the original ChatGPT; a well-trained ~70B usually feels GPT-4-level. The latest Llama release, llama-3.3-70b, goes toe-to-toe even with much larger models (albeit it is bad at coding, like all Llama models so far; that's not inherent to the size, since Qwen is good, so I'm hoping the Llama 4 series is trained on more coding tokens).

swyx 593 days ago

> However, it was severely undertrained

by modern standards. at the time, it was trained according to neural scaling laws oai believed to hold.

reissbaker 590 days ago

Sure, at the time everyone misunderstood Chinchilla. Nonetheless it was severely undertrained, even if they didn't know it back then.
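
To put a rough number on "undertrained", here is a minimal sketch comparing training tokens per parameter. It assumes the GPT-3 paper's reported figures (175B parameters, about 300B training tokens) and the commonly cited Chinchilla rule of thumb of roughly 20 tokens per parameter. The function names are hypothetical.

```python
# Rule of thumb usually attributed to the Chinchilla paper (an approximation,
# not an exact law): compute-optimal training uses ~20 tokens per parameter.
CHINCHILLA_TOKENS_PER_PARAM = 20


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens the model saw per parameter."""
    return n_tokens / n_params


def chinchilla_tokens(n_params: float) -> float:
    """Training tokens suggested by the ~20 tokens/parameter heuristic."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params


if __name__ == "__main__":
    gpt3_params = 175e9  # reported parameter count
    gpt3_tokens = 300e9  # approximate reported training tokens

    ratio = tokens_per_param(gpt3_params, gpt3_tokens)
    suggested = chinchilla_tokens(gpt3_params)
    print(f"GPT-3: ~{ratio:.2f} tokens per parameter")
    print(f"Heuristic suggests ~{suggested / 1e12:.1f}T tokens for 175B parameters")
```

Under these assumptions GPT-3 saw fewer than 2 tokens per parameter, against roughly 3.5T tokens that the heuristic would suggest for a model of that size.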

simonw 593 days ago

I don't think that's ever been shared, but its predecessor GPT-3 Davinci was 175B.

One of the most exciting trends of the past year has been models getting dramatically smaller while maintaining similar levels of capability.

tbrownaw 593 days ago

It's a marketing term for the idea that quality over quantity in training data will lead to smaller models that work as well as larger models.
