HN Mirror: Hacker News

	by GaggiX 1061 days ago
	>It means you can train a chinchilla-optimal TinyLlama (1.1B param, 22B tokens) in 32 hours with 8 A100.

	They are training the model on 3000/22 ≈ 136 times the Chinchilla-optimal number of tokens. It will be interesting to see how much it improves when trained that far beyond the optimum.
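
For anyone who wants to check the arithmetic, here is a rough sketch using only the figures quoted in the post (1.1B parameters, 22B tokens, 3000B planned tokens, 32 hours on 8 A100s). The wall-clock extrapolation is an assumption: it supposes throughput stays constant on the same 8 GPUs for the whole run, which the post does not claim.

```python
# Figures quoted in the post above.
PARAMS = 1.1e9                # model parameters
CHINCHILLA_TOKENS = 22e9      # tokens quoted as chinchilla-optimal for this size
PLANNED_TOKENS = 3000e9       # the "3000" in 3000/22, in tokens
HOURS_FOR_CHINCHILLA = 32     # quoted wall-clock time for the 22B-token run
NUM_GPUS = 8                  # quoted A100 count

# Tokens per parameter implied by the quoted "chinchilla-optimal" pair.
tokens_per_param = CHINCHILLA_TOKENS / PARAMS

# How many times past that token count the planned run goes.
overtrain_factor = PLANNED_TOKENS / CHINCHILLA_TOKENS

# Assumption (not from the post): constant throughput on the same 8 GPUs.
hours_full_run = HOURS_FOR_CHINCHILLA * overtrain_factor
gpu_hours_full_run = hours_full_run * NUM_GPUS

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"overtraining factor:  {overtrain_factor:.1f}x")
print(f"extrapolated run:     {hours_full_run / 24:.0f} days on {NUM_GPUS} GPUs "
      f"({gpu_hours_full_run:,.0f} GPU-hours)")
```

The quoted pair works out to about 20 tokens per parameter, which is the usual shorthand for the Chinchilla ratio.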

4 comments

npsomaratna 1061 days ago

Possibly a lot. See: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

isoprophlex 1061 days ago

Very interesting, thanks for sharing!

pluijzer 1061 days ago

I now understand that the technobabble in Star Trek wasn't that well predicted: in the future we will not be reversing polarities by aligning field cores. Picard will have us align our llamas with chihuahuas to get an alpacafied chinchilla model.

elpocko 1061 days ago

Lora and Alpaca at Tanagra.

DarmokJalad1701 1061 days ago

Llama, when the loss fell.

kmlx 1061 days ago

from this episode if i’m not mistaken: https://en.m.wikipedia.org/wiki/Darmok

i watched that series so many times…

DarmokJalad1701 1061 days ago

Hence my username.

koprulusector 1061 days ago

There should also be a tribble in there, somewhere.

sp332 1061 days ago

Chinchilla predicts that you could get lower loss by training a larger model with that amount of data. But the model size in this case was chosen for other reasons, mostly speed of inference and cost of fine-tuning. So it's just irrelevant here.

GaggiX 1061 days ago

Well, it's relevant if you want to compare this parameter-bound model against one trained compute-optimally with the same amount of compute, to see how much you're trading off.
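
To make that comparison concrete, here is a minimal sketch of what "the same amount of compute, trained optimally" would look like. It rests on two assumptions that are not stated in the thread: the common approximation that training costs about 6 × parameters × tokens FLOPs, and the roughly 20 tokens-per-parameter ratio implied by the 1.1B/22B figures quoted at the top. The function names are made up for illustration, and the sketch sizes the alternative model only; it does not predict loss.

```python
import math


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6.0 * params * tokens


def compute_optimal(flops: float, tokens_per_param: float = 20.0):
    """Model size and token count that spend `flops` at a fixed tokens/param ratio.

    With D = r * N and C = 6 * N * D, we get N = sqrt(C / (6 * r)).
    """
    params = math.sqrt(flops / (6.0 * tokens_per_param))
    return params, tokens_per_param * params


# Budget of the run discussed in the thread: 1.1B parameters on 3000B tokens.
budget = training_flops(1.1e9, 3000e9)
opt_params, opt_tokens = compute_optimal(budget)

print(f"compute budget:   {budget:.2e} FLOPs")
print(f"optimal params:   {opt_params / 1e9:.1f}B")
print(f"optimal tokens:   {opt_tokens / 1e9:.0f}B")
```

Under those assumptions the same budget would go to a model of roughly 13B parameters trained on a few hundred billion tokens, which is the trade sp332 describes: a lower-loss model in principle, but one that is far more expensive to serve and fine-tune.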

cypress66 1061 days ago

It's a bit amusing how people treat the Chinchilla scaling laws as laws of nature, when they're just about a certain architecture and dataset.
