


	by cs-fan-101 1166 days ago
	Recently, we announced in this post (https://news.ycombinator.com/item?id=35343763#35345980) the release of Cerebras-GPT — a family of open-source GPT models trained on the Pile dataset using the Chinchilla formula. Today, we are excited to announce the availability of the Cerebras-GPT research paper on arXiv.
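For readers unfamiliar with the "Chinchilla formula" mentioned above: it is commonly summarized as roughly 20 training tokens per model parameter, and training compute is often approximated as 6 × parameters × tokens FLOPs. The sketch below is illustrative arithmetic only, not code from the paper. The function name `chinchilla_budget` is hypothetical, the constants 20 and 6 are assumed rules of thumb, and the 13B size is the largest model mentioned in this thread.

```python
def chinchilla_budget(n_params: int, tokens_per_param: int = 20) -> tuple[int, int]:
    """Estimate a compute-optimal token count and training FLOPs.

    Assumptions (rules of thumb, not taken from the paper's code):
      - about 20 training tokens per parameter
      - training FLOPs ~= 6 * parameters * tokens
    """
    tokens = n_params * tokens_per_param
    flops = 6 * n_params * tokens
    return tokens, flops


# 13B is the largest model size mentioned in this thread.
tokens, flops = chinchilla_budget(13_000_000_000)
print(f"tokens: {tokens:.2e}, approx. training FLOPs: {flops:.2e}")
```

Because tokens scale with parameters under this rule, estimated compute grows with the square of the parameter count, which is one reason much larger models are far more expensive to train.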

3 comments

krasin 1166 days ago

Thank you for open sourcing these models!

I noticed that the models are relatively small (13B max). Is that an inherent limitation, or is training a bigger model possible and just not done in this exercise?

runnerup 1166 days ago

Someone else can answer this better than I, so I'll probably end up deleting this in an hour or two. But I think the purpose of this research was not to create an excellent GPT model. I believe it was to explore the scaling effects on Cerebras hardware and determine a helpful framework for compute-optimal training regimes so that customers who might use Cerebras hardware can be confident that:

1) Standard AI/ML scaling assumptions still apply on this hardware.

2) They have a starting point for hyper-parameter estimation and can get better results sooner.

krasin 1166 days ago

> But I think the purpose of this research was not to create an excellent GPT model.

Yes, understood. I feel that this phrase is a response to the other commenter who suggested that Cerebras should release a ChatGPT-competitive model. I don't think that's easy, and I don't think it's a focus for a hardware maker such as Cerebras.

> I believe it was to explore the scaling effects on Cerebras hardware and determine a helpful framework for compute-optimal training regimes so that customers who might use Cerebras hardware can be confident that:

> 1) Standard AI/ML scaling assumptions still apply on this hardware.

This is my point. Is it possible to train a 100B model on Cerebras hardware? 500B? For the purpose of demonstrating the hardware's capabilities, model quality is secondary to whether training at that scale is possible at all.

bee_rider 1166 days ago

> Maximal Update Parameterization (μP)

The use of μ (mu) as a sort of… pun acronym thing is pretty clever, nice one.

groodt 1166 days ago

Thanks for publishing this. I quickly skimmed the paper and saw the impressive linear scaling as you scaled to 16 nodes. How long did it take to train the various models in wall-clock time?
