HN Mirror



	by nathanasmith 699 days ago
	They also said in the paper that 405B was only trained to "compute-optimal", unlike the smaller models, which were trained well past that point. That indicates the larger model still had some runway, so had they continued training, it would have kept getting stronger.

1 comment

moffkalast 699 days ago

Makes sense, right? Otherwise, why make a model so large that nobody can conceivably run it, if not to optimize for performance on a limited dataset/compute budget? It was always a distillation source model, not a production one.
