Hacker News new | ask | show | jobs
by nathanasmith 699 days ago
They also said in the paper that 405B was only trained to "compute-optimal" unlike the smaller models that were trained well past that point indicating the larger model still had some runway so had they continued it would have kept getting stronger.
1 comments

Makes sense right? Otherwise why make a model so large that nobody can conceivably run it if not to optimize for performance on a limited dataset/compute? It was always a distillation source model, not a production one.