by paradite 473 days ago
My burning question: Why not also make a slightly larger model (100B) that could perform even better?

Is there some bottleneck there that prevents RL from scaling up performance to a larger non-MoE model?

2 comments

They have a larger model that is in preview and still training.