Hacker News new | ask | show | jobs
by fudged71 92 days ago
This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
1 comments

Probably constrained by training resources. It's much easier to experiment with a smaller architecture. You may need many training runs to figure out hyperparameters for example. If each run needs multiple GPUs for a week the cost adds up quickly. I think it makes a lot of sense to start small.