This is really promising. Are they now going to scale this up to hundreds of billions of parameters? Why stop at 1.5B if they found a potentially SOTA architecture?
Probably constrained by training resources. It's much easier to experiment with a smaller architecture. You may need many training runs to figure out hyperparameters, for example. If each run needs multiple GPUs for a week, the cost adds up quickly. I think it makes a lot of sense to start small.
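
To put rough numbers on "adds up quickly", here is a back-of-envelope sketch. Everything in it is an assumption for illustration: the function name `sweep_gpu_hours` is made up, and the run count, GPU count, and duration are hypothetical, not figures from the paper.

```python
def sweep_gpu_hours(num_runs: int, gpus_per_run: int, days_per_run: float) -> float:
    """Total GPU-hours for a sweep in which every run uses the same resources."""
    return num_runs * gpus_per_run * days_per_run * 24


# Hypothetical sweep: 20 runs, 8 GPUs per run, 7 days per run.
hours = sweep_gpu_hours(num_runs=20, gpus_per_run=8, days_per_run=7)
print(f"{hours:,.0f} GPU-hours")  # 26,880 GPU-hours
```

The total grows linearly with the number of runs, GPUs per run, and days per run, so a model that needs more GPUs for longer multiplies the cost of every hyperparameter you want to try.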