It’s probably because the benchmark isn’t optimized for TPU Pods. Check out the "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes" paper for how you need to rethink the training regime to take advantage of pods.
Yes, Cloud TPU Pods are designed to train much larger models on much larger datasets. And, as you mention, if you are willing to adjust your model architectures and training algorithms to take full advantage of the hardware, you can sometimes achieve substantial gains.