by shoyer
2338 days ago
I think TPUs are great, but I don't understand what you mean by saying they can support "300 GB backprop" on a single device. A TPU v3 has 16 GB of high-bandwidth memory per TPU core:
https://cloud.google.com/tpu/docs/system-architecture
Sure, you can network together a bunch of TPUs to get access to more memory (in either a data-parallel or model-parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.
|
Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.
When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.
I use this technique regularly. All you have to do is wrap the ops in a device scope:
  with tf.device(None):
    # ops go here
The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models.
(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)
For example, right now we're training GPT-2 117M with a 25k context window on 47 TPUv3-8's: https://tensorboard.dev/experiment/idXs4PGOTEe1Jl6g3tq4qA/
A 25k context window is far, far out of reach of any GPU for GPT-2.
You can verify this is true by fine-tuning GPT-2 1.5B on Colab using a TPUv2-8: https://colab.research.google.com/drive/1BXry0kcm869-RVHHiY6...
If a TPUv2-8 only had access to 8GB, it would be impossible to train GPT-2 1.5B, let alone using Adam with a batch size > 1.
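To make the 8GB argument concrete, here is a rough back-of-the-envelope sketch. It assumes a nominal 1.5e9 parameters, float32 storage for everything, and that Adam needs four values per parameter (weight, gradient, and two moment estimates). It ignores activations and any sharding across cores, so treat the numbers as a lower-bound illustration rather than a measurement. The function and constant names are made up for this example.

```python
# Back-of-the-envelope memory estimate; every number here is an assumption.
PARAMS = 1.5e9          # nominal GPT-2 1.5B parameter count
BYTES_PER_VALUE = 4     # assume float32 everywhere

# Adam keeps weights, gradients, and two moment estimates per parameter.
COPIES_PER_PARAM = 4


def training_state_gb(params=PARAMS,
                      bytes_per_value=BYTES_PER_VALUE,
                      copies=COPIES_PER_PARAM):
    """Size in GB (10**9 bytes) of weights + gradients + Adam moments."""
    return params * bytes_per_value * copies / 1e9


weights_gb = PARAMS * BYTES_PER_VALUE / 1e9
total_gb = training_state_gb()

print(f"weights alone: {weights_gb:.1f} GB")
print(f"weights + grads + Adam state: {total_gb:.1f} GB")
print("fits in 8 GB:", total_gb <= 8)
```

Under these assumptions the optimizer state alone comes to roughly three times the 8GB available to a single TPUv2 core, before counting a single activation.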
EDIT: Here's a simpler notebook: https://colab.research.google.com/drive/1ohuxvB7nuvcjpLLIF1L...
GPT-2 1.5B with Adam + batch size 4 works great on a TPUv2-8. https://i.imgur.com/w8T5CQI.png