| Proof that a TPUv2-8 can do 300GB of backprop: https://twitter.com/theshawwn/status/1196183733755355138

Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU. When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. That's distinct from running on the TPU cores, which gives you only 8GB per core for TPUv2 and 16GB per core for TPUv3, as you say.

I use this technique regularly. All you have to do is wrap your ops like this:

  with tf.device(None):
    # ops go here

The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models. (I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)

For example, right now we're training GPT-2 117M with a 25k context window on 47 TPUv3-8s: https://tensorboard.dev/experiment/idXs4PGOTEe1Jl6g3tq4qA/

A 25k context window is far, far out of reach of any GPU for GPT-2.

You can verify this yourself by fine-tuning GPT-2 1.5B on Colab using a TPUv2-8: https://colab.research.google.com/drive/1BXry0kcm869-RVHHiY6...

If a TPUv2-8 only had access to 8GB, it would be impossible to train GPT-2 1.5B at all, let alone with Adam and a batch size > 1.

EDIT: Here's a simpler notebook: https://colab.research.google.com/drive/1ohuxvB7nuvcjpLLIF1L...

!git clone https://github.com/shawwn/gpt-2 /content/gpt-2
%cd gpt-2
!pip3 install -r requirements.txt
!python3 download_model.py 1558M
!python3 train.py --dataset train.py --model_name 1558M --optimizer adam --batch_size 4
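
For a rough sense of why 8GB could not be enough for this run, here is a back-of-the-envelope sketch of the training state alone. The assumptions are mine, not measurements from the notebook: float32 everywhere, roughly 1.558B parameters (taken from the "1558M" model name), and only weights, gradients, and Adam's two moment buffers counted. Activations are ignored, so the real footprint is larger.

```python
# Rough memory estimate for training GPT-2 1.5B with Adam.
# Assumptions (hypothetical, for illustration only):
#   - every value is stored as float32 (4 bytes)
#   - parameter count is approximated from the "1558M" model name
#   - activations and framework overhead are NOT counted

PARAMS = 1_558_000_000   # approximate parameter count for the 1558M model
BYTES_PER_FLOAT32 = 4
GB = 1e9                 # decimal gigabytes

def training_state_gb(num_params, bytes_per_value=BYTES_PER_FLOAT32):
    """GB needed for weights + gradients + Adam first and second moments."""
    copies = 4  # weights, gradients, Adam m, Adam v
    return num_params * bytes_per_value * copies / GB

weights_gb = PARAMS * BYTES_PER_FLOAT32 / GB
state_gb = training_state_gb(PARAMS)

print(f"weights only:            {weights_gb:.1f} GB")
print(f"weights + grads + Adam:  {state_gb:.1f} GB")
print(f"fits in 8 GB:            {state_gb <= 8}")
print(f"fits in 300 GB:          {state_gb <= 300}")
```

Under those assumptions the weights alone take about 6.2GB and the full Adam training state about 25GB, before any activations for the batch are counted.
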
GPT-2 1.5B with Adam + batch size 4 works great on a TPUv2-8. https://i.imgur.com/w8T5CQI.png |