| HN Mirror


by sillysaurusx 2338 days ago

Proof that a TPUv2-8 can do 300GB of backprop: https://twitter.com/theshawwn/status/1196183733755355138

Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.

When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.

I use this technique regularly. All you have to do is put your ops inside a "with tf.device(None):" block.

The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations; I have no idea why it isn't used for more. We use it for actual backprop on massive models.

(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)

For example, right now we're training GPT-2 117M with a 25k context window on 47 TPUv3-8's: https://tensorboard.dev/experiment/idXs4PGOTEe1Jl6g3tq4qA/

A 25k context window is far, far out of reach of any GPU for GPT-2.

You can verify this is true by fine-tuning GPT-2 1.5B on Colab using a TPUv2-8: https://colab.research.google.com/drive/1BXry0kcm869-RVHHiY6...

If a TPUv2-8 only had access to 8GB, it would be impossible to train GPT-2 1.5B, let alone using Adam with a batch size > 1.
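As a rough sanity check on that claim, here is a back-of-the-envelope sketch of the memory needed just for the weights, gradients, and Adam's two moment buffers. It assumes fp32 (4 bytes per value) and takes the parameter count from the "1558M" model name; activations are ignored, so the real footprint is larger. The function name is made up for illustration.

```python
# Rough estimate only: assumes fp32 storage and ignores activations entirely.
BYTES_PER_FP32 = 4


def adam_training_state_gb(num_params):
    """Return GB needed for weights + gradients + Adam first/second moments."""
    copies = 4  # weights, gradients, Adam m, Adam v
    return num_params * BYTES_PER_FP32 * copies / 1e9


gpt2_1558m_params = 1_558_000_000  # taken from the "1558M" model name
state_gb = adam_training_state_gb(gpt2_1558m_params)
print(f"{state_gb:.1f} GB")  # 24.9 GB

tpu_v2_core_gb = 8  # per-core figure quoted earlier in this comment
print(state_gb > tpu_v2_core_gb)  # True
```

Even before counting activations, the optimizer state alone is about three times what a single TPUv2 core offers under these assumptions.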

EDIT: Here's a simpler notebook: https://colab.research.google.com/drive/1ohuxvB7nuvcjpLLIF1L...

  !git clone https://github.com/shawwn/gpt-2 /content/gpt-2
  %cd /content/gpt-2
  !pip3 install -r requirements.txt
  !python3 download_model.py 1558M
  !python3 train.py --dataset train.py --model_name 1558M --optimizer adam --batch_size 4

GPT-2 1.5B with Adam + batch size 4 works great on a TPUv2-8. https://i.imgur.com/w8T5CQI.png

2 comments

p1esk 2337 days ago

I don't get your excitement. How is this different from using an 8x GPU box? If you use eight Quadro 8000 cards you have access to 384GB of memory to train your models.


sillysaurusx 2337 days ago

Mostly because TPUs are in reach of hobbyists. After all, it runs on Colab for free.

In a business context, TPUs seem far cheaper. A preemptible TPUv2-8 only costs $1.35/hr. It looks like 8x Quadro 8000's would cost >$40k.


p1esk 2337 days ago

Colab is great, can’t argue with free, but in a business context if you look here https://cloud.google.com/tpu/pricing#pricing_example_using_a...

the TPU equivalent of 8x Quadro 8000 would be something between a TPU v2-32 and a TPU v3-32, and the monthly cost of a TPU v2-32 is ~$8k. Plus the cost of a beefy VM. Assuming the GPU build sets you back ~$60k, it will start saving you $8k/mo after 6 months.
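For reference, the break-even arithmetic can be sketched as below. The ~$60k build cost and ~$8k/month TPU cost are the figures from this comment; the VM cost is not given, so it is left as a parameter, and the $2k/month value is purely an assumed number. The function name is only illustrative, and running costs for the GPU box (power, hosting) are ignored.

```python
def breakeven_months(gpu_build_cost, tpu_monthly_cost, vm_monthly_cost=0.0):
    """Months of cloud spend needed to equal the one-time GPU build cost."""
    return gpu_build_cost / (tpu_monthly_cost + vm_monthly_cost)


# TPU cost alone, using the figures quoted in the comment.
print(breakeven_months(60_000, 8_000))         # 7.5

# With an assumed $2k/month VM on top of the TPU cost.
print(breakeven_months(60_000, 8_000, 2_000))  # 6.0
```

So the 6-month figure holds if the VM adds roughly $2k/month; on TPU cost alone it is closer to 7.5 months.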


sillysaurusx 2337 days ago

A single TPUv2-8 matches 8x Quadro 8000 in terms of available memory. (Sort of; the available memory is 300GB, whereas for 8x Quadro 8000 it's 384GB.)

TPU pods actually don't require a beefy VM; I'm using a 2GB RAM one.


p1esk 2337 days ago

In the link I posted, a TPU v2-8 has 64GB of total memory and a v2-32 has 256GB.

As for the beefy VM: can you do heavy data preprocessing on TPUs? For example, elastic distortions or scaling for images? Probably not, because that usually involves OpenCV or similar libraries.


sillysaurusx 2337 days ago

The link is talking about per-core memory. A TPUv2-8 has 300GB system memory, which you can use for training. You can verify this using the notebooks above.

(If a TPUv2-8 has 64GB memory, how can it fine-tune GPT-2 1.5B using Adam with batch size 4? That requires almost 300GB.)


octbash 2337 days ago

How are you using GPT-2 with an expanded context window? I was under the impression that the maximum context window was fixed.


sillysaurusx 2337 days ago

I wrote code to repeat the wpe variable N times along the context axis during model load time.

Specifically, the code checks whether the variable's shape in the model is larger than its shape in the snapshot on disk. If so, it repeats the snapshot's values N times to fill the larger expected shape.

At that point, you can just set context window to a larger value, then train.
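A minimal sketch of that idea, using plain Python lists in place of the real tensor. This is not the actual code from the repo; the function name, the ceil-and-trim handling of lengths that aren't an exact multiple, and the toy 4x2 table are all assumptions for illustration.

```python
def expand_position_table(saved_rows, target_len):
    """Repeat a saved position-embedding table along the position axis.

    saved_rows: list of embedding rows loaded from the snapshot.
    target_len: number of positions the new (larger) model expects.
    """
    saved_len = len(saved_rows)
    if target_len <= saved_len:
        # Model is not larger than the snapshot: nothing to expand.
        return [list(row) for row in saved_rows]
    repeats = -(-target_len // saved_len)  # ceiling division
    tiled = [list(row) for _ in range(repeats) for row in saved_rows]
    return tiled[:target_len]  # trim if target_len is not an exact multiple


# Toy stand-in for wpe: 4 positions, embedding width 2.
wpe = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
expanded = expand_position_table(wpe, 12)  # 3x the original context

print(len(expanded))               # 12
print(expanded[0] == expanded[4])  # True: positions 0 and 4 share an embedding
```

Until the expanded table is trained further, positions that are a multiple of the original length apart get identical embeddings.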


octbash 2337 days ago

Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)


sillysaurusx 2333 days ago

Yeah, it seems to "work" in the sense that if you repeat wpe, which has shape (1024, 768), three times so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.

Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.

The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.
