by sillysaurusx 2338 days ago
Because TPUs are the only way to fit 300GB backprop onto a single device.

You literally can't train models on GPUs when they require 300GB for backprop. Not unless you do model parallelization, which isn't always possible (and is significantly more engineering effort than "just run the model").

When you have policies like this, you lose out on such advantages. Especially for infrastructure purposes.


I think TPUs are great, but I don't understand what you mean by saying they can support "300 GB backprop" on a single device.

A TPU v3 has 16 GB of high-bandwidth memory per TPU core: https://cloud.google.com/tpu/docs/system-architecture

Sure, you can network together a bunch of TPUs to get access to more memory (in either a data parallel or model parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.

Proof that a TPUv2-8 can do 300GB of backprop: https://twitter.com/theshawwn/status/1196183733755355138

Think of a TPU as a box with a CPU, RAM, and eight GPUs. In the same way that you can run code on either the GPUs or the CPU, you can run code on the TPU's CPU.

When you run code on the TPU's CPU, you have access to up to 300GB before OOMing. It's distinct from running on the TPU cores, which gives you only 8GB for TPUv2 and 16GB for TPUv3, as you say.

I use this technique regularly. All you have to do is wrap the ops in a "with tf.device(None):" block.

The TPU's CPU is pretty fast. Normally it's only used for input pipeline transformations. I have no idea why. We use it for actual backprop on massive models.

(I call this "coreless mode" because "TPU's CPU" is a confusing mouthful.)

For example, right now we're training GPT-2 117M with a 25k context window on 47 TPUv3-8's: https://tensorboard.dev/experiment/idXs4PGOTEe1Jl6g3tq4qA/

25k context window is far, far out of reach of any GPU for GPT-2.

You can verify this is true by fine-tuning GPT-2 1.5B on Colab using a TPUv2-8: https://colab.research.google.com/drive/1BXry0kcm869-RVHHiY6...

If a TPUv2-8 only had access to 8GB, it would be impossible to train GPT-2 1.5B, let alone using Adam with a batch size > 1.
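A rough back-of-the-envelope check of that claim, as a sketch only. It assumes fp32 storage for the weights, the gradients, and Adam's two moment buffers, takes the parameter count from the "1558M" model name, and ignores activations entirely, so real usage is higher. The function name is made up for illustration.

```python
def adam_training_bytes(n_params, bytes_per_value=4):
    """Lower bound on training memory: weights + gradients + Adam m and v.

    Assumes every tensor is stored at bytes_per_value (fp32 by default)
    and ignores activations, so real usage is higher.
    """
    copies = 4  # weights, gradients, Adam first moment, Adam second moment
    return n_params * bytes_per_value * copies


n_params = 1_558_000_000  # GPT-2 "1558M"
gb = adam_training_bytes(n_params) / 1e9
print(f"{gb:.1f} GB before activations")
```

Under those assumptions the optimizer state alone comes to about 25GB, roughly three times the 8GB of a single TPUv2 core, before batch size even enters the picture.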

EDIT: Here's a simpler notebook: https://colab.research.google.com/drive/1ohuxvB7nuvcjpLLIF1L...

  !git clone https://github.com/shawwn/gpt-2 /content/gpt-2
  %cd gpt-2
  !pip3 install -r requirements.txt
  !python3 download_model.py 1558M
  !python3 train.py --dataset train.py --model_name 1558M --optimizer adam --batch_size 4

GPT-2 1.5B with Adam + batch size 4 works great on a TPUv2-8. https://i.imgur.com/w8T5CQI.png

I don't get your excitement. How is this different from using an 8x GPU box? If you use eight Quadro 8000 cards, you have access to 384GB of memory to train your models.

Mostly because TPUs are in reach of hobbyists. After all, it runs on Colab for free.

In a business context, TPUs seem far cheaper. A preemptible TPUv2-8 only costs $1.35/hr. It looks like 8x Quadro 8000's would cost >$40k.

Colab is great, can’t argue with free, but in a business context, look here: https://cloud.google.com/tpu/pricing#pricing_example_using_a...

The TPU equivalent of 8x Quadro 8000 would be something between a TPU v2-32 and a TPU v3-32, and the monthly cost of a TPU v2-32 is ~$8k. Plus the cost of a beefy VM. Assuming the GPU build sets you back ~$60k, it will start saving you $8k/mo after 6 months.
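The break-even point in that comparison can be sketched as a one-line calculation. The two dollar figures come from the comment above. The helper name is hypothetical, and the sketch deliberately ignores the VM on the cloud side and power and hosting on the GPU side.

```python
def break_even_months(upfront_cost, monthly_cloud_cost):
    """Months of cloud spend needed to equal a one-time hardware purchase."""
    return upfront_cost / monthly_cloud_cost


gpu_build = 60_000         # ~$60k for the 8x GPU box (figure from the comment)
tpu_v2_32_monthly = 8_000  # ~$8k/month for a TPU v2-32 (figure from the comment)

months = break_even_months(gpu_build, tpu_v2_32_monthly)
print(f"break-even after {months:.1f} months")
```

With only those two numbers the break-even lands at 7.5 months. Adding a VM cost to the monthly cloud bill pulls it earlier, which may be how the comment arrives at roughly 6 months.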

A single TPUv2-8 matches 8x Quadro 8000 in terms of available memory. (Sort of; the available memory is 300GB, whereas for 8x Quadro 8000 it's 384GB.)

TPU pods actually don't require a beefy VM; I'm using a 2GB RAM one.

How are you using GPT-2 with an expanded context window? I was under the impression that the maximum context window was fixed.

I wrote code to repeat the wpe variable N times along the context axis during model load time.

Specifically, the code checks whether a variable's shape in the model is larger than its shape in the snapshot on disk. If so, it repeats the snapshot's values N times to fill the larger expected shape.

At that point, you can just set context window to a larger value, then train.
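A minimal sketch of that load-time expansion, using NumPy rather than the actual training code. The helper name and its error handling are assumptions for illustration, and a random array stands in for the wpe values that would be read from the snapshot.

```python
import numpy as np


def expand_to_shape(saved, target_shape):
    """Tile a checkpoint array along axis 0 until it fills target_shape.

    Hypothetical helper: returns `saved` unchanged when the shapes already
    match, and requires the target length to be a whole multiple of the
    saved length.
    """
    target_shape = tuple(target_shape)
    if saved.shape == target_shape:
        return saved
    if saved.shape[1:] != target_shape[1:]:
        raise ValueError("only the first (context) axis may grow")
    n, remainder = divmod(target_shape[0], saved.shape[0])
    if remainder != 0 or n < 1:
        raise ValueError("target context must be a multiple of the saved context")
    # Repeat n times along axis 0, once along every other axis.
    return np.tile(saved, (n,) + (1,) * (saved.ndim - 1))


wpe = np.random.rand(1024, 768).astype(np.float32)  # stand-in for the saved wpe
big = expand_to_shape(wpe, (3072, 768))
print(big.shape)
```

Right after tiling, row 0 and row 1024 hold identical embeddings, so the expanded table only becomes useful for longer contexts once it has been trained further.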

Is that essentially repeating the position embedding? I'm surprised that works, since the model should have no way to distinguish between the (e.g.) 1st and 513th token. (If I'm understanding this correctly.)

Yeah, it seems to "work" in the sense that if you repeat a wpe of shape (1024, 768) three times, so that it becomes (3072, 768), then the model can successfully generate up to 1024 tokens.

Generating more tokens seems to work up to a point -- you can probably generate up to 1050 tokens with this technique. But at a certain point, more tokens = gibberish.

The cure is to train the new wpe layer the same way you'd train the smaller one. But this also means you don't have to start training from scratch.

A single TPU v3 has 8 cores, so that’s 128GB memory total, which is more than any single GPU currently.

The TPU software does data parallelism (in TensorFlow) transparently, and it’s somewhat easier to do model parallelism because the memory link is solid and requires no special setup / drivers. You’ll still get an OOM from XLA if you have a tensor that won’t fit in the 16GB of a single core.

TPU pods are easier to use than clusters of InfiniBand-linked Volta boxes. For TPUs you just give GCE money and make some small changes to your use of the TPU API. For the Volta cluster you’d probably need to bring your own orchestration (e.g. Horovod). So a TPU pod is currently easier for one person to use and administer.