by sillysaurusx
2338 days ago
Because TPUs are the only way to fit 300GB backprop onto a single device. You literally can't train models on GPUs when they require 300GB for backprop. Not unless you do model parallelization, which isn't always possible (and is significantly more engineering effort than "just run the model"). When you have policies like this, you lose out on such advantages. Especially for infrastructure purposes.
A TPU v3 has 16 GB of high-bandwidth memory per TPU core: https://cloud.google.com/tpu/docs/system-architecture
Sure, you can network together a bunch of TPUs to get access to more memory (in either a data parallel or model parallel way), but that doesn't give you more memory on the same chip. It's basically the same way you would do things on a GPU cluster.
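
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It takes the 300 GB figure from the parent comment and the 16 GB per core figure from the linked docs, and assumes (unrealistically) that the memory splits perfectly evenly with no replication or communication overhead. The `min_cores` helper is just a name made up for this illustration.

```python
import math

HBM_PER_CORE_GB = 16  # TPU v3 per-core HBM, per the docs linked above
BACKPROP_GB = 300     # memory figure claimed in the parent comment


def min_cores(total_gb: float, per_core_gb: float) -> int:
    """Lower bound on devices needed, assuming a perfectly even split.

    Real model-parallel setups need more than this, since some state is
    replicated and buffers are needed for communication.
    """
    return math.ceil(total_gb / per_core_gb)


cores = min_cores(BACKPROP_GB, HBM_PER_CORE_GB)
print(f"At least {cores} cores of {HBM_PER_CORE_GB} GB each for {BACKPROP_GB} GB")
```

So even in the best case the 300 GB has to be sharded across well over a dozen cores, which is the same kind of partitioning work you would do on a GPU cluster.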