|
|
|
|
|
by jrk
3036 days ago
|
|
[Edited] The top line results focus on comparing four TPUs in a rack node (which marketing cleverly named “one cloud TPU”), running ~16 bit mixed precision, to one GPU (out of 8 in a rack node), also capable of 16 bit or mixed precision, but handicapped to 32 bit IEEE 754. That is a misleading comparison. Images/$ are obviously more directly comparable, but again the emphasized comparisons are at different precision. Very different batch sizes make this significantly more misleading, still. Images/$ also only tells us that Google has chosen to look at the competition and set a competitive price; the per-die or per-package comparison is much more relevant to understand any intrinsic architectural advantage, since these are all large dies on roughly comparable process nodes. |
|
Depends on your metric, Jonathan! If you focus on the per dollar numbers, then it’s actually net favorable to the V100, because a second GPU over NVLINK won’t be as cost-efficient. If what you care about is raw throughput “in a single box”, then 8xV100 probably comes out ahead here.
Like someone else below though, I worry about the “hey wait a minute, changing the batch size just for the TPU seems unfair” and the whole “the LSTM didn’t converge” bit. Not a bad first draft, but hopefully the authors can do some more comparisons.