by oneshot908
2820 days ago
It's been this way since day 1. NVLink remains the only real Tesla differentiator (although mini NVLink is available on the new Turing consumer GPUs, so WTFever). But because none of the DL frameworks support intra-layer model parallelism, the networks we see all tend to run efficiently in data parallel: anything else would make them communication-limited, so data scientists end up building networks that aren't, chicken-and-egg style. I continue to be boggled that Alex Krizhevsky's One Weird Trick never made it to TensorFlow or anywhere else: https://arxiv.org/abs/1404.5997

I also suspect that's why so many thought leaders consider ImageNet to be solved, when what's really solved is ImageNet-1K. That leaves ~21K more outputs on the softmax of the output layer for ImageNet-22K, which, to my knowledge, is still not solved. A 22,000-wide output fed by a 4096-wide embedding is 90M+ parameters (roughly 3.5x as many parameters as the entire ResNet-50 network).

All that said, while it will always be cheaper to buy your ML peeps $10K quad-GPU workstations and upgrade their consumer GPUs whenever a brand new shiny becomes available, be aware that NVIDIA is very passive-aggressive about this, following some strange magical thinking that this is OK for academics but not OK for business. My own biased take is that it's the right solution for anyone doing research, and the cloud is the right solution for scaling it up for production. Silly me.
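For anyone who wants to check the softmax arithmetic, here is a rough back-of-the-envelope sketch. The 4096 and 22,000 widths come from the comment above; everything else is an assumption made for illustration: the ResNet-50 total is the commonly quoted approximate figure of about 25.6 million parameters, gradients are taken to be fp32, and the four-way split just mirrors a quad-GPU workstation. The helper `dense_params` is a hypothetical name, not anything from a framework.

```python
def dense_params(n_in: int, n_out: int, bias: bool = False) -> int:
    """Parameter count of a fully connected layer (weights, plus optional bias)."""
    return n_in * n_out + (n_out if bias else 0)


EMBED_WIDTH = 4096            # embedding width quoted in the comment
NUM_CLASSES = 22_000          # rounded ImageNet-22K output width from the comment
RESNET50_PARAMS = 25_600_000  # assumption: commonly quoted approximate total
NUM_GPUS = 4                  # assumption: one quad-GPU workstation
BYTES_PER_PARAM = 4           # assumption: fp32 weights and gradients

# Size of the final layer on its own, and how it compares with ResNet-50.
softmax_params = dense_params(EMBED_WIDTH, NUM_CLASSES)
ratio = softmax_params / RESNET50_PARAMS

# Pure data parallelism: every GPU holds the full layer, so the full
# gradient for it has to be exchanged on every step.
grad_bytes = softmax_params * BYTES_PER_PARAM

# Intra-layer model parallelism: split the output classes across GPUs,
# so each GPU owns only a slice of the weight matrix.
shard_params = dense_params(EMBED_WIDTH, NUM_CLASSES // NUM_GPUS)

print(f"softmax layer parameters: {softmax_params:,}")
print(f"ratio to ResNet-50:       {ratio:.2f}x")
print(f"fp32 gradient per step:   {grad_bytes / 1e6:.0f} MB")
print(f"per-GPU shard parameters: {shard_params:,}")
```

The point of the last two numbers is only to show the scale of the trade-off: under these assumptions, data parallelism moves the whole layer's gradient every step, while splitting the layer leaves each GPU with a quarter of the weights and exchanges activations instead.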