by oneshot908 2820 days ago
It's been this way since day 1. NVLINK remains the only real Tesla differentiator (although mini NVLINK is available on the new Turing consumer GPUs, so WTFever). But because none of the DL frameworks support intra-layer model parallelism, all of the networks we see tend to run efficiently in data parallel. Doing anything else would make them communication-limited, and they aren't, because data scientists end up building networks that aren't, chicken-and-egg style.

I continue to be boggled that Alex Krizhevsky's One Weird Trick never made it to TensorFlow or anywhere else:

https://arxiv.org/abs/1404.5997

I also suspect that's why so many thought leaders consider ImageNet to be solved, when what's really solved is ImageNet-1K. That leaves ~21K more outputs on the softmax of the output layer for ImageNet-22K, which, to my knowledge, is still not solved. A 22,000-wide output sourced by a 4096-wide embedding is 90M+ parameters (which is almost 4x as many parameters as in the entire ResNet-50 network).
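
For anyone who wants to check the arithmetic, here is a back-of-the-envelope sketch. The ResNet-50 figure (roughly 25.6M parameters) is an assumption based on the commonly quoted count, not something stated in this thread, and biases are ignored.

```python
# Weight count of a dense output layer: one weight per (input, output) pair.
embedding_width = 4096
num_classes = 22_000
output_layer_weights = embedding_width * num_classes

# Assumption: ResNet-50 has roughly 25.6 million parameters in total.
resnet50_params = 25_600_000

ratio = output_layer_weights / resnet50_params

print(f"output layer weights: {output_layer_weights:,}")
print(f"ratio to ResNet-50:   {ratio:.2f}x")
```

So the output layer alone comes to about 90 million weights, around 3.5x the assumed ResNet-50 total.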

All that said, while it will always be cheaper to buy your ML peeps $10K quad-GPU workstations and upgrade their consumer GPUs whenever a brand new shiny becomes available, be aware that NVIDIA is very passive-aggressive about this, following some strange magical thinking that this is OK for academics but not OK for business. My own biased take is that it's the right solution for anyone doing research, and the cloud is the right solution for scaling it up for production. Silly me.

2 comments

I think that the reason no one implements Krizhevsky's OWT (at least in normal training scripts, there's nothing stopping you from doing this in TensorFlow) is that the model parallelism in OWT is only useful where you have more weights than inputs/outputs to a layer. This was true for the FC layers in AlexNet, but hardly anyone uses large FC layers anymore.
Model parallelism is also useful in situations where your model (and/or your inputs) is so large that even with batch_size=1 it does not fit in GPU memory (especially if you're still using a 1080 Ti). However, other techniques might help here (e.g. gradient checkpointing, or dropping parts of your graph to INT8).
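
To make the "more weights than inputs/outputs" condition concrete, here is a rough sketch that compares what has to cross the interconnect per step under each scheme. It is a deliberately crude model: it ignores the number of workers, all-reduce overheads, and overlap with compute. The function name and the layer shapes (an AlexNet-style 4096x4096 FC layer, and a 3x3 conv with 256 input and output channels on a 13x13 feature map) are illustrative assumptions.

```python
def cheaper_scheme(num_weights, activations_per_example, batch_size):
    """Pick the scheme that moves fewer values per training step.

    Data parallel: workers exchange one gradient per weight.
    Model parallel: workers exchange the layer's input and output
    activations for every example in the batch.
    """
    data_parallel_traffic = num_weights
    model_parallel_traffic = batch_size * activations_per_example
    return "model" if model_parallel_traffic < data_parallel_traffic else "data"


batch_size = 128

# Fully connected layer, 4096 inputs -> 4096 outputs.
fc_choice = cheaper_scheme(4096 * 4096, 4096 + 4096, batch_size)

# 3x3 convolution, 256 -> 256 channels, on a 13x13 feature map.
conv_choice = cheaper_scheme(3 * 3 * 256 * 256, 13 * 13 * (256 + 256), batch_size)

print("FC layer:  ", fc_choice, "parallel")
print("conv layer:", conv_choice, "parallel")
```

Under these assumptions the FC layer favors model parallelism (about 16.8M gradients versus about 1M activations), while the conv layer favors data parallelism, which is the split OWT proposes.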
> 22,000-wide output sourced by a 4096-wide embedding

You will want to use hierarchical outputs in this case. Take a look at Hinton's 'Knowledge Distillation' paper.
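
A rough sketch of what a hierarchical output buys you, assuming a plain two-level scheme (pick a cluster first, then a class inside that cluster). The cluster count of 148, roughly the square root of 22,000, and the function names are illustrative assumptions and are not taken from the distillation paper.

```python
import math


def flat_dot_products(num_classes):
    """Dot products needed to score one example with a flat softmax."""
    return num_classes


def two_level_dot_products(num_classes, num_clusters):
    """Dot products needed to get the probability of one target class.

    One softmax over the clusters, then one softmax over the classes
    inside the target's cluster (clusters assumed to be equal-sized).
    """
    cluster_size = math.ceil(num_classes / num_clusters)
    return num_clusters + cluster_size


num_classes = 22_000
num_clusters = 148

flat_cost = flat_dot_products(num_classes)
hier_cost = two_level_dot_products(num_classes, num_clusters)

print(f"flat softmax: {flat_cost} dot products per example")
print(f"two-level:    {hier_cost} dot products per example")
```

Note that this cuts the per-example compute for the target class, not the parameter count: every class still needs its own weight vector, plus one per cluster.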

Sure, that's a nice approximation, but IMO it will reduce performance somewhat. It'd be nice to quantify how much, no?