by cameldrv 2819 days ago
I think the reason no one implements Krizhevsky's OWT ("One Weird Trick") in normal training scripts is that its model parallelism is only useful where a layer has more weights than inputs/outputs. (Nothing stops you from doing it in TensorFlow, though.) This was true for the FC layers in AlexNet, but hardly anyone uses large FC layers anymore.
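To put rough numbers on the weights-versus-activations point, here is a back-of-the-envelope sketch. The FC shape (9216 inputs, 4096 outputs) is AlexNet's first FC layer; the conv shape and the batch size of 128 are illustrative assumptions, and the helper names are made up for this example.

```python
def fc_counts(n_in, n_out, batch):
    """Return (weights, activations) for a fully connected layer.

    weights     ~ what data parallelism has to synchronize (gradients)
    activations ~ what model parallelism has to exchange for one batch
    """
    weights = n_in * n_out
    activations = batch * (n_in + n_out)
    return weights, activations


def conv_counts(k, c_in, c_out, h, w, batch):
    """Same counts for a k x k conv layer.

    Assumes the output has the same h x w spatial size as the input.
    """
    weights = k * k * c_in * c_out
    activations = batch * h * w * (c_in + c_out)
    return weights, activations


BATCH = 128  # assumed batch size

# AlexNet's first FC layer: 6*6*256 = 9216 inputs, 4096 outputs
fc_w, fc_a = fc_counts(9216, 4096, BATCH)

# Illustrative 3x3 conv, 256 -> 384 channels on a 13x13 feature map
conv_w, conv_a = conv_counts(3, 256, 384, 13, 13, BATCH)

print(f"FC:   {fc_w:>12,} weights vs {fc_a:>12,} activations")
print(f"conv: {conv_w:>12,} weights vs {conv_a:>12,} activations")
```

Under these assumptions the FC layer has about 22x more weights than per-batch activations, while the conv layer is the other way around, which is the regime where splitting the model stops paying off.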
1 comment

Model parallelism is also useful in situations where your model (and/or your inputs) is so large that even with batch_size=1 it does not fit in GPU memory (especially if you're still using a 1080 Ti). However, other techniques might help here (e.g. gradient checkpointing, or quantizing parts of your graph to INT8).
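For a feel of how much gradient checkpointing can save, here is a deliberately simplified sketch. It assumes a plain chain of layers whose activations are all the same size, and it counts memory in units of "one layer's activations"; the function names are made up for this example and this is not any framework's API.

```python
import math


def plain_backprop_units(n_layers):
    """Ordinary backprop keeps every layer's activations for the backward pass."""
    return n_layers


def checkpointed_units(n_layers, segment):
    """Keep one checkpoint per segment, plus the activations of the single
    segment that is being recomputed during the backward pass."""
    checkpoints = math.ceil(n_layers / segment)
    return checkpoints + segment


n = 100  # assumed network depth
k = math.isqrt(n)  # a segment length near sqrt(n) roughly minimizes the total

print(plain_backprop_units(n))   # activations held without checkpointing
print(checkpointed_units(n, k))  # activations held with checkpointing
```

The price is roughly one extra forward pass of recomputation, so this trades compute for memory rather than removing the cost.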