HN Mirror



	by cameldrv 2819 days ago
	I think the reason no one implements Krizhevsky's OWT ("one weird trick"), at least in standard training scripts (nothing stops you from doing it in TensorFlow), is that the model parallelism in OWT only pays off when a layer has more weights than inputs/outputs. That was true for the FC layers in AlexNet, but hardly anyone uses large FC layers anymore.
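
To put rough numbers on the weights-versus-activations point: a back-of-the-envelope sketch, assuming data-parallel workers exchange one gradient value per weight while model-parallel workers exchange the layer's input and output activations. The function names, the batch size of 128, and the 3x3 conv shape are illustrative assumptions; only the 9216 -> 4096 shape corresponds to AlexNet's first FC layer.

```python
def fc_layer_counts(n_in, n_out, batch_size):
    """Return (weights, activations) for one fully connected layer."""
    weights = n_in * n_out + n_out              # weight matrix plus biases
    activations = batch_size * (n_in + n_out)   # layer inputs and outputs for the batch
    return weights, activations


def conv_layer_counts(c_in, c_out, k, height, width, batch_size):
    """Return (weights, activations) for a k x k conv that keeps the spatial size."""
    weights = c_in * c_out * k * k + c_out      # filters plus biases
    activations = batch_size * (c_in + c_out) * height * width
    return weights, activations


# AlexNet's first FC layer: 256 * 6 * 6 = 9216 inputs, 4096 outputs.
fc_weights, fc_acts = fc_layer_counts(9216, 4096, batch_size=128)

# Hypothetical 3x3 conv layer, 256 -> 256 channels on a 13x13 feature map.
conv_weights, conv_acts = conv_layer_counts(256, 256, 3, 13, 13, batch_size=128)

print("fc:   weights", fc_weights, "activations", fc_acts)
print("conv: weights", conv_weights, "activations", conv_acts)
```

Under these assumptions the FC layer has far more weights than activations, so splitting it across GPUs saves traffic, while the conv layer is the other way around and is cheaper to handle with plain data parallelism.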

1 comment

p1esk 2819 days ago

Model parallelism is also useful when your model (and/or your inputs) is so large that it does not fit in GPU memory even with batch_size=1 (especially if you're still using a 1080Ti). However, other techniques might help here (e.g. gradient checkpointing, or dropping parts of your graph to INT8).
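
For anyone unfamiliar with gradient checkpointing, here is a toy sketch of the idea in plain Python, not any framework's API. It assumes a chain of scalar tanh "layers"; `grad_full`, `grad_checkpointed`, and the segment size of 4 are made-up names and values for illustration. The checkpointed version stores only the input to each segment and recomputes the activations inside a segment during the backward pass.

```python
import math


def layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(w, x):
    """dy/dx for the toy layer: w * (1 - tanh(w * x)^2)."""
    y = math.tanh(w * x)
    return w * (1.0 - y * y)


def grad_full(weights, x):
    """Standard backprop: keep every layer input. Returns (grad, values stored)."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    g = 1.0
    for w, xi in zip(reversed(weights), reversed(inputs)):
        g *= layer_grad(w, xi)
    return g, len(inputs)


def grad_checkpointed(weights, x, segment):
    """Keep only segment inputs; recompute inside each segment on the way back."""
    starts = list(range(0, len(weights), segment))
    checkpoints = []
    for start in starts:
        checkpoints.append(x)                   # save the segment's input only
        for w in weights[start:start + segment]:
            x = layer(w, x)
    g, peak = 1.0, 0
    for start, cx in zip(reversed(starts), reversed(checkpoints)):
        seg = weights[start:start + segment]
        inputs, xi = [], cx
        for w in seg:                           # recompute this segment's activations
            inputs.append(xi)
            xi = layer(w, xi)
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, xin in zip(reversed(seg), reversed(inputs)):
            g *= layer_grad(w, xin)
    return g, peak


weights = [0.5 + 0.1 * i for i in range(16)]
g_full, stored_full = grad_full(weights, 0.3)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.3, segment=4)
print(g_full, g_ckpt, stored_full, stored_ckpt)
```

The two gradients agree, but the checkpointed version holds fewer values at once, at the cost of one extra forward pass through each segment. Real frameworks apply the same trade to tensors rather than scalars.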
