This paper describes one part of the fbcunn release (the fast convolution layers implemented via FFT, with the source available at https://github.com/facebook/fbcunn/tree/master/src/cuda/fft). There's a lot more in fbcunn if you want to check it out.
In my experience, the fully connected layers are the bottleneck. The other issue was the alternating compute-heavy convolution and the IO-heavy pooling. I'm curious how this FFT implementation stacks up against cuDNN (what's the speedup like for just the convolutional layers? and then what's the overall speedup like?).