by bwasti 2479 days ago
Note that this is a layout trick and not an algorithmic one. An algorithmic speedup that works well for dense convolutions with small kernels is Winograd: https://arxiv.org/abs/1509.09308 For large kernels, an FFT-based convolution tends to help.
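
To make the Winograd idea concrete, here is a rough sketch of the smallest 1D case, usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. The function names are mine and purely illustrative; real implementations work on 2D tiles and precompute the filter transform once.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation using 4 multiplications.

    d: 4 input samples, g: 3 filter taps. Illustrative 1D sketch only.
    """
    # Filter transform; in practice this is precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # The four element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs with 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]
```

Both functions should agree up to floating-point rounding; the saving comes from trading multiplications for a few extra additions.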

Also worth keeping in mind that many modern networks use depthwise separable convolutions, which are channel-wise convolutions (no reduction over the channels, which makes them memory-bound) followed by 1x1 convolutions (which are exactly matrix multiplications, with no im2col step).
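
A minimal pure-Python sketch of that decomposition, assuming a [C][H][W] nested-list layout, stride 1 and no padding (the function names and layout are my own choices for illustration, not any framework's API). Note how the 1x1 step is just a (C_out x C) by (C x H*W) matrix product on the flattened planes:

```python
def depthwise_conv(x, k):
    """Channel-wise convolution: x is [C][H][W], k is [C][kh][kw].

    Each channel is filtered by its own kernel; nothing is summed
    across channels.
    """
    kh, kw = len(k[0]), len(k[0][0])
    oh = len(x[0]) - kh + 1
    ow = len(x[0][0]) - kw + 1
    return [
        [[sum(x[c][i + di][j + dj] * k[c][di][dj]
              for di in range(kh) for dj in range(kw))
          for j in range(ow)]
         for i in range(oh)]
        for c in range(len(x))
    ]


def pointwise_conv(x, w):
    """1x1 convolution as a plain matmul: w is [C_out][C], x is [C][H][W]."""
    h, wd = len(x[0]), len(x[0][0])
    # Flatten each channel plane: this is already the matrix operand,
    # so no im2col-style patch extraction is needed.
    flat = [[v for row in plane for v in row] for plane in x]
    prod = [[sum(w_row[c] * flat[c][p] for c in range(len(flat)))
             for p in range(h * wd)]
            for w_row in w]
    # Reshape each output row back to H x W.
    return [[row[i * wd:(i + 1) * wd] for i in range(h)] for row in prod]


def depthwise_separable_conv(x, k, w):
    """Depthwise step followed by the 1x1 channel-mixing step."""
    return pointwise_conv(depthwise_conv(x, k), w)
```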