Hacker News
by jacobn 1491 days ago
Grouped convolutions can't really run faster than groups * conv(ch/group), and I believe current implementations are already close to that bound?
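
To make the groups * conv(ch/group) bound concrete, here is a minimal pure-Python sketch that computes a grouped convolution as one independent dense convolution per group. The function names `conv1d` and `grouped_conv1d` are made up for illustration, 1D is used only to keep it short, and real libraries don't necessarily implement it as a loop like this.

```python
def conv1d(x, w):
    """Dense 1D convolution (cross-correlation), stride 1, no padding.

    x: [in_ch][length]
    w: [out_ch][in_ch][k]
    returns: [out_ch][length - k + 1]
    """
    k = len(w[0][0])
    out_len = len(x[0]) - k + 1
    return [
        [
            sum(w_o[c][j] * x[c][t + j] for c in range(len(x)) for j in range(k))
            for t in range(out_len)
        ]
        for w_o in w
    ]


def grouped_conv1d(x, w, groups):
    """Grouped 1D convolution: one dense conv per group, outputs concatenated.

    x: [in_ch][length]
    w: [out_ch][in_ch // groups][k]
    """
    in_per = len(x) // groups   # input channels seen by each group
    out_per = len(w) // groups  # output channels produced by each group
    out = []
    for g in range(groups):
        x_g = x[g * in_per:(g + 1) * in_per]
        w_g = w[g * out_per:(g + 1) * out_per]
        out += conv1d(x_g, w_g)  # each group only touches its own channels
    return out


# 4 input channels, 4 output channels, kernel size 2, 2 groups.
x = [[1, 2, 3, 4, 5], [0, 1, 0, 1, 0], [2, 2, 2, 2, 2], [5, 4, 3, 2, 1]]
w = [[[1, -1], [2, 0]], [[0, 1], [1, 1]], [[3, 0], [0, -2]], [[1, 1], [1, 1]]]
y = grouped_conv1d(x, w, groups=2)
```

The groups share no data, so they can run in parallel, but the total work is still the sum of the per-group convolutions.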

Note that for ch < O(512) (the exact threshold varies by GPU and surrounding hardware) you tend to be limited by memory-transfer speed, not by compute.

So unfortunately depthwise convolutions end up having terrible performance.
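
A rough back-of-the-envelope sketch of why this happens, assuming a stride-1, same-padded 2D conv where every tensor is read or written exactly once (an idealized cache model, not a measurement). The function name `conv_intensity` and the layer sizes are assumptions chosen for illustration. It estimates FLOPs per byte moved for a dense conv and for a depthwise conv of the same shape.

```python
def conv_intensity(c_in, c_out, h, w, k, groups=1, bytes_per_elem=4):
    """Estimated FLOPs per byte of memory traffic for a 2D convolution.

    Assumes stride 1, 'same' padding (output is h x w), and that the input,
    output, and weights are each transferred once.
    """
    in_per_group = c_in // groups
    # One multiply and one add per weight per output element.
    flops = 2 * h * w * c_out * in_per_group * k * k
    # Elements moved: input activations + output activations + weights.
    elems = h * w * c_in + h * w * c_out + c_out * in_per_group * k * k
    return flops / (elems * bytes_per_elem)


# Hypothetical layer: 256 channels, 56x56 feature map, 3x3 kernel, float32.
dense = conv_intensity(256, 256, 56, 56, 3)
depthwise = conv_intensity(256, 256, 56, 56, 3, groups=256)
print(f"dense:     {dense:.1f} FLOPs/byte")
print(f"depthwise: {depthwise:.1f} FLOPs/byte")
```

Under these assumptions the depthwise layer moves almost the same activation data as the dense one while doing 1/256 of the arithmetic, so its FLOPs-per-byte ratio is lower by more than two orders of magnitude.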


Why wouldn’t you be able to run them in parallel using CUDA? You shouldn’t be memory-transfer speed limited when group convolution layers are a part of a bigger net.

Note that pointwise 1x1 convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).

Pointwise isn’t a case of grouped conv; they’re orthogonal ideas.

You can fuse grouped convs (depthwise is a special case of grouped convs) into preceding or following layers. Maybe JAX can do this already? No clue if any library offers such an optimization out of the box.

Sorry, yes, I was replying to the post about depthwise convolution and that’s what I meant (though the naming of it is poor) - i.e. the special case of group convolutions where the number of groups is equal to the number of channels.
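
To pin down the terminology, a small sketch of the weight shape for each variant, using the (out_channels, in_channels / groups, kH, kW) layout that PyTorch's `Conv2d` uses for its weight. The helper `conv_weight_shape` is hypothetical, written only for this illustration. Kernel size and group count are independent knobs: pointwise fixes the kernel size to 1x1, depthwise fixes groups to the number of channels.

```python
def conv_weight_shape(c_in, c_out, k, groups=1):
    """Weight shape as (out_channels, in_channels // groups, k, k)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return (c_out, c_in // groups, k, k)


def num_params(shape):
    """Total number of weights for a given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


channels = 64
dense = conv_weight_shape(channels, channels, k=3)                       # groups=1
grouped = conv_weight_shape(channels, channels, k=3, groups=4)           # 4 groups
depthwise = conv_weight_shape(channels, channels, k=3, groups=channels)  # groups == channels
pointwise = conv_weight_shape(channels, channels, k=1)                   # 1x1 kernel, groups=1

for name, shape in [("dense", dense), ("grouped", grouped),
                    ("depthwise", depthwise), ("pointwise", pointwise)]:
    print(name, shape, num_params(shape))
```

In the depthwise case each output channel sees exactly one input channel, while the pointwise case still mixes all input channels.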