
I don't think I'm saying anything revolutionary or derogatory when I say that, e.g., linear algebra on big batches of small complex-valued matrices, thin/very tall matrix multiplication, or 1D complex convolutions with large filters are not on the main path for NVIDIA's engineers (I did say 'niche').

Some things are not heavily optimized by NVIDIA, and that's fine. It's a good thing, too, that they can focus their effort on what's useful to the overall community.

What I'm saying is that very often a naive hand-written kernel, optimized by a non-expert for some months, can reach better performance than library code that isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance...
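To make that concrete, here is a minimal sketch of what such a naive kernel might look like for one of the niche cases above: a big batch of small complex-valued matrix products. The 4x4 size, the name `batched_cgemm_naive`, and the one-thread-per-output-element layout are my own assumptions for illustration. It is not tuned, and it says nothing about how any library implements this.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>

// Assumed fixed size for illustration: every matrix is N x N, row-major.
constexpr int N = 4;

// Naive batched complex matrix multiply: C[b] = A[b] * B[b].
// One thread computes one output element of one matrix in the batch.
__global__ void batched_cgemm_naive(const cuFloatComplex* A,
                                    const cuFloatComplex* B,
                                    cuFloatComplex* C,
                                    int batch)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batch * N * N) return;

    int b   = idx / (N * N);   // which matrix in the batch
    int row = (idx / N) % N;   // output row
    int col = idx % N;         // output column

    const cuFloatComplex* a  = A + b * N * N;
    const cuFloatComplex* bm = B + b * N * N;

    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int k = 0; k < N; ++k)
        acc = cuCaddf(acc, cuCmulf(a[row * N + k], bm[k * N + col]));

    C[idx] = acc;
}

int main()
{
    const int batch = 1024;
    const size_t count = static_cast<size_t>(batch) * N * N;
    const size_t bytes = count * sizeof(cuFloatComplex);

    // Small integer-valued inputs so the CPU and GPU results match exactly.
    std::vector<cuFloatComplex> hA(count), hB(count), hC(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = make_cuFloatComplex(float(i % 7), float(i % 3));
        hB[i] = make_cuFloatComplex(float(i % 5), -float(i % 2));
    }

    cuFloatComplex *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = static_cast<int>((count + threads - 1) / threads);
    batched_cgemm_naive<<<blocks, threads>>>(dA, dB, dC, batch);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    // Compare against a straightforward CPU reference.
    float max_err = 0.0f;
    for (int b = 0; b < batch; ++b)
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c) {
                cuFloatComplex ref = make_cuFloatComplex(0.0f, 0.0f);
                for (int k = 0; k < N; ++k)
                    ref = cuCaddf(ref, cuCmulf(hA[(b * N + r) * N + k],
                                               hB[(b * N + k) * N + c]));
                cuFloatComplex got = hC[(b * N + r) * N + c];
                max_err = fmaxf(max_err, fabsf(cuCrealf(ref) - cuCrealf(got)));
                max_err = fmaxf(max_err, fabsf(cuCimagf(ref) - cuCimagf(got)));
            }
    printf("max abs error: %g\n", max_err);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return max_err < 1e-3f ? 0 : 1;
}
```

The "optimized for some months" part would then be the usual steps from here (shared memory or registers for the small tiles, a better memory layout across the batch, and so on), each measured against whatever the library call gives you on your actual shapes.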

I don't know about PyTorch (I was talking about niche use cases?), but TensorRT allows custom kernels, and it's worth using them and plonking in an in-house kernel if you know what your bottleneck is and no one has bothered writing a less generic version yet... again, intern-level competency (not senior CUDA optimizer).