by david-gpu 824 days ago
> I must be in a niche where we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass with internship-level competency

Congratulations, it sounds fascinating. Looking forward to seeing your contributions to pyTorch.

I don't think I'm saying anything revolutionary or derogatory when I say that e.g. linear algebra with big batches of small complex-valued matrices, or thin/very-tall matrix multiplication, or 1D-complex convolutions with large filters are not in the main path of the NVIDIA engineers (I did say 'niche').

Some things are not heavily optimized by NVIDIA; that's fine, and it's a good thing that they can focus their effort on what's useful to the overall community.

What I'm saying is that, very often, a naive hand-written kernel, optimized by a non-expert for some months, can reach better performance than library code that isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance...
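
To make that concrete, here is a minimal sketch of the kind of naive kernel described above, for one of the niche cases mentioned (a big batch of small complex-valued matrix products). Everything in it is an assumption made for illustration: the names `batched_cgemm_naive` and `batched_cgemm4`, the fixed 4x4 size, and the row-major contiguous layout are hypothetical, and no performance claim is attached. It only uses `cuComplex.h` and the CUDA runtime.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical naive kernel: C[b] = A[b] * B[b] for a batch of N x N
// complex matrices, each stored row-major and contiguous in memory.
// One thread computes one output element: no shared memory, no tiling.
template <int N>
__global__ void batched_cgemm_naive(const cuFloatComplex* A,
                                    const cuFloatComplex* B,
                                    cuFloatComplex* C, int batch)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= batch * N * N) return;

    int b = idx / (N * N);   // which matrix in the batch
    int r = (idx / N) % N;   // output row
    int c = idx % N;         // output column

    const cuFloatComplex* a  = A + b * N * N;
    const cuFloatComplex* bm = B + b * N * N;

    cuFloatComplex acc = make_cuFloatComplex(0.f, 0.f);
#pragma unroll
    for (int k = 0; k < N; ++k)
        acc = cuCaddf(acc, cuCmulf(a[r * N + k], bm[k * N + c]));

    C[idx] = acc;
}

// Hypothetical host wrapper for the 4x4 case. Error checking is
// abbreviated: only the launch and the final copy are checked.
cudaError_t batched_cgemm4(const cuFloatComplex* hA, const cuFloatComplex* hB,
                           cuFloatComplex* hC, int batch)
{
    constexpr int N = 4;
    const size_t bytes = sizeof(cuFloatComplex) * batch * N * N;

    cuFloatComplex *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (batch * N * N + threads - 1) / threads;
    batched_cgemm_naive<N><<<blocks, threads>>>(dA, dB, dC, batch);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Calling `batched_cgemm4` with every entry of A set to 1+i and every entry of B set to 1-i should give 8+0i in every entry of C, which is a quick sanity check. From a starting point like this, the usual tuning is to look at how threads map to matrices and how memory accesses coalesce for the specific shape you care about.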

I don't know about pyTorch (I was talking about niche use cases?) but TensorRT allows custom kernels, and it's worth using them and plonking in a house-implemented kernel if you know what your bottleneck is and no one has bothered writing a less generic version yet... again, intern-level competency (not senior CUDA optimizer).

Sorry, I thought this article/thread was all about pyTorch/AI and NVidia's moat in this area vs AMD and other competitors, so my comments are written in that specific context.

If I have lost track of the conversation, please accept my apologies.

Heh, I might have veered off topic too... working at NVIDIA... having serious nerd envy here ;-)