Hacker News
by david-gpu 825 days ago
I used to work at NVidia on the design of their tensor cores. As you can imagine, I had to be rather familiar with various kinds of high performance kernels that people are talking about in this thread.

I second the GP: nobody in their right mind would try to compete with the performance or functionality of libraries like cuDNN or cuBLAS.

NVidia pays for an army of exceptionally skilled folks to write these high performance kernels, working hand in hand with the architects that design the hardware, and with access to various sophisticated tools and performance models beyond what is available to the general public.

It would be like trying to compete against Olympians, to use an analogy that we can all understand.

I must be in a niche where we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass with internship-level competency, mostly because our problem-sizes are not in the cone of optimization of the Olympians of NVIDIA. Large batches of small matrices, specific matrix forms, long kernel pipelines...
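
To make the "large batches of small matrices" case concrete, here is a minimal sketch of the kind of naive kernel being described. It is not the commenter's code. The names `batched_small_matmul` and `count_mismatches`, the 4x4 size, and the back-to-back row-major layout are all assumptions made for illustration. Each thread multiplies one pair of matrices, and no claim is made here about how it compares to any library.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 4;  // hypothetical small matrix size, fixed at compile time

// Naive batched product: thread m computes C[m] = A[m] * B[m].
// Matrices are stored back to back, row-major, N*N floats each (assumed layout).
__global__ void batched_small_matmul(const float* A, const float* B,
                                     float* C, int batch) {
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= batch) return;
    const float* a = A + (size_t)m * N * N;
    const float* b = B + (size_t)m * N * N;
    float* c = C + (size_t)m * N * N;
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;  // accumulate in a register
            for (int k = 0; k < N; ++k) acc += a[i * N + k] * b[k * N + j];
            c[i * N + j] = acc;
        }
    }
}

// Runs the kernel on `batch` matrices filled with small integers (exactly
// representable in float) and returns how many output entries differ from a
// CPU reference. Returns -1 if a CUDA call fails.
int count_mismatches(int batch) {
    size_t count = (size_t)batch * N * N;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hC(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = (float)((int)(i % 7) - 3);
        hB[i] = (float)((int)(i % 5) - 2);
    }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    if (cudaMalloc(&dA, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dB, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dC, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (batch + threads - 1) / threads;
    batched_small_matmul<<<blocks, threads>>>(dA, dB, dC, batch);

    // The blocking copy waits for the kernel to finish before reading dC.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int m = 0; m < batch; ++m) {
        size_t base = (size_t)m * N * N;
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                float ref = 0.0f;
                for (int k = 0; k < N; ++k)
                    ref += hA[base + i * N + k] * hB[base + k * N + j];
                if (ref != hC[base + i * N + j]) ++bad;
            }
        }
    }
    return bad;
}

int main() {
    printf("mismatching entries: %d\n", count_mismatches(100000));
    return 0;
}
```

With this layout, neighbouring threads read addresses 16 floats apart, so the loads are not coalesced. Interleaving the batch or staging tiles through shared memory would be the obvious next step, which is the "work up to max perf" part.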

Also, until all of these libraries are made amenable to kernel fusion, or at least gain prologue/epilogue features, they can be beaten on memory bandwidth with pretty lowly-optimized kernels that keep intermediate results out of global memory.
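
A minimal sketch of the fusion argument, under assumed names (`scale`, `bias_relu`, `fused_scale_bias_relu` and `count_fused_mismatches` are hypothetical, and the scale/bias/ReLU chain is only a stand-in for a library call followed by an epilogue). The unfused path writes an intermediate buffer to global memory and reads it back. The fused kernel reads the input once and writes the output once.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Unfused step 1: tmp = alpha * x (intermediate goes to global memory).
__global__ void scale(const float* x, float* tmp, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = alpha * x[i];
}

// Unfused step 2: out = max(tmp + bias, 0) (intermediate is read back).
__global__ void bias_relu(const float* tmp, float* out, float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(tmp[i] + bias, 0.0f);
}

// Fused: same math, but the intermediate only ever lives in a register.
__global__ void fused_scale_bias_relu(const float* x, float* out,
                                      float alpha, float bias, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(alpha * x[i] + bias, 0.0f);
}

// Runs both paths on small integer inputs (exact in float) and returns the
// number of elements where they disagree. Returns -1 if a CUDA call fails.
int count_fused_mismatches(int n) {
    size_t bytes = (size_t)n * sizeof(float);
    std::vector<float> hx(n), unfused(n), fused(n);
    for (int i = 0; i < n; ++i) hx[i] = (float)(i % 11 - 5);

    float *dx = nullptr, *dtmp = nullptr, *dout = nullptr;
    if (cudaMalloc(&dx, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dtmp, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dout, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);

    const float alpha = 2.0f, bias = 0.5f;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Two launches: 2 global reads + 2 global writes per element.
    scale<<<blocks, threads>>>(dx, dtmp, alpha, n);
    bias_relu<<<blocks, threads>>>(dtmp, dout, bias, n);
    cudaError_t e1 = cudaMemcpy(unfused.data(), dout, bytes, cudaMemcpyDeviceToHost);

    // One launch: 1 global read + 1 global write per element.
    fused_scale_bias_relu<<<blocks, threads>>>(dx, dout, alpha, bias, n);
    cudaError_t e2 = cudaMemcpy(fused.data(), dout, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dx); cudaFree(dtmp); cudaFree(dout);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (unfused[i] != fused[i]) ++bad;
    return bad;
}

int main() {
    printf("elements where fused and unfused differ: %d\n",
           count_fused_mismatches(1 << 20));
    return 0;
}
```

For a bandwidth-bound chain like this, the fused version moves roughly half the bytes. That is the gap a closed library call cannot close unless it exposes some kind of epilogue hook.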

I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing but they're still not covering a big chunk of the original libraries.

Also, a lot of problems that are amenable to GPU compute are not expressed in blas/dnn terms and can still be expressed very, very simply as CUDA code, and still extract huge performance gains against CPUs, without a chance that the Olympians will ever take an interest in your problem space.

> we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass.

I know you probably don't mean to say that Nvidia can't write good CUDA, but this does sort of illustrate how hard that is. I've seen similar cases (tiny matrix multiplied by enormous matrix) in which it was possible to write something faster than Nvidia's library. I'm not sure if this has been addressed since though.

> they can be beaten on memory bandwidth with pretty lowly-optimized kernels

This is partly why I believe most CUDA code probably isn't "good" - there's this enormous gulf between acceptable and good which often isn't worth crossing.

I meant to say that they optimized deeply for known and popular use cases, and that it doesn't take an ungodly amount of expertise to perform better if your use case doesn't fit, whether because of the way you express your problem, its dimensions, or whatever else they didn't cover.

I also meant to say that the domain is full of low-hanging fruit if your problem doesn't fit what NVIDIA optimized deeply. An intern may beat the cuXXX libraries with a little work, and you can work up to max perf, yes, with serious effort.

There are probably thousands of man-hours sunk into BLAS on Intel hardware, and anyone who has seriously tried to do AVX2/AVX512 knows it's hard to reach actual max perf on all problems. Yet I don't read 'only Intel experts can write efficient code'. It's no more true for CUDA than for other parallel or memory-weird architectures I've worked on. Yes, it's different, but getting max perf has always been hard on any modern hardware.

As for the gulf between acceptable and good, the problem is similar here too: people stop when they've reached their goal or feel they can scale more efficiently by other means. I really don't see the difference with heavily optimized x86 stuff. We keep seeing new stuff you can do to improve AVX512 code or new places where you can apply it (JSON parsing, utf validation...) and it's been out for a while too. There hasn't been any free lunch there for a long, long time.

> I must be in a niche where we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass with internship-level competency

Congratulations, it sounds fascinating. Looking forward to seeing your contributions to PyTorch.

I don't think I'm saying anything revolutionary or derogatory when I say that e.g. linear algebra with big batches of small complex-valued matrices, or thin/very-tall matrix multiplication, or 1D-complex convolutions with large filters are not in the main path of the NVIDIA engineers (I did say 'niche').

Some things are not heavily optimized by NVIDIA, it's fine, and a good thing too that they can focus their effort on what's useful to the overall community.

What I'm saying is that very often a naive hand-written kernel, optimized by a non-expert for some months, can reach better performance than library code that isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance...

I don't know about PyTorch (I was talking about niche use cases?) but TensorRT allows custom kernels, and it's worth using them and plonking in a house-implemented kernel if you know what your bottleneck is and no one has bothered writing a less-generic version yet... again, intern-level competency (not senior CUDA optimizer).

Sorry, I thought this article/thread was all about PyTorch/AI and NVidia's moat in this area vs AMD and other competitors, so my comments are written in that specific context.

If I have lost track of the conversation, please accept my apologies.

Heh, I might have veered off topic too... working at NVIDIA... having serious nerd envy here ;-)