| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Smerity 2303 days ago

For high performance CUDA kernels people still need to write derivatives by hand. I know this as for my own research, and for many production systems, I'd still need to write it myself. Many of my architectures wouldn't have been possible without writing the CUDA myself (Quasi-Recurrent Neural Network[1]) or using optimized hand written black boxes (cuDNN RNN). The lack of open optimized hand written CUDA kernels has actually been an impediment to progress in the field.

Automatic differentiation allows for great flexibility and composability but the performance is still far from good, even with the various JITs available. Jax seems to be one of the most flexible and optimized for many use cases for now however.

[1]: https://github.com/salesforce/pytorch-qrnn

3 comments

shoyer 2303 days ago

Right, you still need to write derivative rules by hand for the primitive operations of an auto-diff system. Automatic differentiation provides composition, it doesn't solve the root mathematical problem of differentiating operations at the lowest level.

So yes, if need a new primitive to add an efficient CUDA kernel, you will probably also have to write its derivative manually too. JAX has a few shortcuts that occasionally make this easier but fundamentally it has the same challenge as any auto-diff system.

link

Smerity 2303 days ago

I still strongly disagree. Few of these hand written CUDA kernels outside of the frameworks are about implementing derivative rules, they're about eliminating the CUDA call overheads or avoiding the layered computational / memory inefficiencies that existing ML compilers have trouble handling.

Next to none of the frameworks are yet able to JIT you a performant RNN, yet RNNs only use very standard components[1]. OpenAI had a massive speed and memory usage boost for attention by implementing what amounts to a few standard primitives together[2].

There are massive gaps in the optimizations that existing ML compilers provide. The landscape is starting to get better but it's still filled with many pitholes.

[1]: https://twitter.com/stanfordnlp/status/1224106217192087552

[2]: https://openai.com/blog/sparse-transformer/

link

6gvONxR4sf7o 2303 days ago

It depends what you define as primitive. I've had plenty of compositions of existing primitives for which the auto-derived backprop was orders of magnitude slower than a hand written one. I didn't need to write my own backprop, but I benefited tremendously from it. I don't think my experience is particularly rare.

link

backpropaganda 2303 days ago

But is autodiff combined with a blackbox jit a real solution? The jit either works for your new model or it does not. If it does not, you can do pretty much nothing about it, other than ping jax authors or get your own hands dirty with jax internal code. Why is noone working on a usable low-level framework, where I can implement QRNN or more complicated stuff without relying on a black-box jit? Jax could have chosen to be this, but instead is a fancy solution to a non-problem.

link

6gvONxR4sf7o 2303 days ago

How has your experience with CUDA been? Is it as painful as it appears at first glance? I've done a ton of python and C, and yet whenever I look at C++ code, it just screams stay away.

But I have some almost-reasonably-performant pytorch that I'd rather not just use as a cash burning machine, so it looks like it might be time to dive into CUDA :-\

link

Smerity 2303 days ago

The CUDA I've written has never been joyous but it also hasn't been as horrific as I'd expected. There's a period of hair pulling but persistence will get you through it. The majority of CUDA code is closer to C than C++ too which is helpful. I'll be looking at diving back into CUDA in the near future given the exact speed issues we've been mentioning so feel free to get in touch.

link