Hacker News
by fancyfredbot 825 days ago
I am not sure CUDA is the moat, but yes, software is the moat.

To first order nobody writes any CUDA, and even if you do you are probably bad at it. The language is slightly easier to use than OpenCL, but writing really performant code is still a nightmare (a pipeline of asynchronous memory copies from global to shared memory is not easy to program, but it is a requirement for full performance on tensor cores).
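
For readers who have not seen the pattern, here is a minimal CPU-side sketch of a double-buffered pipeline, in Python rather than CUDA. It is only an analogy: `load_tile` stands in for the asynchronous global-to-shared copy and `compute` for the math on a tile, and both names (and the tile size) are made up for illustration. In Python the overlap is mostly notional; the point is the prologue / steady-state / epilogue structure, which in real CUDA also needs hand-managed barriers and shared-memory stages.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tile(data, start, size):
    # Hypothetical stand-in for an asynchronous copy of one tile.
    return data[start:start + size]

def compute(tile):
    # Hypothetical stand-in for the math done on a tile once it has arrived.
    return sum(x * x for x in tile)

def pipelined_sum_of_squares(data, tile_size=4):
    """Overlap the load of tile i+1 with the compute on tile i."""
    starts = list(range(0, len(data), tile_size))
    if not starts:
        return 0
    total = 0
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Prologue: start the first copy before any compute happens.
        pending = copier.submit(load_tile, data, starts[0], tile_size)
        for next_start in starts[1:]:
            tile = pending.result()  # wait for the current tile
            # Kick off the next copy, then compute while it is in flight.
            pending = copier.submit(load_tile, data, next_start, tile_size)
            total += compute(tile)
        # Epilogue: the last tile has no copy to overlap with.
        total += compute(pending.result())
    return total
```

Even in this toy form there are three distinct phases to get right, and an off-by-one in any of them either drops a tile or waits on a copy that was never issued.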

So no, the moat really isn't the language. It's not even the libraries, it's the integration of the libraries into third party software like pytorch, jax, etc. This is the truly massive advantage NVIDIA has, and they got it by being early and by being installed in an awful lot of machines.

3 comments

“To first order nobody writes any CUDA and even if you do you are probably bad at it” is such an anti-intellectual stance, and it is repeated so often that it irks me. It’s the authors protecting their ego, and it is said about everything they don’t understand. It is said about compilers, about static typing, about pretty much anything the authors do not yet know.

At least say why people wouldn’t be good at it: the documentation is poor, the GPUs are a black box, or anything in that vein. Then they can help you learn instead of preemptively dismissing it.

I used to work at NVidia on the design of their tensor cores. As you can imagine, I had to be rather familiar with various kinds of high performance kernels that people are talking about in this thread.

I second the GP: nobody in their right mind would try to compete with the performance or functionality of libraries like cuDNN or cuBLAS.

NVidia pays for an army of exceptionally skilled folks to write these high performance kernels, working hand in hand with the architects that design the hardware, and with access to various sophisticated tools and performance models beyond what is available to the general public.

It would be like trying to compete against Olympians, to use an analogy that we can all understand.

I must be in a niche where we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass with internship-level competency, mostly because our problem-sizes are not in the cone of optimization of the Olympians of NVIDIA. Large batches of small matrices, specific matrix forms, long kernel pipelines...

Also until all of these libraries are made amenable to kernel fusion or just sometimes prologue/epilogue features they can be beaten on memory bandwidth with pretty lowly-optimized kernels with no global memory traffic.
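
A back-of-the-envelope sketch of why fusion wins on bandwidth, in Python. The two-step pipeline (scale-and-shift followed by ReLU) and the traffic counters are illustrative assumptions, not taken from any specific library. The only point is that the unfused version writes an intermediate array out and reads it back, while the fused version touches each element once on the way in and once on the way out.

```python
def unfused(x, a, b):
    """Two separate 'kernels'; the intermediate makes a round trip through memory."""
    traffic = 0
    tmp = [a * v + b for v in x]       # kernel 1: read x, write tmp
    traffic += len(x) + len(tmp)
    out = [max(t, 0.0) for t in tmp]   # kernel 2: read tmp, write out
    traffic += len(tmp) + len(out)
    return out, traffic

def fused(x, a, b):
    """One 'kernel'; the intermediate stays in a register (a local variable here)."""
    out = [max(a * v + b, 0.0) for v in x]  # read x, write out
    traffic = len(x) + len(out)
    return out, traffic
```

Both functions return the same values, but the unfused one counts twice as many element transfers. For a memory-bound pipeline that ratio is roughly the speedup available, and it grows with every extra stage that gets fused.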

I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing but they're still not covering a big chunk of the original libraries.

Also, a lot of problems that are amenable to GPU compute are not expressed in blas/dnn and can still be very, very simply expressed as CUDA code, and still extract huge performance gains against CPUs, without a chance that the Olympians will ever take an interest in your problem space.

> we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass.

I know you probably don't mean to say that Nvidia can't write good CUDA, but this does sort of illustrate how hard that is. I've seen similar cases (tiny matrix multiplied by enormous matrix) in which it was possible to write something faster than Nvidia's library. I'm not sure if this has been addressed since though.

> they can be beaten on memory bandwidth with pretty lowly-optimized kernels

This is partly why I believe most CUDA code probably isn't "good" - there's this enormous gulf between acceptable and good which often isn't worth crossing.

I meant to say that they optimized deeply for known and popular use cases, and that it doesn't take an ungodly amount of expertise to perform better if your use case doesn't fit, whether because of the way you express your problem, its dimensions, or whatever else they didn't cover.

I also meant to say that the domain is full of low-hanging fruit if your problem doesn't fit what NVIDIA optimized deeply. An intern may beat the cuXXX libraries with a little work and you can work up to max perf, yes, with serious effort.

There are probably thousands of man-hours sunk into BLAS on Intel hardware, and anyone who has seriously tried AVX2/AVX512 knows it's hard to reach actual max perf on all problems. Yet I don't read 'only Intel experts can write efficient code'. It's no more true for CUDA than for other parallel or memory-weird architectures I've worked on. Yes it's different, but getting max perf has always been hard on any modern hardware.

As for the gulf between acceptable and good, the problem is similar here too: people stop when they've reached their goal or feel they can scale more efficiently by other means. I really don't see the difference with heavily optimized x86 stuff. We keep seeing new stuff you can do to improve AVX512 code or new places where you can apply it (JSON parsing, utf validation...) and it's been out for a while too. There hasn't been any free lunch there for a long, long time.

> I must be in a niche where we're consistently crushing cublas, cusolver and cudnn, sometimes cutlass with internship-level competency

Congratulations, it sounds fascinating. Looking forward to seeing your contributions to pyTorch.

I don't think I'm saying anything revolutionary or derogatory when I say that e.g. linear algebra with big batches of small complex-valued matrices, or thin/very-tall matrix multiplication, or 1D-complex convolutions with large filters are not in the main path of the NVIDIA engineers (I did say 'niche').
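
To put rough numbers on why these shapes fall outside the usual optimization path, here is a small Python estimate of arithmetic intensity (FLOPs per byte moved) for a single matrix product. It assumes real fp32 values and that each element of A, B, and C crosses the memory bus exactly once; the function name and those assumptions are mine, for illustration only.

```python
def matmul_intensity(m, n, k, bytes_per_element=4):
    """FLOPs per byte for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Assumes every element of A, B, and C crosses the memory bus exactly once,
    which is the best case for a standalone library call.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved

# Large square product: plenty of math per byte, so it can be compute-bound.
big_square = matmul_intensity(4096, 4096, 4096)

# One 4x4 product out of a big batch: under one FLOP per byte.
small_batched = matmul_intensity(4, 4, 4)

# Very tall matrix times a tiny one: also about one FLOP per byte.
tall_thin = matmul_intensity(1_000_000, 4, 4)
```

Under these assumptions the small and tall-thin cases are bound by memory bandwidth rather than by tensor-core throughput, which is why a simple kernel that avoids extra global memory traffic can compete.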

Some things are not heavily optimized by NVIDIA, it's fine, and a good thing too that they can focus their effort on what's useful to the overall community.

What I'm saying is that very often writing by hand a naive kernel, optimized by a non expert for some months, can reach better performance than library code that isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance...

I don't know about pyTorch (I was talking about niche use cases?) but TensorRT allows custom kernels, and it's worth using them and plonking in an in-house kernel if you know what your bottleneck is and no-one has bothered writing a less-generic version yet... again, intern-level competency (not senior CUDA optimizer).

Sorry, I thought this article/thread was all about pyTorch/AI and NVidia's moat in this area vs AMD and other competitors, so my comments are written in that specific context.

If I have lost track of the conversation, please accept my apologies.

I gave an example of why people wouldn't be good at it with the pipelined asynchronous memory copies. Take a look at the link below to the documentation. It's just plain difficult to do something as basic as move data into shared memory efficiently. Others have given far more detailed responses.

You probably won't like this, but I'm also going to suggest you take a look at the HN guidelines about assuming good faith, and around responding to the argument instead of calling names. My comment might have irked you but that's not actually a basis for deciding I'm anti intellectual, that I'm protecting my ego, and that I really just need someone to help me learn.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

I've worked in one of the top computing labs, with top GPU computing startups, have investor money from Nvidia, wrote CUDA for years, and hire people to write GPU code. And I would say most people -- even Nvidia employees and our own -- are individually bad at writing good CUDA code: it takes a highly multi-skilled team working together to make anything more than demoware. With most people who say they can write CUDA, when you scratch a little at the items I put below, you realize they can only do it for some basic one-offs. Think some finance person running one job for a month, not the equivalent of a senior python/java/c++ developer doing whatever reliable backend code they're hired to do that lives on.

To give a feel, while at Berkeley, we had an award-winning grad student working on autotuning CUDA kernels and empirically figuring out what does / doesn't work well on some GPUs. Nvidia engineers would come to him to learn about how their hardware and code works together for surprisingly basic scenarios.

It's difficult to write great CUDA code because it needs to excel in multiple specializations at the same time:

* It's not just writing fast low-level code, but knowing which algorithm to write. So you or your code reviewer needs to be an expert at algorithms. Worse, those algorithms are high-level, unknown to most programmers, and also specific to hardware models; think scenarios like NUMA-aware data parallel algorithms for irregular computations. The math is generally non-traditional too, e.g., esoteric matrix tricks to manipulate sparsity and numerical stability.

* You ideally will write for 1 or more generations of architectures. And each architecture changes all sorts of basic constants around memory/thread/etc counts at multiple layers of the architecture. If you're good, you also add some sort of autotuning & JIT layers around that to adjust for different generations, models, and inputs.

* This stuff needs to compose. Most folks are good at algorithms, software engineering, or performance... not all three at the same time. Doing this for parallel/concurrent code is one of the hardest areas of computer science. Ex: Maintaining determinism, thinking through memory life cycles, enabling async vs sync frameworks to call it, handling multitenancy, ... In practice, resiliency in CUDA land is ~non-existent. Overall, while there are cool projects, the Rust etc revolution hasn't happened here yet, so systems & software engineering still feels like early unix & c++ vs what we know is possible.

* AI has made it even more interesting nowadays. The types of processing on GPUs are richer now, multi+many GPU is much more of a thing, and disk IO as well. For big national lab and genAI foundation model level work, you also have to think about many racks of GPUs, not just a few nodes. While there's more tooling, the problem space is harder.
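
As a rough illustration of the autotuning layer mentioned in the list above, here is a minimal Python sketch: time the same routine under several candidate configurations and keep the fastest. `autotune` and `blocked_sum` are hypothetical names, and the block sizes are arbitrary. A real GPU autotuner would also key the choice on architecture, model, and input shape, and would cache the result.

```python
import time

def autotune(kernel, candidates, args, repeats=3, timer=time.perf_counter):
    """Return the candidate config with the lowest best-of-`repeats` run time."""
    best_config, best_time = None, float("inf")
    for config in candidates:
        elapsed = float("inf")
        for _ in range(repeats):
            start = timer()
            kernel(*args, config)
            elapsed = min(elapsed, timer() - start)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config

def blocked_sum(values, block_size):
    # Hypothetical "kernel": the result is the same for every block size,
    # only the speed differs, which is what makes it safe to autotune.
    return sum(sum(values[i:i + block_size])
               for i in range(0, len(values), block_size))

if __name__ == "__main__":
    data = list(range(100_000))
    print("fastest block size:", autotune(blocked_sum, [32, 256, 4096], (data,)))
```

The winning configuration depends on the machine and the input, which is the whole reason the search has to be empirical.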

This is very hard to build for. Our solution early on was figuring out how to raise the abstraction level so we didn't have to. In our case, we figured out how to write ~all our code as operations over dataframes that we compiled down to OpenCL/CUDA, and Nvidia thankfully picked that up with what became RAPIDS.AI. Maybe more familiar to the HN crowd, it's basically the precursor and GPU / high-performance / energy-efficient / low-latency version of what the duckdb folks recently began on the (easier) CPU side for columnar analytics.

It's hard to do all that kind of optimization, so IMO it's a bad idea for most AI/ML/etc teams to do it. At this point, it takes a company at the scale of Nvidia to properly invest in optimizing this kind of stack, and software developers should use higher-level abstractions, whether pytorch, rapids, or something else. Having lived building & using these systems for 15 years, and worked with most of the companies involved, I haven't put any of my investment dollars into AMD or Intel due to the revolving door of poor software culture.

Chip startups also have funny hubris here, where they know they need to try, but end up having hardware people run the show and fail at it. I think it's a bit different this time around b/c many can focus just on AI inferencing, and that doesn't need as much of what the above is about, at least for current generations.

Edit: If not obvious, much of our code that merits writing with CUDA in mind also merits reading research papers to understand the implications at these different levels. Imagine scheduling that into your agile sprint plan. How many people on your team regularly do that, and in multiple fields beyond whatever simple ICML pytorch layering remix happened last week?

Thanks for the insight. Looks like the principle of “doing things that don’t scale” works surprisingly well even in the ML space.

Agreed.

If there is a niche that is at the intersection of multiple specialties, and it includes GPU acceleration, there is a good chance it is ripe for a startup to get an early mover advantage. Eg, real-time foundation models for audio in non-English/non-Chinese languages that work small & offline in cars.

Unfortunately, Nvidia has a culture of open sourcing all CUDA code, so if any startup shows something works commercially, Nvidia will rewrite, likely ultimately better, and give away for free, so more companies will do it and buy more GPUs.

In your opinion, is it hopeless for something like ROCm to compete given that even CUDA is extremely hard for all parties?

What do you think about Apple's Metal?

If I was any of these companies, I'd totally invest many billions in ecosystem here. Tensorflow (Google) and pytorch (Facebook) are great examples, it can work. Otherwise, hw companies will continue to lose relevance in the growing server market, and SW companies will have an ever growing Nvidia tax.

But it's not easy for the hw co's. OpenCL was more of a hw company thing (Intel, AMD, mobile chip co's), and while they spend billions on adventures all the time, their SW leadership culture has been bad. They fail to do sustained & deep ecosystem investment, and instead look like small feudal orgs that get their projects pulled arbitrarily whenever the VPs rearrange themselves. For example, given that Intel brought back its old CEO, that was a scary signal to me for this front. Intel specifically had the internal talent, I'm not sure if they still do, just not at the management level, and definitely not culturally at the highest leadership level.

Jensen at Nvidia has always been a special CEO here, even when they were helping game companies make their engines, and I'm guessing that taught him the value of long-term vertical SW & ecosystem investment. Instead of Intel unifying on x86 and c++ (compilers, vtune, Intel tbb, ...), and letting Microsoft / Linux / DB people go higher, Jensen went all the way up the stack to get at full utilization, and unified teams internally on that over 1-2 decades.

Apple is a funnier case. I can see them doing it and then pulling the plug. Eg, Chris Lattner making Swift and then they failed to retain him, and their revolving door of frameworks overall. Internally, they do have the technical talent and $, but I don't understand the culture and commercial alignment.

Finally, I do think the increasing importance of AI inferencing, yet simultaneous simplicity of it, has opened a disruption opportunity here. We are still at a tiny % of where it is going. The ONNX, pytorch, transformers, etc ecosystem is still in its early days from that perspective. It's fast for a hardware co like Groq to port a new model. So I don't rule out big changes here, and those being used to drive the rest of the ecosystem, like your q on ROCm.

So do people write in CUDA? I assume non-ML scientists do but ML researchers don't?

I wrote some CUDA in Fortran for charged particle propagation in the Heliosphere.

Brought down some simulations from about 30min to under 1s.

ML researchers in statistics departments write stuff in R, which makes everyone scream. ML researchers absolutely do.

My point in the article was basically the class was "indoctrinating" (too strong, but you get the point) the future ML researchers in the superiority of using CUDA and spending NVIDIA company resources to continuously do so in these classes, year after year.

This hits the nail on the head. Nvidia got all the programmers excited about using their GPUs first and now they have all the software targeting their hardware.

Even if you could compile CUDA for Intel and AMD, it wouldn't perform well. When you program a GPU you aren't just writing task specific code, you are also writing hardware specific code. So having developer mindshare matters much more than having a nice programming language.

In ML many people write pytorch and not CUDA. But even in ML the choice of precision is driven by the data types Nvidia can deal with efficiently - this is a moat which has nothing to do with CUDA.

I am an ML researcher and I do, so there is at least one. But granted there are not many. I guess it depends on how cutting edge you want to be.

I write CUDA C++ for signal processing, not related to ML in any way. It's a hard requirement from my customer.

Yes. Writing CUDA that calls cuBLAS or CUB is still writing CUDA. Lots of kernels and functions (functors, etc) are "business requirements" more so than math libraries. It's no different than the CPU code world: there are far more CRUD apps than BLAS libraries written, and writing a CRUD app that calls a BLAS library doesn't mean you're not "writing CPU code". Someone has to write those systems of linear equations for BLAS to solve.

The world is deeper than just assembly and BLAS tuning, and you can get extremely far in CUDA just by gluing together the primitives they give. Python is popular in the AI/ML space, but far from the only way to do that.

Yeah, in the ML space you don't need to, but in engineering and HPC it's still really popular. Perhaps in some universe we'll replace C++ with ONNX.

or use tvm

> nobody writes any CUDA

That's an extreme stretch, and far from the truth.

Many people write CUDA, both in industry and academia.

What I said was "to first order nobody writes any CUDA". Using "to first order" in that way is probably an abuse of terminology, but my intent was to say the majority of people using GPUs do not write CUDA, not that literally nobody does (which would be absurd).