Hacker News
by bartwr 917 days ago
If they create a better tool chain, ecosystem, and programming experience than CUDA, and it's compatible with all computational platforms at their peak performance - awesome! Everyone wins!

Until then, it's a bit of a funny claim, especially considering what a failure OpenCL was (programmer experience and fading support). Or trying to do GPGPU with compute shaders in DX/GL/Vulkan. Are they really "motivated"? Because they had so many years and the results are miserable... And I don't think they invested even a fraction of what got invested into CUDA. Put your money where your mouth is.

2 comments

I wish AMD or Intel would just ship a giant honking CPU with 1000s of cores that doesn't need any special purpose programming languages to utilize. Screw co-processors. Screw trying to make yet another fucked up special purpose language -- whether that's C/C++-with-quirks or a half-assed Python clone or whatever. Nuts to that. Just ship more cores and let me use real threads in regular programming languages.
It doesn't work if you're going against GPUs. All the nice goodies we are accustomed to on large desktop x86 machines with gigantic caches and huge branch predictor area and OOO execution engines -- the features that yield the performance profile we expect -- simply do not translate or scale up to thousands of cores per die. To scale that up, you need to redesign the microarchitecture in a fundamental way to allow more compute-per-mm^2 of area, but at that point none of the original software will work in any meaningful capacity because the pipeline is so radically different, it might as well be a different architecture entirely. That means you might as well just write an entirely different software stack, too, and if you're rewriting the software, well, a different ISA is actually the easy part. And no, shoving sockets on the mobo does not change this; it doesn't matter if it's a single die or multi socket. The same dynamics apply.
While the first >1000 core x86 processor is probably a little ways out, Intel is releasing a 288-core x86 processor in the first half of 2024 (Sierra Forest). I assume AMD will have something similarly high core in 2024-25 as well.
To be clear, you can probably make a 1000 core x86 machine, and those 1000 cores can probably even be pretty powerful. I don't doubt that. I think Azure even has crazy 8-socket multi-sled systems doing hundreds of cores, today. But this thread is about CUDA. Sierra Forest will get absolutely obliterated by a single A100 in basically any workload where you could reasonably choose between the two as options. I'm not saying they can't exist. Just that they will be (very) bad in this specific competition. I made an edit to my comment to reflect that.

But what you mention is important, and also a reason for the ultimate demise of e.g. Xeon Phi. Intel surely realized they could just scale their existing Xeon designs up-and-out further than expected. Like from a product/SKU standpoint, what is the point of having a 300 core Phi where every core is slow as shit, when you have a 100 core 4-socket Xeon design on the horizon, using an existing battle-tested design that you ship billions of dollars worth every year? Especially when the 300 core Phi fails completely against the competition. By the time Phi died, they were already doing 100-cores-per-socket systems. They essentially realized any market they could have had would be served better by the existing Xeon line and by playing to their existing strengths.

> Intel is releasing a 288-core x86

This made me wonder a couple of things-

What kind of workloads and problems is that best suited for? It’s a lot of cores for a CPU, but for pure math/compute, like with AI training and inference and with graphics, 288 cores is like ~1.5% of the number of threads of a modern GPU, right? Doesn’t it take particular kinds of problems to make a 288 core CPU attractive?

I also wondered if the ratio of the highest core count CPU to GPU has been relatively flat for a while? Which way is it trending- which of CPUs or GPUs are getting more cores faster?

You could do sparse deep learning with much, much larger models with these CPUs. As paradoxical as it might sound, sparse deep learning gets more compute bound as you add more cores.
I'd be curious to learn more about how it's compute bound and what specifically is compute bound. On modern H100s you need ~600 fp8 operations per byte loaded from memory in order to be compute bound, and that's with full 128-byte loads each time. Even integer/fp32 vector operations need quite a few operations to be compute bound (~20 for vector fp32).
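For anyone wondering where thresholds like ~600 and ~20 come from: they look like the roofline "ridge point", i.e. peak throughput divided by memory bandwidth. A rough sketch of that arithmetic follows. The peak figures are approximate H100 SXM datasheet values that I'm assuming here, not numbers taken from the comment, and `ridge_point` is just an illustrative helper name.

```python
# Roofline "ridge point": operations needed per byte loaded before compute,
# rather than memory bandwidth, becomes the limit.
# All hardware figures are ASSUMED approximate H100 SXM datasheet values.

def ridge_point(peak_ops_per_s: float, mem_bytes_per_s: float) -> float:
    """Ops per byte at which a kernel stops being memory bound."""
    return peak_ops_per_s / mem_bytes_per_s

HBM_BANDWIDTH = 3.35e12     # bytes/s (assumed)
PEAK_FP8_DENSE = 1979e12    # ops/s on tensor cores, no sparsity (assumed)
PEAK_FP32_VECTOR = 67e12    # ops/s for plain vector fp32 (assumed)

fp8_ridge = ridge_point(PEAK_FP8_DENSE, HBM_BANDWIDTH)
fp32_ridge = ridge_point(PEAK_FP32_VECTOR, HBM_BANDWIDTH)

print(f"fp8 tensor: ~{fp8_ridge:.0f} ops/byte")
print(f"fp32 vector: ~{fp32_ridge:.0f} ops/byte")
```

A kernel doing fewer operations per byte than the ridge point is limited by memory traffic on that hardware, however many ALUs sit idle.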
288 Cores or Threads? Cuz to my knowledge AMD already has a 128 Core, 256 Thread Processor with the Epyc 9754
Apple might be sort-of trying to build the honking CPU, but it still requires different language extensions and a mix of different programming models.

And what you suggest could be done, but it would likely flop commercially if you made it today, which is why they aren’t doing it. SIMD machines are faster on homogenous workloads, by a lot. It would be a bummer to develop a CPU with thousands of cores that is still tens or hundreds of times slower than a comparably priced GPU.

SIMD isn’t going away anytime soon, or maybe ever. When the workload is embarrassingly parallel, it’s cheaper and more efficient to use SIMD over general purpose cores. Specialized chiplets and co-processors are on the rise too, coinciding with the wane of Moore’s law; specialization is often the lowest hanging fruit for improving efficiency now.

There’s going to be plenty of demand for general programmers, but it's maybe worth keeping in mind the kinds of opportunities that are opening up for people who can learn and develop special purpose hardware and software.

Well, that is what a GPU is. Cuda / openmp etc are attempts at conveniently programming a mixed cpu/gpu system.

If you don't want that, program the GPU directly in assembly or C++ or whatever. A kernel is a thread - program counter, register file, independent execution from the other threads.

There isn't a Linux kernel equivalent sitting between you and the hardware so it's very like bare metal x64 programming, but you could put a kernel abstraction on it if you wanted.

Core isn't very well defined, but if we go with "number of independent program counters live at the same time" it's a few thousand.

X64 cores are vaguely equivalent to GCN compute units, 100 or so of either in a 300W envelope. X64 has two threads and a load of branch prediction / speculation hardware. GCN has 80 threads and swaps between them each cycle. Same sort of idea, different allocation of silicon.

It was called Larrabee and XeonPhi, they botched it, and the only thing left from that effort is AVX.
I used to play with these toys 7-8 years ago. We tried everything, and it was bad at it all.

Traditional compute? The cores were too weak.

Number crunching? Okay-ish but gpus were better.

Useless stuff.

They seemed exceedingly hard to use well but interestingly capable & full of promise. And they were made in a much more primitive software age.

I'd love to hear about what didn't work. OpenMP support seemed ok maybe but OpenMP is just a platform; figuring out software architectures that are mechanically sympathetic to the system is hard. It would be so interesting to see what Xeon Phi might have been if we had Calcite or Velox or OpenXLA or other execution engine/optimizers that can orchestrate usage. The possibility of something like Phi seems so much higher now.

There's such a consensus around Phi tanking, and yes, some people came and tried and failed. But most of those lessons, of why it wasn't working (or was!) never survived the era, never were turned into stories & research that illuminate what Phi really was. My feeling is that most people were staying the course on GPU stuff, and that there weren't that many people trying Phi. I'd like more than the hearsay heaped at Phi's feet to judge by.

Well... Back then in my shop they would just assign programmers to things, together with a couple of mathematicians.

Math guys came up with a list of algorithms to try for a search engine backend.

What we needed was matrix multiplication and maybe some decision tree walking (that was some time ago, trees were still big back then, NNs were seen as too compute-intensive for no clear benefits). So we thought that it might be cool to have a tool that would support both. Phi sounded just right for both.

And things written to AVX-512 did work. Software was surprisingly easy to port.

But then comes the usual SIMD/CPU trouble: every SIMD generation wants a little software rewrite. So for both Phi generations we had to update our code. For things not compatible with the SIMD approach (think tree-walking) it is just a weak x86.

In theory Phi's were universal, in practice what we got was: okay number crunching, bad generic compute.

GPU was somewhat similar: the software stack was unstable, CUDA just did not materialize as a standard yet. But every generation introduced a massive increase in compute available. And boy did NVIDIA move fast...

So GPU situation was: amazing number crunching, no generic compute.

And then there were a few ML breakthrough results which rendered everything that did not look like a matrix multiplication obsolete.

PS I wouldn't take this story too seriously, details may vary.

By any chance, Yandex?
Some observations:

- Very bad performance at existing x86 workloads, so a major selling point was basically not there in practice, because extracting any meaningful performance required a software rewrite anyway. This was an important adoption criterion; if they outright said "All your existing workloads are compatible, but will perform like complete dogshit", why would anyone bother? Compatibility was a big selling point that ended up meaning little in practice, unfortunately.

- Not actually what x86 users wanted. This was at the height of "Intel stagnation" and while I think they were experimenting with lots of stuff, well, in this case, they were serving a market that didn't really want what they had (or at least wasn't convinced they wanted it).

- GPU creators weren't sitting idle and twiddling their thumbs. Nvidia was continuously improving performance and programmability of their GPUs across all segments (gaming, HPC, datacenters, scientific workloads) while this was all happening. They improved their compilers, programming models, and microarchitecture. They did not sit by on any of these fronts.

Ironically the main living legacy of Phi is AVX-512, which people did and still do want. But that kind of gives it all away, doesn't it? People didn't want a new massively multicore microarchitecture. They wanted new vector instructions that were flexible and easier to program than what they had -- and AVX-512 is really much better. They wanted the things they were already doing to get better, not things that were like, effectively a different market.

Anyway, the most important point is probably the last one, honestly. Like we could talk a lot about compiler optimizations or autovectorization. But really, the market that Phi was trying to occupy just wasn't actually that big, and in the end, GPUs got better at things they were bad at, quicker than Phi got better at things it was bad at. It's not dissimilar to Optane. Technically interesting, and I mourn its death, but the competition simply improved faster than the adoption rate of the new thing, and so flash is what we have.

Once you factor in that you have to rewrite software to get meaningful performance uplift, the rest sort of falls into place. Keep in mind that if you have a $10,000 chip and you can only extract 50% of the performance, you more or less have just $5,000 on fire for nothing in return. You might as well go all the way and use a GPU because at least then you're getting more ops/mm^2 of silicon.

I don't disagree anywhere but I don't think any of these statements actually condemn Xeon Phi outright. It didn't work at the time, and doing it with so little software support to tile out workloads well was a big & possibly bad gambit, but I'm not so sure we can condemn the architecture. There seem to be so few folks who made good attempts and succeeded or failed & wrote about it.

I tend to think there was tons of untapped potential still on the table. And that a failure to adopt potential isn't purely Intel's fault. The story we are commenting on is about the rest-of-industry trying to figure out enduring joint strategies, and much of this is chipmaker provided, but it is also informed and helped by plenty of consumers also pouring energy in to figure out what's working and not, trying to push the bounds.

Agreed that anyone going in thinking Xeon Phi would be viable for running a boring everyday x86 workload was going to be sad. To me the promise seemed clear that existing toolchains & code would work, but it was always clear to me there were a bunch of little punycores & massive SIMD units and that doing anything not SIMD intensive wasn't going to go well at all. But what's the current trend? Intel and AMD are both actively building not punycores but smaller cores, with Sierra Forest and Bergamo. E-cores are the grown up Atom we saw here.

Yes the GPGPU folks were winning. They had a huge head start, were the default option. And Intel was having trouble delivering nodes. So yes, Xeon Phi was getting trounced for real reasons. But they weren't architectural issues! It just means the Xeon Phi premise was becoming increasingly handicapped.

As I said I broadly agree everywhere. Your core point about giving the market more of what it already does is well taken, is a river of wisdom we see again and again. But I do think conservative thinking, iterating along, is dangerous thinking that obstructs us from seeing real value & possibility before us. Maybe Intel could have made a better ML chip than the GPGPU market has gotten for years, had things gone differently; I think the industry could perhaps have been glad they had veered onto a new course, but the barriers to that happening & the slow down in Intel delivery & the difficulty bootstrapping new software were all horrible encumbrances which were rightly more than was worth bearing together.

Hence why "they botched it".
The closest Intel got to this was Xeon Phi / Knights Landing https://en.wikipedia.org/wiki/Xeon_Phi with 60+ cores per chip, each able to run 4 threads simultaneously - each of which could run arbitrary x86 code. Discontinued due to low demand in 2020 though.

In practice, people weren’t afraid to roll up their sleeves and write CUDA code. If you wanted good performance you had to think about data parallelism anyways, and at that point you’re not benefiting from x86 backwards compatibility. It was a fascinating dream while it lasted though.

AVX might be going in the right direction, even if AVX-512 was a stretch too far. I was impressed by the llama.cpp performance boost when AVX1 support was added.

There's no intrinsic reason why multiplying matrices requires massive parallelism, in principle it could be done on few cores plus good management of ALUs/memory bandwidth/caches.

What's wrong with compute shaders?
I shipped a dozen products with them (mostly video games), so there's nothing "wrong" that would make them unusable. But programming them and setting up the graphics pipe (and all the passes, structured buffers, compiling, binding, weird errors, and synchronization) is a huge PITA as compared to the convenience of CUDA. Compilers are way less mature, especially on some platforms cough. Some GPU capabilities are not exposed. No real composability or libraries. No proper debugging.
These days, some game engines have done pretty well at making compute shaders easy to use (such as Bevy [1] -- disclaimer, I contribute to that engine). But telling the scientific/financial/etc. community that they need to run their code inside a game engine to get a decent experience is a hard sell. It's not a great situation compared to how easy it is on NVIDIA's stack.

[1]: https://github.com/bevyengine/bevy/blob/main/examples/shader...

I have recently published an AI-related open-source project entirely based on compute shaders https://github.com/Const-me/Cgml and I’m super happy with the workflow. Possible to implement very complicated things without compiling a single line of C++, the software is mostly in C#.

> setting up the graphics pipe

I’ve picked D3D11, as opposed to D3D12 or Vulkan. The 11 is significantly higher level, and much easier to use.

> compiling, binding

The compiler is design-time, I ship them compiled, and integrated into the IDE. I solved the bindings with a simple code generation tool, which parses HLSL and generates C#.

> No proper debugging

I partially agree but still, we have renderdoc.

I understand why you've picked D3D11, but people have to understand that comes with serious limitations. There are no subgroups, which also means no cooperative matrix multiplication ("tensor cores"). For throughput in machine learning inference in particular, there's no way D3D11 can compete with either CUDA or a more modern compute shader stack, such as one based on Vulkan 1.3.
> no subgroups

Indeed, in D3D they are called “wave intrinsics” and require D3D12. But that’s IMO a reasonable price to pay for hardware compatibility.

> no cooperative matrix multiplication

Matrix multiplication compute shader which uses group shared memory for cooperative loads: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...
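For readers who haven't seen the technique: "cooperative loads into group shared memory" usually means the workgroup fetches one tile of each input once and every thread in the group then reuses it. The following is only a CPU-side Python sketch of that tiling idea, not the linked HLSL shader; `tiled_matmul` and `TILE` are hypothetical names for illustration.

```python
TILE = 2  # hypothetical workgroup tile edge, kept tiny for readability

def tiled_matmul(a, b, tile=TILE):
    """C = A @ B for lists of lists, computed one output tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # one "workgroup" per output tile
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Cooperative load: copy one tile of A and one tile of B
                # into fast local storage once per step...
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # ...then every "thread" (i, j) reuses those loaded values.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for kk, a_val in enumerate(a_row):
                            acc += a_val * b_tile[kk][j]
                        c[i0 + i][j0 + j] += acc
    return c

result = tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
print(result)
```

On a GPU the payoff is that each input element is fetched from slow memory once per tile rather than once per thread. This is a different thing from the hardware cooperative-matrix ("tensor core") instructions the parent comment was referring to.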

> tensor cores

When running inference on end-user computers, for many practical applications users don’t care about throughput. They only have a single audio stream / chat / picture being generated, their batch size is a small number often just 1, and they mostly care about latency, not throughput. Under these conditions inference is guaranteed to bottleneck on memory bandwidth, as opposed to compute. For use cases like that, tensor cores are useless.
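A back-of-the-envelope version of that argument, as a sketch: if weight traffic dominates, each weight is read once and used for one multiply-add per batch item, so the FLOPs-per-byte ratio depends only on batch size and weight width. The `RIDGE` value below is a made-up threshold for a hypothetical GPU, not a measured figure, and the model ignores activations and caches.

```python
def matmul_intensity(batch: int, bytes_per_weight: float) -> float:
    """Approximate FLOPs per byte for y = W @ x when weight reads dominate.

    Each weight contributes one multiply-add (2 FLOPs) per batch item,
    so the matrix dimensions cancel out of the ratio.
    """
    return 2.0 * batch / bytes_per_weight

def is_memory_bound(batch: int, bytes_per_weight: float, ridge: float) -> bool:
    """True when the workload sits below the hardware's FLOPs-per-byte ridge."""
    return matmul_intensity(batch, bytes_per_weight) < ridge

RIDGE = 20.0  # hypothetical GPU: FLOPs per byte needed to saturate compute

for batch in (1, 8, 64):
    # 2.0 bytes per weight corresponds to fp16 weights.
    print(batch, matmul_intensity(batch, 2.0), is_memory_bound(batch, 2.0, RIDGE))
```

At batch size 1 with fp16 weights this comes out to about 1 FLOP per byte, far below any plausible ridge point, which is why extra matrix throughput does little for that case.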

> there's no way D3D11 can compete with either CUDA

My D3D11 port of Whisper outperformed original CUDA-based implementation running on the same GPU: https://github.com/Const-me/Whisper/

Sure. It's a tradeoff space. Gain portability and ergonomics, lose throughput. For applications that are throttled by TOPS at low precisions (ie most ML inferencing) then the performance drop from not being able to use tensor cores is going to be unacceptable. Glad you found something that works for you, but it certainly doesn't spell the end of CUDA.
This. It's crazy how primitive the GPU development process still is in the year 2023. Yeah it's gotten better, but there's still a massive gap with traditional development.
It's kinda like building Legos vs building actual Skyscrapers. The gap between compute shaders and CUDA is massive. At least it feels massive because CUDA has some key features that compute shaders lack, and which make it so much easier to build complex, powerful and fast applications.

One of the features that would get compute shaders far ahead compared to now would be pointers and pointer casting - Just let me have a byte buffer and easily cast the bytes to whatever I want. Another would be function pointers. These two are pretty much the main reason I had to stop doing a project in OpenGL/Vulkan, and start using CUDA. There are so many more, however, that make life easier like cooperative groups with device-wide sync, being able to allocate a single buffer with all the GPU memory, recursion, etc.
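To make the two headline requests concrete, here is a small CPU-side sketch of the semantics being asked for. It is plain Python, not shader code: `memoryview.cast` stands in for a C-style pointer cast over a raw byte buffer, and a list of callables stands in for function pointers. The names `OPS` and `apply_op` are hypothetical.

```python
# One raw allocation, reinterpreted in place (no copies are made).
buf = bytearray(16)
raw = memoryview(buf)

as_f32 = raw.cast("f")   # view the 16 bytes as four 32-bit floats
as_u32 = raw.cast("I")   # view the same bytes as four 32-bit unsigned ints
as_f32[0] = 1.5
bits = as_u32[0]         # bit pattern of the float that was just written

# "Function pointers": pick behaviour at run time through a table of callables.
def relu(x):
    return x if x > 0.0 else 0.0

def square(x):
    return x * x

OPS = [relu, square]     # hypothetical op table, indexed by an op code

def apply_op(op_code, x):
    return OPS[op_code](x)

print(hex(bits), apply_op(0, -2.0), apply_op(1, 3.0))
```

In CUDA both patterns are ordinary C++; the complaint above is that GLSL-style shaders make you declare typed buffers up front and offer no equivalent of calling through a pointer.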

Khronos should start supporting C++20 for shaders (basically what CUDA is) and stop the glsl or spirv nonsense.

You might argue for forking off from glsl and SPIR-V for complex compute workloads, but lightweight, fast compilers for a simple language like glsl do solve issues for graphics. Some graphics use cases can't get around shipping a shader compiler to the user. The number of possible shader configurations is often either insanely large or just impossible to enumerate, so on the fly compilation is really the only thing you can do.
Ironically, most people use HLSL with Vulkan, because Khronos has neither the budget nor the people to improve GLSL.

So yet another thing where Khronos APIs are dependent on DirectX evolution.

It used to be that AMD and NVidia would first implement new stuff on DirectX in collaboration with Microsoft, have them as extensions in OpenGL, and eventually as standard features.

Now even the shading language is part of it.

For GPGPU tasks, they lack a lot of useful features that CUDA has like the ability to allocate memory and launch kernels from the GPU. They also generally require you to write your GPU and CPU portions of an algorithm in different languages, while CUDA allows you to intermix your code and share data structures and simple functions between the two.
CUDA = C++ on GPUs. Compute shaders = a subset of C with weird quirks.
There are existing efforts to compile SYCL to Vulkan compute shaders. Plenty of "weird quirks" involved since they're based on different underlying varieties of SPIR-V ("kernels" vs. "shaders") and seem to have evolved independently in other ways (Vulkan does not have the amount of support for numerical computation that OpenCL/SYCL has) - but nothing too terrible or anything that couldn't be addressed by future Vulkan extensions.
A subset that lacks pointers, which makes compute shaders a toy language next to CUDA.
Vulkan 1.3 has pointers, thanks to buffer device address[1]. It took a while to get there, and earlier pointer support was flawed. I also don't know of any major applications that use this.

Modern Vulkan is looking pretty good now. Cooperative matrix multiplication has also landed (as a widely supported extension), and I think it's fair to say it's gone past OpenCL.

Whether we get significant adoption of all this I think is too early to say, but I think it's a plausible foundation for real stuff. It's no longer just a toy.

[1] https://community.arm.com/arm-community-blogs/b/graphics-gam...

Is IREE the main runtime doing Vulkan or are there others? Who should we be listening to (oh wise @raphlinus)?

It's been awesome seeing folks like Keras 3.0 kicking out broad intercompatibility across JAX, TF, PyTorch, powered by flexible execution engines. Looking forward to seeing more Vulkan based runs getting socialized, benchmarked & compared. https://news.ycombinator.com/item?id=38446353

The two I know of are IREE and Kompute[1]. I'm not sure how much momentum the latter has, I don't see it referenced much. There's also a growing body of work that uses Vulkan indirectly through WebGPU. This is currently lagging in performance due to lack of subgroups and cooperative matrix mult, but I see that gap closing. There I think wonnx[2] has the most momentum, but I am aware of other efforts.

[1]: https://kompute.cc/

[2]: https://github.com/webonnx/wonnx

> Vulkan 1.3 has pointers, thanks to buffer device address[1].

> [1] https://community.arm.com/arm-community-blogs/b/graphics-gam...

"Using a pointer in a shader - In Vulkan GLSL, there is the GL_EXT_buffer_reference extension "

That extension is utter garbage. I tried it. It was the last thing I tried before giving up on GLSL/Vulkan and switching to CUDA. It was the nail in the coffin that made me go "okay, if that's the best Vulkan can do, then I need to switch to CUDA". It's incredibly cumbersome, confusing and verbose.

What's needed are regular, simple, C-like pointers.

Compute shaders are not capable of using modern GPU features like tensor cores or many of the other features needed to feed tensor cores data fast enough (e.g. TMA/cp.async.shared)