Hacker News new | ask | show | jobs
by suuuuuuuu 62 days ago
If you think C++ is the best here, then I don't think you've actually worked in this space nor appreciated the actual problems these languages try to solve. In particular because you can't program accelerators with C++.

Memory bandwidth is often the problem, yes. Language abstractions for performance aim to, e.g., automatically manage caches (that must be handled manually in performant GPU code, for instance) with optimized memory tiling and other strategies. Kernel fusion is another nontrivial example that improves effective bandwidth.

Adding on the diversity of hardware that one needs to target (both within and among vendors), i.e., portability not just of function but of performance, makes the need for better tooling abundantly obvious. C++ isn't even an entrant in this space.

2 comments

What?!?

NVidia designs CUDA hardware specifically for the C++ memory model, they went through the trouble to refactor their original hardware across several years, so that all new cards would follow this model, even if PTX was designed as polyglot target.

Additionally, ISO C++ papers like senders/receivers are driven by NVidia employees working on CUDA.

CUDA is not C++. CUDA for GPU kernels is its own language. That's the actual problem requiring new languages or abstractions.
Says those that don't know CUDA.

You can program CUDA in standard C++20, with CUDA libraries hidding the language extensions.

I love when C and C++ dialects are C and C++ when it matters, and not when it doesn't help to sell the ideas being portrayed.

Sorry, I wasn't aware of these developments (having abandoned CUDA for hardware-agnostic solutions before 2020). It doesn't change my point anyway, if it's specific to a single vendor.

I'm extremely dubious that such an opaque abstraction can actually solve the (true) problem. "Not having to write CUDA" is not enough - how do you tune performance? Parallelization strategies, memory prefetching and arrangement in on-chip caches, when to fuse kernels vs. not... I don't doubt the compiler can do these things, but I do doubt that it can know at compile time what variants of kernel transformations will optimize performance on any given hardware. That's the real problem: achieving an abstraction that still gives one enough control to achieve peak performance.

Edit: you tell me if I'm wrong, but it seems that std::par can't even use shared memory, let alone let one control its usage? If so, then my point stands: C++ is not remotely relevant. Again, avoiding writing CUDA (etc.) doesn't solve the real problem that high-performance language abstractions aim to address.

So what would be such an HPC language that you're so fond of? A quick web search reveals only languages that use C++/CUDA code as a back end (python), are new and experimental (Julia) or FORTRAN. For what you're talking about none seem all to good, so you've peaked my curiosity.
See https://arxiv.org/abs/2512.17101. I've used some of the tools in the stack they describe (and see Sec 2 for an overview of others). JAX/XLA/etc. are somewhat similar, though still without user control over transformations.

Perhaps part of the reason for the bad takes in this thread is due to taking "language" overly literally (perhaps also the fault of the linked blog post itself). I think one thesis of the above tooling is that, when tuning and generating code (CUDA, OpenCL, what have you) at runtime, the best "languages" for these abstractions are, amusingly, scripting languages like Python. Having CUDA/etc. as a back end without having to hand-write/-transform/-optimize it is indeed the point.

If CUDA is C++ then I'd like to know how you throw and catch exceptions in CUDA kernels.
The same way as people writing C++ code as Google employees do, including LLVM and Chrome.
Wait what!? I have been programming CUDA since 2009 and specifically remember it being pushed to C++ as main development language for the first few years, after a brief "CUDA C extension" period.
CUDA variants extend several programming languages, including C, C++ and Fortran.

None of the extended languages is the same as the same as the base language, in the same way like OpenMP C++ in not the same as C++ or OpenMP Fortran is not the same as Fortran or SYCL is not the same as C++.

The extended languages include both extensions and restrictions of the base language. In the part of a program that will run on a GPU you can do things that cannot be done in the base language, but there also parts of the base language, e.g. of C++, which are forbidden.

All these extended languages have the advantage that you can write in a single source file a complete multithreaded program, which has parts running concurrently on a CPU and part running concurrently on a GPU, but for the best results you must know the rules that apply to the language accepted by each of them. It is possible to write program that run without modification on either a CPU or a GPU, but this is paid by a lower performance on any of them, because such a program uses only generic language features that work on any of them, instead of taking advantage of specific features.

CUDA is not C++. CUDA for GPU kernels is its own language. That's the actual problem requiring new languages or abstractions.