

by JonChesterfield 554 days ago

> Genuinely, why?

> ... this seems like a technology that raises the floor not the ceiling of what is possible.

The root cause reason for this project existing is to show that GPU programming is not synonymous with CUDA (or the other offloading languages).

It's nominally to help people run existing code on GPUs. Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets. This is obvious to the implementation but seems largely missed by application developers. Lots of people think GPUs can only do floating point math.

Especially on an APU, where the GPU units and the CPU cores can hammer on the same memory, it is a travesty to persist with the "offloading to accelerator" model. Raw C++ isn't an especially sensible language to program GPUs in but it's workable and I think it's better than CUDA.

3 comments

krackers 554 days ago

>Disregarding that use case, it shows that GPUs can actually do things like fprintf or open sockets.

Can you elaborate on this? My mental model of a GPU is basically a huge vector coprocessor. How would things like printf or sockets work directly from the GPU when they require syscalls to trap into the OS kernel? Given that the kernel code is running on the CPU, that seems to imply that there needs to be a handover at some point. Or conversely, even if there was unified memory and the GPU could directly address memory-mapped peripherals, you'd basically need to reimplement drivers, wouldn't you?

JonChesterfield 554 days ago

It's mostly terminology and conventions. On the standard system setup, the Linux kernel running in a special processor mode does these things. Linux userspace asks the kernel to do stuff using syscalls and memory which both kernel and userspace can access. E.g. the io_uring register followed by writing packets into the memory.

What the GPU has is read/write access to memory that the CPU can also access. And network peripherals etc. You can do things like alternately compare-and-swap on the same page from x64 threads and amdgpu kernels and it works, possibly not quickly on some systems. That's also all that the x64 CPU threads have though, modulo the magic syscall instruction to ask the kernel to do stuff.
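A minimal sketch of the "alternately compare-and-swap on the same page" idea, in plain C11. The names `take_turn`, `turn_word_init`, `PARITY_CPU` and `PARITY_GPU` are made up for illustration, and both sides here are ordinary host calls. On a real system the word would live in memory mapped into both address spaces, and the GPU side would be compiled for the device with atomics visible system-wide; that part is assumed, not shown.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical agent ids: the CPU moves on even values, the GPU on odd ones. */
enum { PARITY_CPU = 0, PARITY_GPU = 1 };

/* Put the shared word into its starting state (CPU goes first). */
void turn_word_init(_Atomic unsigned *word) {
    atomic_init(word, 0u);
}

/*
 * Try to take one turn. Succeeds only when the low bit of the word
 * matches `parity`; the compare-and-swap then bumps the word by one,
 * which hands the next turn to the other side.
 */
bool take_turn(_Atomic unsigned *word, unsigned parity) {
    unsigned seen = atomic_load_explicit(word, memory_order_acquire);
    if ((seen & 1u) != parity) {
        return false; /* not our turn yet */
    }
    return atomic_compare_exchange_strong_explicit(
        word, &seen, seen + 1u,
        memory_order_acq_rel, memory_order_acquire);
}
```

Each side would spin on `take_turn` with its own parity. Nothing in the function cares which kind of processor is executing it, which is roughly the point being made above.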

People sometimes get quite cross at my claim that the GPU can do fprintf. Cos actually all it can do is write numbers into shared memory or raise interrupts such that the effect of fprintf is observed. But that's also all the userspace x64 threads do, and this is all libc anyway, so I don't see what people are so cross about. You're writing C, you call `fprintf(stderr, "Got to L42\n");` or whatever, and you see the message on the console.

If fprintf compiles into a load of varargs mangling with a fwrite underneath, and the varargs stuff runs on the GPU silicon and the fwrite goes through a staging buffer before some kernel thread deals with it, that seems fine.
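A rough sketch of that split, assuming a single producer and a single consumer. The varargs formatting runs on the producer side (standing in for the GPU) and the actual `fwrite` happens in a separate drain step (standing in for the host thread). The types and functions `staging_buffer`, `staged_fprintf` and `staging_drain` are hypothetical names for this example and are not the interface of the libc being discussed.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define SLOT_BYTES 128
#define SLOT_COUNT 8

/* One message slot in memory that both sides can read and write. */
typedef struct {
    _Atomic int ready;      /* 0 = free, 1 = filled and waiting for the host */
    size_t length;          /* number of valid bytes in `bytes` */
    char bytes[SLOT_BYTES];
} staging_slot;

typedef struct {
    staging_slot slots[SLOT_COUNT];
} staging_buffer;

void staging_init(staging_buffer *sb) {
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        atomic_init(&sb->slots[i].ready, 0);
        sb->slots[i].length = 0;
    }
}

/* Producer side: format into the first free slot, then publish it.
 * Returns the number of bytes staged, or -1 if no slot is free.
 * Assumes one producer; several producers would need to claim a slot
 * with a compare-and-swap first. */
int staged_fprintf(staging_buffer *sb, const char *fmt, ...) {
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        staging_slot *slot = &sb->slots[i];
        if (atomic_load_explicit(&slot->ready, memory_order_acquire) != 0) {
            continue;
        }
        va_list args;
        va_start(args, fmt);
        int n = vsnprintf(slot->bytes, SLOT_BYTES, fmt, args);
        va_end(args);
        if (n < 0) {
            return -1;
        }
        /* vsnprintf truncates long messages to SLOT_BYTES - 1 characters. */
        slot->length = (size_t)n < SLOT_BYTES ? (size_t)n : SLOT_BYTES - 1;
        atomic_store_explicit(&slot->ready, 1, memory_order_release);
        return (int)slot->length;
    }
    return -1;
}

/* Consumer side: write every published slot to `out` and free it.
 * Returns the total number of bytes written. */
size_t staging_drain(staging_buffer *sb, FILE *out) {
    size_t written = 0;
    for (size_t i = 0; i < SLOT_COUNT; i++) {
        staging_slot *slot = &sb->slots[i];
        if (atomic_load_explicit(&slot->ready, memory_order_acquire) == 1) {
            written += fwrite(slot->bytes, 1, slot->length, out);
            atomic_store_explicit(&slot->ready, 0, memory_order_release);
        }
    }
    return written;
}
```

From the caller's point of view, `staged_fprintf(&sb, "Got to L%d\n", 42)` followed at some later point by `staging_drain(&sb, stderr)` on the other side has the same visible effect as calling `fprintf` directly.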

I'm pretty sure you could write to an nvme drive directly from the gpu, no talking to the host kernel at all, at which point you've arguably implemented (part of?) a driver for it. You can definitely write to network cards from them, without using any of this machinery.

saagarjha 553 days ago

We don't actually allow a GPU to directly fprintf, because GPU can't syscall. Only userspace can do that. You can have userspace keep polling and then do it on behalf of the GPU, but that's not the GPU doing it.

adrian_b 553 days ago

The GPU could do the equivalent of fprintf if the peripherals concerned used only memory-mapped I/O and the IOMMU were configured to let the GPU access those peripherals directly, without any involvement from the OS kernel that runs on the CPU.

This is the same as on the CPU, where the kernel can allow a user process to access directly a peripheral, without using system calls, by mapping that peripheral in the memory space of the user process.

In both cases the peripheral must be assigned exclusively to the GPU or the user process. What is lost by not using system calls is the ability to share the peripheral between multiple processes, but the performance for the exclusive user of the peripheral can be considerably increased. Of course, the complexity of the user process or GPU code is also increased, because it must include the equivalent of the kernel device driver for that peripheral.

jhuber6 553 days ago

At some point I was looking into using io_uring for something like this. The uring interface just works off of `mmap()` memory, which can be registered with the GPU's MMU. There's a submission polling setting, which means that the GPU can simply write to the pointer and the kernel will eventually pick up the write syscall associated with it. That would allow you to use `snprintf` locally into a buffer and then block on its completion. The issue is that the kernel thread goes to sleep after some time, so you'd still need a syscall from the GPU to wake it up. AMD GPUs actually support software level interrupts which could be routed to a syscall, but I didn't venture too deep down that rabbit hole.

fulafel 553 days ago

File I/O would be a can of worms. But "I want to use fprintf" specifically: stdio files don't need to be backed by Unix FDs. See e.g. https://www.gnu.org/software/libc/manual/html_node/Other-Kin... and e.g. fmemopen().
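For illustration, a small example of a stdio stream with no file descriptor behind it, using POSIX `fmemopen()`. The helper name `format_into_memory` is made up for this sketch; the stdio calls are the standard ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

/*
 * Run an ordinary fprintf against a stream backed by caller-supplied
 * memory rather than a file descriptor. Returns the number of bytes
 * fprintf reported, or -1 if the stream could not be opened or closed.
 */
int format_into_memory(char *buf, size_t cap, int line) {
    FILE *stream = fmemopen(buf, cap, "w");
    if (stream == NULL) {
        return -1;
    }
    int n = fprintf(stream, "Got to L%d\n", line);
    /* Closing flushes the text into buf and, if there is room,
     * adds a terminating null byte. */
    if (fclose(stream) != 0) {
        return -1;
    }
    return n;
}
```

Nothing here issues a write syscall; the formatted bytes simply land in `buf`, which could just as well be a buffer that some other agent reads later.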

rbanffy 554 days ago

> Lots of people think GPUs can only do floating point math.

IIRC, every Raspberry Pi is brought up by the GPU setting up the system before the CPU is brought out of reset and the bootloader looks for the OS.

> it is a travesty to persist with the "offloading to accelerator" model.

Operating systems would need to support heterogeneous processors running programs with different ISAs accessing the same pools of memory. I'd LOVE to see that. It'd be extremely convenient to have first-class processes running on the GPU MIMD cores.

I'm not sure there is much research done in that space. I believe IBM mainframe OSs have something like that because programmers are exposed to the various hardware assists that run as coprocessors sharing the main memory with the OS and applications.

als0 554 days ago

> I'm not sure there is much research done in that space.

There is. And the finest example I can think of is Barrelfish https://barrelfish.org

rbanffy 554 days ago

Interesting - it resembles a network of heterogeneous systems that can share a memory space used primarily for explicit data exchange. Not quite what I was imagining, but probably much simpler to implement than a Unix where the kernel can see processes running on different ISAs on a shared memory space.

I guess hardware availability is an issue, as there aren't many computers with, say, an ARM, a RISC-V, an x86, and an AMD iGPU sharing a common memory pool.

OTOH, there are many where a 32-bit ARM shares the memory pool with 64-bit cores. Usually the big cores run applications while the small ARM does housekeeping or other low-latency tasks.

als0 554 days ago

> Not quite what I was imagining, but probably much simpler to implement than a Unix where the kernel can see processes running on different ISAs on a shared memory space.

Indeed. The other argument is that treating the computer as a distributed system can make it scale better to say hundreds of cores compared to a lock-based SMP system.

rbanffy 553 days ago

> treating the computer as a distributed system

Sure, but where's the fun in that?

Up to GPGPUs, there was no reason to build a machine with multiple CPUs of different architectures except running different OSs on them (such as the Macs, Suns and Unisys mainframes with x86 boards for running Windows side-by-side with a more civilized OS). With GPGPUs you have machines with a set of processors that are good at many things but not great at SIMD, and one that's awesome at SIMD but sucks for most other things.

And, as I mentioned before, there are lots of ARM machines with 64-bit and ultra-low-power 32-bit cores sharing the same memory map. Also, even x86 variants with different ISA extensions can be treated as different architectures by the OS - Intel had to limit the fast cores of its early asymmetric parts because the low-power cores couldn't do AVX512 and OSs would not support migrating a process to the right core on an invalid instruction fault.

saagarjha 553 days ago

The problem is that GPUs are kind of bad at being general-purpose, so it doesn't really make sense to expose the hardware that way.

rbanffy 553 days ago

If the OS supports it, you can make programs that start threads on CPUs and GPUs and let those communicate. You run the SIMD-ish functions on the GPUs and the non-SIMD-heavy functions on the CPU cores.

I have a strong suspicion GPUs aren't as bad at general-purpose stuff as we perceive and we underutilize them because it's inconvenient to shuttle data over an architectural wall that's not really there in iGPUs.

Maybe it doesn't make sense, but it'd be worth looking into just to know where the borders of the problem lie.

saagarjha 550 days ago

Nah, they're pretty bad. They don't speculate or prefetch nearly as well as CPUs, and most code kind of relies on that to be fast. If you are programming for a GPU and you want to go fast you generally have to work quite hard for it.

einpoklum 554 days ago

> The root cause reason for this project existing is to show that GPU programming is not synonymous with CUDA (or the other offloading languages).

1. The ability to use a particular library does not reflect much on which languages can be used.

2. Once you have PTX as a backend target for a compiler, obviously you can use all sorts of languages on the frontend - which NVIDIA's drivers and libraries won't even know about. Or you can just use PTX as your language - making your point that GPU programming is not synonymous with CUDA C++.

> It's nominally to help people run existing code on GPUs.

I'm worried you might be right. But - we should really not encourage people to run existing CPU-side code on GPUs, that's rarely (or maybe never?) a good idea.

> Raw C++ isn't an especially sensible language to program GPUs in but it's workable and I think it's better than CUDA.

CUDA is an execution ecosystem. The programming language for writing kernel code is "CUDA C++", which _is_ C++, plus a few builtin functions ... or maybe I'm misunderstanding this sentence.

JonChesterfield 554 days ago

GPU offloading languages - cuda, openmp etc - work something like:

1. Split the single source into host parts and gpu parts

2. Optionally mark up some parts as "kernels", i.e. have entry points

3. Compile them separately, maybe for many architectures

4. Emit a bunch of metadata for how they're related

5. Embed the GPU code in marked up sections of the host executable

6. Embed some startup code to find GPUs into the x64 parts

7. At runtime, go crawling around the elf section launching kernels

This particular library (which happens to be libc) is written in C++ and compiled with -ffreestanding for an amdgpu target, to LLVM bitcode. If you build a test, it compiles to an amdgpu elf file - no x64 code in it, no special metadata, no elf-in-elf structure. The entry point is called _start. There's a small "loader" program which initialises hsa (or cuda) and passes it the address of _start.

I'm not convinced by the clever convenience cut-up-and-paste-together style embraced by cuda or openmp. This approach brings the lack of magic to the forefront. It also means we can add it to openmp etc when the reviews go through so users of that suddenly find fopen works.

einpoklum 554 days ago

CUDA C++ _can_ work like that. But I would say that these are mostly kiddie wheels for convenience. And because, in GPU programming, performance is king, most (?) kernel developers are likely to eventually need to drop those wheels. And then:

* No single source (although some headers might be shared)

* Kernels are compiled and linked at runtime, for the platform you're on, but also, in the general case, with extra definitions not known a priori (and which are different for different inputs / over the course of running your program), and which have a massive effect on the code.

* You may or may not use some kind of compiled kernel caching mechanism, but you certainly don't have all possible combinations of targets and definitions available, since that would be millions of compiled kernels.

It should also be mentioned that OpenCL never included the kiddie wheels to begin with; although I have to admit it makes it less convenient to start working with.