We don't actually allow a GPU to call fprintf directly, because a GPU can't make syscalls. Only userspace code running on the CPU can do that. You can have userspace keep polling and then do it on behalf of the GPU, but that's not the GPU doing it.
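A minimal sketch of that polling arrangement, in plain C with the device side simulated on the CPU. The `mailbox` struct and the names `device_post` and `host_poll_once` are made up for illustration; a real setup would place the mailbox in memory visible to both the GPU and the host and would need whatever memory-ordering guarantees that platform actually provides.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MSG_CAP 256

/* Hypothetical one-slot mailbox, assumed to live in memory both sides can see. */
struct mailbox {
    atomic_int ready;      /* 0 = empty, 1 = message waiting for the host */
    char text[MSG_CAP];
};

/* Device side (simulated here): format into the buffer, then publish it. */
static bool device_post(struct mailbox *mb, const char *msg)
{
    if (atomic_load_explicit(&mb->ready, memory_order_acquire) != 0)
        return false;                  /* host has not drained the slot yet */
    snprintf(mb->text, MSG_CAP, "%s", msg);
    atomic_store_explicit(&mb->ready, 1, memory_order_release);
    return true;
}

/* Host side: one polling step. Returns true if a message was written out. */
static bool host_poll_once(struct mailbox *mb, FILE *out)
{
    if (atomic_load_explicit(&mb->ready, memory_order_acquire) != 1)
        return false;
    fputs(mb->text, out);              /* the real write syscall happens on the CPU */
    atomic_store_explicit(&mb->ready, 0, memory_order_release);
    return true;
}
```

The host would call `host_poll_once` in a loop (or from a helper thread). The device never touches the file; it only flips a flag in shared memory.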
The GPU could do the equivalent of fprintf if the peripherals concerned used only memory-mapped I/O and the IOMMU were configured to let the GPU access those peripherals directly, without any involvement from the OS kernel that runs on the CPU.
This is the same as on the CPU, where the kernel can allow a user process to access a peripheral directly, without using system calls, by mapping that peripheral into the memory space of the user process.
In both cases the peripheral must be assigned exclusively to the GPU or the user process. What is lost by not using system calls is the ability to share the peripheral between multiple processes, but the performance for the exclusive user of the peripheral can be considerably increased. Of course, the complexity of the user process or GPU code is also increased, because it must include the equivalent of the kernel device driver for that peripheral.
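For the CPU case, a rough sketch of what "mapping the peripheral into the process" looks like on Linux. The helper names (`map_registers`, `reg_write`, `reg_read`) are invented here, and the path is whatever device node exposes the registers on a given system (a UIO device, for example); that part is an assumption, not something from the discussion above.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of the device node at `path` into this process.
 * Returns NULL on failure. `path` is an assumption: it stands for any
 * node that exposes the peripheral's registers for mmap. */
static volatile uint32_t *map_registers(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* After mapping, register access is ordinary loads and stores: no syscalls. */
static void reg_write(volatile uint32_t *regs, size_t index, uint32_t value)
{
    regs[index] = value;
}

static uint32_t reg_read(volatile uint32_t *regs, size_t index)
{
    return regs[index];
}
```

Everything the kernel driver would normally do (register layout, ordering, error handling) now has to live in the code that calls `reg_write` and `reg_read`, which is the added complexity mentioned above.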
At some point I was looking into using io_uring for something like this. The uring interface just works off of `mmap()` memory, which can be registered with the GPU's MMU. There's a submission polling setting, which means that the GPU can simply write to the pointer and the kernel will eventually pick up the write syscall associated with it. That would allow you to use `snprintf` locally into a buffer and then block on its completion. The issue is that the kernel polling thread goes to sleep after some idle time, so you'd still need a syscall on behalf of the GPU to wake it up. AMD GPUs actually support software-level interrupts which could be routed to a syscall, but I didn't venture too deep down that rabbit hole.