Libc on x64 is roughly a bunch of userspace code layered over syscall, which traps into the kernel. From the caller's side, syscall looks like a function that takes six integer registers and writes results to some of those same registers.
Libc on nvptx or amdgpu is likewise a bunch of userspace code layered over a "syscall", except here the "syscall" is a function that takes eight integers per lane on the GPU and copies them to the host (x64 or whatever other architecture that is). You'll find it in a header called rpc.h, which is the same code compiled for both host and GPU. Some time later a thread on the host reads those integers, does whatever they asked for (e.g. calls the host syscall on the next six integers), and possibly copies values back.
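To make the shape of that "syscall" concrete, here is a minimal single-process sketch of the idea. It is not the code in rpc.h: the names (`Slot`, `gpu_send`, `host_poll`, `gpu_recv`, `OP_ADD`) are made up for illustration, the "GPU" and "host" are just two sets of functions sharing one struct, and the plain `pending` flag stands in for the atomics and memory fences a real cross-device channel would need.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SLOT_WORDS = 8 };   /* eight integers per lane, as described above */
enum { OP_ADD = 1 };       /* hypothetical opcode, for illustration only */

/* Hypothetical shared mailbox visible to both sides. */
typedef struct {
    uint64_t data[SLOT_WORDS];
    int pending;           /* 1 while a request is waiting for the host */
} Slot;

/* "GPU" side: copy the eight integers into the shared slot. */
static void gpu_send(Slot *s, const uint64_t args[SLOT_WORDS]) {
    assert(!s->pending);
    memcpy(s->data, args, sizeof s->data);
    s->pending = 1;
}

/* Host side: if a request is waiting, service it and write the result back.
 * Returns 1 if a request was handled, 0 if there was nothing to do. */
static int host_poll(Slot *s) {
    if (!s->pending)
        return 0;
    switch (s->data[0]) {  /* word 0 selects the operation in this sketch */
    case OP_ADD:
        s->data[0] = s->data[1] + s->data[2];
        break;
    default:
        s->data[0] = UINT64_MAX;  /* unknown request */
        break;
    }
    s->pending = 0;
    return 1;
}

/* "GPU" side: read the value the host copied back. */
static uint64_t gpu_recv(const Slot *s) {
    assert(!s->pending);
    return s->data[0];
}
```

Using word 0 as an opcode is an assumption of this sketch; it leaves seven words for arguments, which lines up with the "next six integers" being forwarded to a host syscall.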
puts probably copies the string to the host 7*8 bytes at a time, reassembles it on the host, then passes it to the host implementation of puts. We should be able to eliminate that copy on some architectures. Some other functions run wholly on the GPU: sprintf, for example, shouldn't need to talk to the host, but fprintf will.
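A rough sketch of what copying a string 7*8 bytes at a time could look like. This is one guess at the layout, not the actual implementation: it assumes one of the eight integers carries the chunk length and the other seven carry 56 bytes of payload. `Packet`, `pack_chunk`, `unpack_chunk` and `send_string` are hypothetical names, and both sides run in one process here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { WORDS = 8, PAYLOAD = 7 * 8 };  /* 7 payload words of 8 bytes each */

typedef struct { uint64_t w[WORDS]; } Packet;

/* "GPU" side: pack up to 56 bytes of src into one packet.
 * Word 0 holds the byte count (an assumption of this sketch). */
static size_t pack_chunk(Packet *p, const char *src, size_t remaining) {
    size_t n = remaining < (size_t)PAYLOAD ? remaining : (size_t)PAYLOAD;
    memset(p, 0, sizeof *p);
    p->w[0] = n;
    memcpy(&p->w[1], src, n);
    return n;
}

/* Host side: append one packet's payload to dst and keep it terminated.
 * Returns the new number of bytes stored in dst. */
static size_t unpack_chunk(const Packet *p, char *dst, size_t used, size_t cap) {
    size_t n = (size_t)p->w[0];
    assert(n <= (size_t)PAYLOAD && used + n < cap);
    memcpy(dst + used, &p->w[1], n);
    dst[used + n] = '\0';
    return used + n;
}

/* Drive both sides: send s in chunks and reassemble it in host_buf. */
static size_t send_string(const char *s, char *host_buf, size_t cap) {
    size_t len = strlen(s), sent = 0, used = 0;
    assert(cap > 0);
    host_buf[0] = '\0';
    while (sent < len) {
        Packet p;
        sent += pack_chunk(&p, s + sent, len - sent);
        used = unpack_chunk(&p, host_buf, used, cap);
    }
    return used;
}
```

Once `send_string` returns, the host holds an ordinary NUL-terminated string that it could hand to its own puts. A 130-byte string takes three packets in this scheme (56 + 56 + 18 bytes).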
The GPU libc is fun from a design perspective because it can run code on either side of that communication channel as we see fit. E.g. printf's floating-point handling currently seems to need a large number of registers on the GPU, so we may move some of that work to the host to reduce register usage (and so get higher occupancy).