by JonChesterfield
554 days ago
Wonderfully you don't need to trust my words, you've got my code :) If semantics are different, that's a bug/todo.

It'll have worse latency than a CPU thread making the same kernel request. Throughput shouldn't be way off. The GPU writes some integers to memory that the CPU will need to read, and then write other integers, and then load those again. Plus whatever the x64 syscall itself does. That's a bunch of cache line invalidation and reads. It's not as fast as if the hardware guys were on board with the strategy, but I'm optimistic it can be useful today and thus help justify changing the hardware/driver stack.

The whole point of libc is to paper over the syscall interface. If you start from musl, "syscall" can be a table of function pointers or asm. Glibc is more obstructive. This libc open codes a bunch of things, with an rpc.h file dealing with synchronising memcpy of arguments to/from threads running on the CPU, which get to call into the Linux kernel directly. It's mainly carefully placed atomic operations to keep the data accesses well defined.

There's also nothing in here which random GPU devs can't build themselves. The header files are (now) self contained if people would like to use the same mechanism for other functionality and don't want to hand-roll the data structure. The most subtle part is getting this to work correctly under arbitrary warp divergence on Volta. It should be an out of the box thing under OpenMP early next year too.
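For anyone who wants a feel for the round trip described above, here is a minimal sketch in plain host C++. It is not the code from rpc.h: `Mailbox`, `client_send`, `server_poll`, `client_receive` and the two toy handlers are all hypothetical names, and the "syscall" is just a table of function pointers doing arithmetic. The client plays the GPU side and the server plays the CPU thread. The only point is to show where the release/acquire pairs go so the plain argument and result fields are accessed in a well-defined order.

```cpp
#include <atomic>
#include <cstdint>

// Stand-in for "syscall as a table of function pointers" (illustrative only).
using Handler = uint64_t (*)(uint64_t, uint64_t);
static uint64_t op_add(uint64_t a, uint64_t b) { return a + b; }
static uint64_t op_mul(uint64_t a, uint64_t b) { return a * b; }
static const Handler kHandlers[] = {op_add, op_mul};

// Hypothetical single-slot mailbox living in memory both sides can see.
// The slot is idle when request == reply, and in flight otherwise.
struct Mailbox {
  std::atomic<uint32_t> request{0};  // written only by the client ("GPU")
  std::atomic<uint32_t> reply{0};    // written only by the server ("CPU")
  uint64_t opcode = 0;               // plain data, guarded by the two counters
  uint64_t args[2] = {0, 0};
  uint64_t result = 0;
};

// Client: write the arguments, then publish them with a release store.
// Returns false if the previous call has not been answered yet.
bool client_send(Mailbox &m, uint64_t opcode, uint64_t a, uint64_t b) {
  uint32_t req = m.request.load(std::memory_order_relaxed);
  if (m.reply.load(std::memory_order_acquire) != req) return false;
  m.opcode = opcode;
  m.args[0] = a;
  m.args[1] = b;
  m.request.store(req + 1, std::memory_order_release);
  return true;
}

// Server: the acquire load makes the arguments visible, the release store
// publishes the result. Returns false if there is nothing to do.
bool server_poll(Mailbox &m) {
  uint32_t req = m.request.load(std::memory_order_acquire);
  if (req == m.reply.load(std::memory_order_relaxed)) return false;
  const uint64_t count = sizeof(kHandlers) / sizeof(kHandlers[0]);
  m.result = (m.opcode < count) ? kHandlers[m.opcode](m.args[0], m.args[1])
                                : UINT64_MAX;  // unknown opcode
  m.reply.store(req, std::memory_order_release);
  return true;
}

// Client: only meaningful after a successful client_send.
// Returns false while the server has not replied yet.
bool client_receive(Mailbox &m, uint64_t &out) {
  uint32_t req = m.request.load(std::memory_order_relaxed);
  if (m.reply.load(std::memory_order_acquire) != req) return false;
  out = m.result;
  return true;
}
```

In a real setup the client would spin on `client_receive` from the device while a host thread loops on `server_poll`. That is the write, read, write, read sequence that costs the latency. The sketch ignores everything hard: multiple slots, many lanes sharing a slot, and warp divergence.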
|
My friend, it's so incredibly bold of you to claim credit for this work when:
1. Joe presented it
2. Joe's name is the only name on the git blame
3. I know Joe and I know he did the lion's share of the work
And so I'll repeat: Joe himself calls it rpc so I'm gonna keep calling it rpc and not syscall.