by JonChesterfield 554 days ago
Wonderfully you don't need to trust my words, you've got my code :)

If semantics are different, that's a bug/todo. It'll have worse latency than a CPU thread making the same kernel request. Throughput shouldn't be way off. The GPU writes some integers to memory that the CPU needs to read; the CPU then writes other integers, which the GPU loads in turn. Plus whatever the x64 syscall itself does. That's a bunch of cache line invalidation and reads. It's not as fast as it would be if the hardware guys were on board with the strategy, but I'm optimistic it can be useful today and thus help justify changing the hardware/driver stack.
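
A minimal sketch of that round trip, assuming a single-slot mailbox in shared memory and using two ordinary CPU threads to stand in for the GPU and the host. The names (`Mailbox`, `serve_one`, `call`, `round_trip`) are hypothetical and are not taken from hostrpc or LLVM libc; the real thing has to handle many slots and whole warps at once.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical single-slot mailbox; illustrative only, not the real layout.
struct Mailbox {
  std::atomic<uint32_t> request_ready{0}; // written by client, read by server
  std::atomic<uint32_t> reply_ready{0};   // written by server, read by client
  uint64_t payload[4] = {};               // arguments in, result out
};

// Stand-in for the CPU side: wait for a request, do the work, publish a reply.
void serve_one(Mailbox &mb) {
  while (mb.request_ready.load(std::memory_order_acquire) == 0)
    std::this_thread::yield();
  // Stand-in for the actual syscall: add the first two arguments.
  mb.payload[0] = mb.payload[0] + mb.payload[1];
  mb.request_ready.store(0, std::memory_order_relaxed);
  // Release makes the payload write visible before the flag flips.
  mb.reply_ready.store(1, std::memory_order_release);
}

// Stand-in for the GPU side: publish arguments, then spin until the reply lands.
uint64_t call(Mailbox &mb, uint64_t a, uint64_t b) {
  assert(mb.reply_ready.load(std::memory_order_relaxed) == 0); // slot is idle
  mb.payload[0] = a;
  mb.payload[1] = b;
  mb.request_ready.store(1, std::memory_order_release);
  while (mb.reply_ready.load(std::memory_order_acquire) == 0)
    std::this_thread::yield();
  uint64_t result = mb.payload[0];
  mb.reply_ready.store(0, std::memory_order_relaxed);
  return result;
}

// Runs one request/reply exchange between the two threads.
uint64_t round_trip(uint64_t a, uint64_t b) {
  Mailbox mb;
  std::thread server(serve_one, std::ref(mb));
  uint64_t result = call(mb, a, b);
  server.join();
  return result;
}
```

Each call costs two flag handoffs in addition to the work itself, which is where the extra latency relative to a plain CPU syscall comes from.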

The whole point of libc is to paper over the syscall interface. If you start from musl, "syscall" can be a table of function pointers or asm. Glibc is more obstructive. This libc open-codes a bunch of things, with an rpc.h file dealing with synchronising memcpy of arguments to/from threads running on the CPU, which get to call into the Linux kernel directly. It's mainly carefully placed atomic operations to keep the data accesses well defined.
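
To make the "table of function pointers" idea concrete, here is a rough sketch assuming a fixed-size packet that arguments are memcpy'd into on one side and out of on the other. The opcodes, the `Packet` layout, and the handler names are all made up for illustration; they are not the contents of rpc.h or musl.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical opcodes; a real libc would have one per forwarded operation.
enum Opcode : uint32_t { OP_ADD = 0, OP_STRLEN = 1, OP_COUNT };

// Hypothetical fixed-size packet: opcode plus raw argument bytes.
struct Packet {
  uint32_t opcode;
  unsigned char data[56];
};

using Handler = int64_t (*)(const unsigned char *data);

// Server-side handler: unpack two integers and add them.
int64_t handle_add(const unsigned char *data) {
  int64_t args[2];
  std::memcpy(args, data, sizeof(args));
  return args[0] + args[1];
}

// Server-side handler: length of the NUL-terminated string in the buffer.
int64_t handle_strlen(const unsigned char *data) {
  const void *nul = std::memchr(data, 0, sizeof(Packet::data));
  if (!nul)
    return static_cast<int64_t>(sizeof(Packet::data));
  return static_cast<int64_t>(static_cast<const unsigned char *>(nul) - data);
}

// The "syscall" table: the opcode indexes straight into it.
constexpr Handler kTable[OP_COUNT] = {handle_add, handle_strlen};

// Dispatch one packet; unknown opcodes return -1.
int64_t dispatch(const Packet &p) {
  if (p.opcode >= OP_COUNT)
    return -1;
  return kTable[p.opcode](p.data);
}

// Client-side packing: memcpy the arguments into the packet.
Packet pack_add(int64_t a, int64_t b) {
  Packet p{};
  p.opcode = OP_ADD;
  int64_t args[2] = {a, b};
  std::memcpy(p.data, args, sizeof(args));
  return p;
}

// Client-side packing: copy a string, truncating so a trailing NUL remains.
Packet pack_string(const char *s) {
  Packet p{}; // zero-initialised, so the last byte stays NUL
  p.opcode = OP_STRLEN;
  std::strncpy(reinterpret_cast<char *>(p.data), s, sizeof(p.data) - 1);
  return p;
}
```

For example, `dispatch(pack_add(40, 2))` unpacks the two integers on the "server" side and returns their sum. The sketch leaves out the synchronisation entirely; in practice the packet would sit behind the kind of atomic handoff described above.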

There's also nothing in here which random GPU devs can't build themselves. The header files are (now) self-contained if people would like to use the same mechanism for other functionality and don't want to hand-roll the data structure. The most subtle part is getting this to work correctly under arbitrary warp divergence on Volta. It should be an out-of-the-box thing under OpenMP early next year too.


> Wonderfully you don't need to trust my words, you've got my code :)

My friend it's so incredibly bold of you to claim credit for this work when

1. Joe presented it

2. Joe's name is the only name on the git blame

3. I know Joe and I know he did the lion's share of the work

And so I'll repeat: Joe himself calls it rpc so I'm gonna keep calling it rpc and not syscall.

The RPC implementation in LLVM is an adaptation of Jon's original state machine (see https://github.com/JonChesterfield/hostrpc). It looks very different at this point, but we collaborated on the initial design before I fleshed out everything else. Syscall or not is a bit of a semantic argument, but I lean more towards syscall 'inspired'.

Here's the algorithm: https://doi.org/10.1145/3458744.3473357. My paper with Joseph on the implementation is at https://doi.org/10.1007/978-3-031-40744-4_15.

The syscall layer this runs on was written at https://github.com/JonChesterfield/hostrpc, 800 commits from May 2020 until Jan 2023. I deliberately wrote that in the open, false paths and mistakes and all. Took ages for a variety of reasons, not least that this was my side project.

You'll find the upstream of that scattered across the commits to libc, mostly authored by Joseph (log shows 300 for him, of which I reviewed 40, and 25 for me). You won't find the phone calls and offline design discussions. You can find the tricky Volta solution at https://reviews.llvm.org/D159276 and the initial patch to LLVM at https://reviews.llvm.org/D145913.

GPU libc is definitely Joseph's baby, not mine, and this wouldn't be in trunk if he hadn't stubbornly fought through the headwinds to get it there. I'm excited to see it generating some discussion on here.

But yeah, I'd say the syscall implementation we're discussing here has my name adequately written on it to describe it as "my code".