This is a great question. Since ROCm is pure userspace, it's never strictly necessary - make the syscalls yourself and the driver in the Linux kernel will do the same things it would have done for ROCm.

In practice if you go down that road on discrete GPU systems, allocating "fine grain" memory so you can talk to the GPU is probably the most tedious part of the setup. I gave up around there. An APU should be indifferent to that though.

There will be some setup to associate your CPU process with the GPU. It's permissions-style stuff, since Linux doesn't let processes stomp on each other. That might be rather minimal and should be spelled out in roct.
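As a rough idea of where that setup starts: as far as I know, roct begins by opening the KFD device node, and everything else is ioctls on that file descriptor. A minimal sketch in C, assuming the node lives at the usual `/dev/kfd` path. `open_kfd` and `explain_kfd_error` are names made up for this example, not roct functions.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open the KFD compute device node. Returns a file descriptor the
 * caller must close(), or -1 with errno set. The path is an assumption:
 * it is where the amdgpu driver normally exposes KFD. */
static int open_kfd(void) {
    return open("/dev/kfd", O_RDWR);
}

/* Hypothetical helper: turn the errno left by open_kfd() into a hint. */
static const char *explain_kfd_error(int err) {
    switch (err) {
    case ENOENT:
        return "no /dev/kfd: amdgpu not loaded, or built without KFD";
    case EACCES:
        return "permission denied: check the group that owns /dev/kfd";
    default:
        return "unexpected error opening /dev/kfd";
    }
}
```

If `open_kfd()` returns -1, passing `errno` to `explain_kfd_error` gives a first guess at whether it's a driver problem or a permissions problem.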

Launching a kernel involves finding the part of the address space the GPU is watching, writing 64 bytes to it and then "ringing a doorbell" which is probably writing to a different magic address. There's a lot of cruft in the API from earlier generations where these things involved a lot of work.
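To make the "write 64 bytes, ring a doorbell" part concrete, here's a host-only sketch in C. The struct follows the field layout of the kernel dispatch packet in hsa.h as I understand it. The queue and doorbell are hypothetical stand-ins (`mock_queue`, `submit`) living in ordinary memory, so this shows the ordering of the writes, not real hardware access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Field layout modelled on hsa_kernel_dispatch_packet_t from hsa.h. */
typedef struct {
    uint16_t header;            /* packet type and fence bits, written last */
    uint16_t setup;             /* encodes the number of grid dimensions */
    uint16_t workgroup_size_x;
    uint16_t workgroup_size_y;
    uint16_t workgroup_size_z;
    uint16_t reserved0;
    uint32_t grid_size_x;
    uint32_t grid_size_y;
    uint32_t grid_size_z;
    uint32_t private_segment_size;
    uint32_t group_segment_size;
    uint64_t kernel_object;     /* address of the kernel's code descriptor */
    uint64_t kernarg_address;   /* address of the argument buffer */
    uint64_t reserved2;
    uint64_t completion_signal;
} dispatch_packet;

_Static_assert(sizeof(dispatch_packet) == 64, "packet must be 64 bytes");

/* Hypothetical stand-in for a user-mode queue: a ring of packets the GPU
 * would be watching, plus the "magic address" used as a doorbell. */
typedef struct {
    dispatch_packet *ring;
    uint32_t slots;             /* must be a power of two */
    uint64_t write_index;
    _Atomic uint64_t *doorbell;
} mock_queue;

/* Copy one packet into the next ring slot and ring the doorbell.
 * Returns the index of the packet that was written. */
static uint64_t submit(mock_queue *q, const dispatch_packet *pkt) {
    assert(q->slots != 0 && (q->slots & (q->slots - 1)) == 0);
    uint64_t idx = q->write_index++;
    dispatch_packet *slot = &q->ring[idx & (q->slots - 1)];

    /* Write the body first and the header last, so a consumer polling
     * the header never sees a half-written packet. */
    memcpy((unsigned char *)slot + sizeof slot->header,
           (const unsigned char *)pkt + sizeof pkt->header,
           sizeof *pkt - sizeof pkt->header);
    atomic_thread_fence(memory_order_release);
    slot->header = pkt->header;

    /* "Ring the doorbell": publish the index of the new packet. */
    atomic_store_explicit(q->doorbell, idx, memory_order_release);
    return idx;
}
```

On real hardware the ring and the doorbell would be mappings handed back by the driver rather than plain arrays, and this sketch doesn't check whether the ring is full.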

Game plan for finding out goes something like:

  1. Compile some GPU code and put it in the host process
  2. Make the calls into hsa.h to run that kernel
  3. Delete everything unused from hsa to get an equivalent that only uses roct
  4. Delete everything unused from roct to get the raw syscalls

Roct is a small C library that implements the userspace side of the kernel driver. I'd be inclined to link it into your application instead of dropping it entirely, but ymmv. Rocr / HSA is a larger C++ library that has a lot more moving parts and is more tempting to drop from the dependency graph.

Going beyond that, you could build a simplified version of the kernel driver that drops all the other hardware. Might make things better, might not. And beyond that there's the firmware on the GPU which might be getting more accessible soon, but iiuc is written in assembly so might not be that much fun to hack on. And beyond that you're on the silicon, where changing it is making a different chip really.