We wrote entire NVIDIA, AMD, and QCOM drivers in that style.
https://github.com/tinygrad/tinygrad/blob/master/tinygrad/ru...
Those drivers are faster than anything else when used to run fixed command queues (which is what neural network runs are).