"Currently, I am working on [...] direct CUDA implementation, which will be significantly faster and probably come close to PyTorch."
I wonder whether it would also work well with GCC's OpenMP offloading to PTX (the nvptx target), though.